Programming Every Example: Lifting Pre-training Data Quality like Experts at Scale
Abstract
Large language model pre-training has traditionally relied on human experts to craft heuristics for improving corpus quality, resulting in numerous rules developed to date. However, these rules lack the flexibility to address the unique characteristics of each individual example effectively. Meanwhile, applying tailored rules to every example is impractical for human experts. In this paper, we demonstrate that even small language models, with as few as 0.3B parameters, can exhibit substantial data-refining capabilities comparable to those of human experts. We introduce Programming Every Example (ProX), a novel framework that treats data refinement as a programming task, enabling models to refine corpora by generating and executing fine-grained operations, such as string normalization, for each individual example at scale. Experimental results show that models pre-trained on ProX-curated data outperform those trained on either the original data or data filtered by other selection methods by more than 2% across various downstream benchmarks. Its effectiveness spans various model sizes and pre-training corpora, including C4, RedPajama-V2, and FineWeb. Furthermore, ProX exhibits significant potential in domain-specific continual pre-training: without any domain-specific design, models trained on OpenWebMath refined by ProX outperform human-crafted rule-based methods, improving average accuracy by 7.6% over Mistral-7B, 14.6% over Llama-2-7B, and 20.3% over CodeLlama-7B, all within 10B training tokens, making them comparable to models like Llemma-7B trained on 200B tokens. Further analysis highlights that ProX significantly saves training FLOPs, offering a promising path for efficient LLM pre-training. We are open-sourcing ProX together with a refined corpus of more than 100B tokens, model checkpoints, and all training and implementation details for reproducible research and future innovation. Code: https://github.com/GAIR-NLP/ProX
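To make the core idea concrete, here is a minimal sketch of how "data refinement as programming" can work: a small refining model emits a short per-example program of cleaning operations, which is then executed on that document. The operation names, program format, and `refine` helper below are illustrative assumptions for this sketch, not the actual ProX interface; see the linked codebase for the real implementation.

```python
import re
from typing import Callable

def normalize_whitespace(doc: str) -> str:
    """Collapse runs of spaces/tabs and strip trailing whitespace per line."""
    return "\n".join(re.sub(r"[ \t]+", " ", line).rstrip() for line in doc.splitlines())

def remove_lines(doc: str, patterns: list[str]) -> str:
    """Drop lines matching any regex pattern (e.g., navigation bars, ads)."""
    compiled = [re.compile(p) for p in patterns]
    return "\n".join(l for l in doc.splitlines() if not any(c.search(l) for c in compiled))

def drop_document(doc: str) -> str:
    """Discard the whole document (empty string serves as a sentinel)."""
    return ""

# Registry mapping operation names (as a refining model might emit them)
# to executable functions.
OPS: dict[str, Callable[..., str]] = {
    "normalize_whitespace": normalize_whitespace,
    "remove_lines": remove_lines,
    "drop_document": drop_document,
}

def refine(doc: str, program: list[dict]) -> str:
    """Execute a per-example 'program': a list of {op, kwargs} steps."""
    for step in program:
        doc = OPS[step["op"]](doc, **step.get("kwargs", {}))
        if not doc:  # document was dropped entirely
            break
    return doc

if __name__ == "__main__":
    raw = "Click here to subscribe!\nProX   treats data refinement as programming.   \n"
    # In ProX, a small LM generates such a program per example; here it is hard-coded.
    program = [
        {"op": "remove_lines", "kwargs": {"patterns": [r"(?i)click here|subscribe"]}},
        {"op": "normalize_whitespace"},
    ]
    print(refine(raw, program))
```

Generating a program rather than edited text keeps the refinement cheap to execute, auditable, and applicable at corpus scale, which is what lets a 0.3B-parameter model do the work.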
Community
Excited to share our latest research project ProX!🫐
Enjoy our initial release with all the data and model checkpoints🤗!
- Data: 👉 see this collection
Our general models are trained on <50B tokens, making them comparable to TinyLlama and OLMo!
- General Models Pre-Trained from Scratch: https://huggingface.co/collections/gair-prox/prox-general-models-65f1674f0607712c4d6eec76
Our math models improve by up to 20% within 10B training tokens!
- Continual Pre-trained Math Models: https://huggingface.co/collections/gair-prox/prox-math-models-66e92c3e5d54b27612286eb9
Other artifacts:
- Training Codebase: https://github.com/GAIR-NLP/ProX
- X thread: https://x.com/FaZhou_998/status/1839154742439850131
- Project Page: https://gair-nlp.github.io/ProX
We've released:
- the codebase for large-scale refining at: https://github.com/GAIR-NLP/ProX
- the refining models at: https://huggingface.co/collections/gair-prox/prox-refining-models-6707cf820a16d830fbf434dd
Cool, interesting work!
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Enhancing Discriminative Tasks by Guiding the Pre-trained Language Model with Large Language Model's Experience (2024)
- Maximizing V-information for Pre-training Superior Foundation Models (2024)
- AquilaMoE: Efficient Training for MoE Models with Scale-Up and Scale-Out Strategies (2024)
- PAT: Pruning-Aware Tuning for Large Language Models (2024)
- A Few-Shot Approach for Relation Extraction Domain Adaptation using Large Language Models (2024)
Models citing this paper: 16
Datasets citing this paper: 4
Spaces citing this paper: 0