Generative AI for Math: Part I -- MathPile: A Billion-Token-Scale Pretraining Corpus for Math
Abstract
High-quality, large-scale corpora are the cornerstone of building foundation models. In this work, we introduce MathPile, a diverse and high-quality math-centric corpus comprising about 9.5 billion tokens. Throughout its creation, we adhered to the principle of "less is more", firmly believing in the supremacy of data quality over quantity, even in the pre-training phase. Our meticulous data collection and processing efforts included a complex suite of preprocessing, prefiltering, language identification, cleaning, filtering, and deduplication, ensuring the high quality of our corpus. Furthermore, we performed data contamination detection on downstream benchmark test sets to eliminate duplicates. We hope our MathPile can help enhance the mathematical reasoning abilities of language models. We plan to open-source different versions of MathPile along with the scripts used for processing, to facilitate future developments in this field.
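To make the contamination-detection step concrete, here is a minimal sketch of the general technique of exact n-gram overlap screening against benchmark test sets. This is only an illustration, not the authors' implementation; the names, example strings, and choice of n are made up.

def ngrams(text, n=5):
    # Word-level n-grams of a string, returned as a set
    words = text.split()
    return {' '.join(words[i:i + n]) for i in range(len(words) - n + 1)}

# Index every n-gram that appears in a benchmark test set
test_set = ["Evaluate the integral of sin(x) from 0 to pi."]
benchmark_ngrams = set().union(*(ngrams(q) for q in test_set))

# Flag corpus documents that share any n-gram with the test set
corpus = ["We compute the integral of sin(x) from 0 to pi, which equals 2."]
flagged = [doc for doc in corpus if not ngrams(doc).isdisjoint(benchmark_ngrams)]
print(flagged)  # the overlapping document is caught

Real pipelines typically normalize case and punctuation before hashing the n-grams, but the core idea is the same: any exact n-gram shared with a test set marks the document for removal.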
Community
Hi, our data is now open-sourced at https://huggingface.co/datasets/GAIR/MathPile.
We sincerely hope to receive your feedback and suggestions for this work, including but not limited to feedback on data quality, comments on the paper, and discussions on technical details, among other aspects. Please feel free to leave any comments below.
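For anyone who wants to start poking at the data, a minimal loading sketch (the split name and streaming flag are assumptions on my part; check the dataset card for the exact layout):

from datasets import load_dataset

# Stream MathPile from the Hub. The dataset is gated: accept the terms on
# the dataset page and log in with `huggingface-cli login` first.
dataset = load_dataset("GAIR/MathPile", split="train", streaming=True)
print(next(iter(dataset)))  # inspect one record and its fields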
Why CC-BY-NC-SA?
Hi, because some of the source documents are licensed for non-commercial use only, we have released our work under the CC BY-NC-SA 4.0 license. Are you looking for a dataset that is more friendly for commercial use?
I would recommend it. Creative Commons has great licenses, but I would strongly consider allowing commercial use as well.
Thanks for your feedback. The version for commercial use is coming soon! :)
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Oasis: Data Curation and Assessment System for Pretraining of Large Language Models (2023)
- Ziya2: Data-centric Learning is All LLMs Need (2023)
- Paloma: A Benchmark for Evaluating Language Model Fit (2023)
- YUAN 2.0: A Large Language Model with Localized Filtering-based Attention (2023)
- YAYI 2: Multilingual Open-Source Large Language Models (2023)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space.
Hi all, a commercial-use version of MathPile is coming. We are still deciding which license to use for it. Any recommendations?
Hi all, the commercial-use version of MathPile is out; it is available at https://huggingface.co/datasets/GAIR/MathPile_Commercial
Feel free to train your models and build (commercial) applications. Any feedback is welcome.
When I was trying to preprocess the data, I encountered a KeyError.
This code was working:
from datasets import load_dataset
import re

# dataset was loaded earlier; something like this (adjust to the actual config)
dataset = load_dataset("GAIR/MathPile", split="train", streaming=True)

def preprocess_latex(document):
    # Strip LaTeX comments
    document = re.sub(r'%.*', '', document)
    # Strip LaTeX commands and an optional braced argument, e.g. \textbf{...}
    document = re.sub(r'\\[a-zA-Z]+(\{[^}]*\})?', '', document)
    # Collapse runs of whitespace
    document = re.sub(r'\s+', ' ', document).strip()
    return document
# Preprocess and write to a temporary file
temp_file_path = "temp_preprocessed_dataset.txt"
with open(temp_file_path, 'w', encoding='utf-8') as f:
    for example in dataset:
        # Assuming 'text' is the key containing LaTeX content; adjust if necessary
        preprocessed_text = preprocess_latex(example['text'])
        f.write(preprocessed_text + '\n')
It gave me a temp_preprocessed_dataset.txt file with 17 GB of data, and then it stopped and produced this:
KeyError Traceback (most recent call last)
Cell In[11], line 16
13 with open(temp_file_path, 'w', encoding='utf-8') as f:
14 for example in dataset:
15 # Assuming 'text' is the key containing LaTeX content; adjust if necessary
---> 16 preprocessed_text = preprocess_latex(example['text'])
17 f.write(preprocessed_text + '\n')
KeyError: 'text'
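If the dataset is streamed, the schema can differ across the underlying files, so some records may genuinely lack a 'text' field. A quick way to see which key sets actually occur before preprocessing everything (a sketch that reuses the `dataset` variable from above; the sample size is arbitrary):

from collections import Counter

# Tally the distinct key sets over a sample of records to find the field
# that actually holds the document body
key_sets = Counter()
for i, example in enumerate(dataset):
    key_sets[tuple(sorted(example.keys()))] += 1
    if i >= 100000:
        break
print(key_sets.most_common())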
I logged the problematic entries to a separate error file using the code below; that file ended up at 1.02 GB.
with open(temp_file_path, 'w', encoding='utf-8') as f, open('errors_log.txt', 'w', encoding='utf-8') as error_log:
    for example in dataset:
        if 'text' not in example:
            # Log the problematic example for further investigation
            error_log.write(str(example) + '\n')
            continue
        preprocessed_text = preprocess_latex(example['text'])
        f.write(preprocessed_text + '\n')
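One small tweak: str(example) writes a Python repr, which is awkward to parse later. If you plan to inspect the log programmatically, json.dumps keeps each line machine-readable (assuming the values are JSON-serializable); e.g., with import json at the top, replace the logging line with:

error_log.write(json.dumps(example, ensure_ascii=False) + '\n')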
When I go to that page, I still see references to CC BY-NC-SA 4.0, which is a non-commercial license.
" You need to agree to share your contact information to access this dataset
This repository is publicly accessible, but you have to accept the conditions to access its files and content.
By using this data, you agree to comply with the original usage licenses of all sources contributing to MathPile. If the source data of this dataset is subject to a more restrictive license than CC BY-NC-SA 4.0, then this dataset conforms to that more stringent licensing. In all other scenarios, it is governed by the CC BY-NC-SA 4.0 license. Access to this dataset is granted automatically once you accept the license terms and complete all the required fields below.
By agreeing you accept to share your contact information (email and username) with the repository authors."
Sorry for the late reply and the confusing README. It has been fixed; feel free to take a look.