arxiv:2312.17120

Generative AI for Math: Part I -- MathPile: A Billion-Token-Scale Pretraining Corpus for Math

Published on Dec 28, 2023
· Submitted by akhaliq on Dec 29, 2023
#3 Paper of the day

Abstract

High-quality, large-scale corpora are the cornerstone of building foundation models. In this work, we introduce MathPile, a diverse and high-quality math-centric corpus comprising about 9.5 billion tokens. Throughout its creation, we adhered to the principle of "less is more", firmly believing in the supremacy of data quality over quantity, even in the pre-training phase. Our meticulous data collection and processing efforts included a complex suite of preprocessing, prefiltering, language identification, cleaning, filtering, and deduplication, ensuring the high quality of our corpus. Furthermore, we performed data contamination detection on downstream benchmark test sets to eliminate duplicates. We hope our MathPile can help to enhance the mathematical reasoning abilities of language models. We plan to open-source different versions of MathPile with the scripts used for processing, to facilitate future developments in this field.

Community

Paper author

Hi, our data is now open source at https://huggingface.co/datasets/GAIR/MathPile.

We sincerely hope to receive your feedback and suggestions for this work, including but not limited to feedback on data quality, comments on the paper, and discussions on technical details, among other aspects. Please feel free to leave any comments below.

Why CC-BY-NC-SA?

Paper author

Hi, because some documents are licensed for non-commercial use only, we have released our work under the CC BY-NC-SA 4.0 license. Are you looking for a dataset that is more friendly to commercial use?

I would recommend it. Creative Commons has great licenses, but I would strongly consider allowing commercial use as well.

Paper author

Thanks for your feedback. The version for commercial use is coming soon! :)

Paper author

Hi all, the commercial-use version of MathPile is coming. We are considering which licenses to use for it. Any recommendations?

Paper author

Hi all, the commercial-use version of MathPile is out. It is available at https://huggingface.co/datasets/GAIR/MathPile_Commercial

Feel free to train your models and build (commercial) applications. Any feedback is welcome.

When I was trying to preprocess the data, I encountered a KeyError.

This code was working:

from datasets import load_dataset
import re

def preprocess_latex(document):
    # Strip LaTeX comments (everything from % to the end of the line)
    document = re.sub(r'%.*', '', document)
    # Strip LaTeX commands such as \section{...}, with an optional braced argument
    document = re.sub(r'\\[a-zA-Z]+(\{[^}]*\})?', '', document)
    # Collapse runs of whitespace into single spaces
    document = re.sub(r'\s+', ' ', document).strip()
    return document

# Preprocess and write to a temporary file
# (`dataset` was loaded earlier with load_dataset; that call is not shown here)
temp_file_path = "temp_preprocessed_dataset.txt"
with open(temp_file_path, 'w', encoding='utf-8') as f:
    for example in dataset:
        # Assuming 'text' is the key containing LaTeX content; adjust if necessary
        preprocessed_text = preprocess_latex(example['text'])
        f.write(preprocessed_text + '\n')

It produced a temp_preprocessed_dataset.txt file with 17 GB of data, then stopped on its own with this error:


KeyError Traceback (most recent call last)
Cell In[11], line 16
13 with open(temp_file_path, 'w', encoding='utf-8') as f:
14 for example in dataset:
15 # Assuming 'text' is the key containing LaTeX content; adjust if necessary
---> 16 preprocessed_text = preprocess_latex(example['text'])
17 f.write(preprocessed_text + '\n')

KeyError: 'text'

To investigate, I logged the problematic entries to a separate file using the code below. That file came to 1.02 GB.

with open(temp_file_path, 'w', encoding='utf-8') as f, open('errors_log.txt', 'w', encoding='utf-8') as error_log:
    for example in dataset:
        if 'text' not in example:
            # Log the problematic example for further investigation
            error_log.write(str(example) + '\n')
            continue
        preprocessed_text = preprocess_latex(example['text'])
        f.write(preprocessed_text + '\n')
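
A 1 GB error log is hard to read by eye. One possible way to narrow the problem down, offered only as a sketch, is to tally which field names the entries without 'text' actually carry, rather than dumping each entry in full. The function name tally_missing_key_schemas and the sample records below are made up for illustration. They are not taken from MathPile, and the real field names may well differ.

```python
from collections import Counter

def tally_missing_key_schemas(records, key="text"):
    """Count the field-name combinations of records that lack `key`."""
    schemas = Counter()
    for record in records:
        if key not in record:
            # Sort the field names so the same schema always maps to the same tuple
            schemas[tuple(sorted(record))] += 1
    return schemas

# Made-up records standing in for dataset rows (hypothetical field names)
sample_records = [
    {"text": "Let $x$ be real.", "subset": "hypothetical_a"},
    {"content": "A proof sketch.", "subset": "hypothetical_b"},
    {"content": "Another entry.", "subset": "hypothetical_b"},
    {"question": "What is 2+2?", "answers": ["4"]},
]

missing = tally_missing_key_schemas(sample_records)
for schema, count in missing.most_common():
    print(count, schema)
```

Passing the real dataset iterable in place of sample_records should show, in a few lines of output, whether the entries without 'text' share one alternative schema or several, which would suggest which key to read instead.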

When I go to that page, I still see references to CC BY-NC-SA 4.0, which is a non-commercial license.

"You need to agree to share your contact information to access this dataset

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

By using this data, you agree to comply with the original usage licenses of all sources contributing to MathPile. If the source data of this dataset is subject to a more restrictive license than CC BY-NC-SA 4.0, then this dataset conforms to that more stringent licensing. In all other scenarios, it is governed by the CC BY-NC-SA 4.0 license. Access to this dataset is granted automatically once you accept the license terms and complete all the required fields below.

By agreeing you accept to share your contact information (email and username) with the repository authors."

Paper author

Sorry for the late reply and the confusing README. It has been fixed. You are welcome to check it.
