fastText model used for filtering in DataComp-LM to produce DCLM-Baseline.
The model classifies documents as either __label__hq or __label__cc, which correspond to "high-quality" data (i.e., OH2.5 and Reddit ELI5) and "low-quality" data (i.e., web-crawled text from Common Crawl), respectively. We use the score assigned to __label__hq to filter our documents via a percentile-based threshold.
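As a rough illustration of what a percentile-based threshold means here, the sketch below keeps the top fraction of documents ranked by their __label__hq score. The function names and the keep_fraction parameter are hypothetical and chosen for this example only; the actual filtering code and the fraction we used are documented in the dclm repo.

```python
import math


def percentile_threshold(scores, keep_fraction):
    """Return the cutoff score such that roughly `keep_fraction` of
    the scores are greater than or equal to it.

    `keep_fraction` is a hypothetical parameter for this sketch.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    if not scores:
        raise ValueError("scores must be non-empty")
    ranked = sorted(scores, reverse=True)
    # Number of documents to keep, always at least one.
    n_keep = max(1, math.ceil(keep_fraction * len(ranked)))
    return ranked[n_keep - 1]


def filter_by_hq_score(docs, scores, keep_fraction):
    """Keep the documents whose __label__hq score reaches the cutoff."""
    cutoff = percentile_threshold(scores, keep_fraction)
    return [doc for doc, score in zip(docs, scores) if score >= cutoff]
```

Note that ties at the cutoff are all kept, so the retained set can be slightly larger than keep_fraction of the input.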
See our dclm repo for documentation about how we applied it to filter data in our experiments.
See the fastText documentation for general information on fastText classifiers and how to use them from Python.
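For orientation, here is a minimal sketch of scoring a single document with the fastText Python bindings. It assumes the fasttext package is installed and that the model file has been downloaded locally; the helper names and the model path argument are hypothetical, and replacing newlines with spaces is an assumption made because fastText's predict processes one line at a time. The preprocessing used in our experiments is described in the dclm repo.

```python
def load_classifier(model_path):
    """Load a fastText model from a local file.

    `model_path` is a placeholder for wherever the model file was saved.
    """
    import fasttext  # imported lazily so the scoring helper has no hard dependency

    return fasttext.load_model(model_path)


def hq_score(model, text):
    """Return the probability the classifier assigns to __label__hq."""
    # fastText's predict rejects input containing newlines,
    # so flatten the document into a single line first (an assumption).
    line = text.replace("\n", " ")
    # k=-1 asks for all labels, so __label__hq is present even when
    # it is not the top prediction.
    labels, probs = model.predict(line, k=-1)
    scores = dict(zip(labels, probs))
    return float(scores.get("__label__hq", 0.0))
```

A typical call would look like hq_score(load_classifier(path), document_text), where path points at the downloaded model file.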