Fasttext model used for filtering in DataComp-LM to produce DCLM-Baseline.

The model classifies between __label__hq and __label__cc which correspond to "high-quality" (i.e., OH2.5 and Reddit ELI5 data) and "low-quality" (i.e., web-crawled data from Common Crawl) respectively. We use the score given to __label__hq to filter our documents via a percentile-based threshold.
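The percentile-based threshold described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the dclm repo: the function names and the `keep_fraction` parameter are hypothetical, and the scores are assumed to be the `__label__hq` probabilities already computed for each document.

```python
import math


def percentile_threshold(scores, keep_fraction):
    """Return the cutoff such that about `keep_fraction` of scores are >= it.

    `keep_fraction` is a hypothetical parameter for this sketch.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    if not 0 < keep_fraction <= 1:
        raise ValueError("keep_fraction must be in (0, 1]")
    ranked = sorted(scores, reverse=True)
    # Keep at least one document, rounding the kept count up.
    n_keep = max(1, math.ceil(keep_fraction * len(ranked)))
    return ranked[n_keep - 1]


def filter_by_hq_score(docs, scores, keep_fraction):
    """Keep the documents whose __label__hq score is at or above the cutoff."""
    cutoff = percentile_threshold(scores, keep_fraction)
    return [doc for doc, score in zip(docs, scores) if score >= cutoff]


docs = ["a", "b", "c", "d"]
scores = [0.9, 0.1, 0.5, 0.7]  # made-up __label__hq scores for illustration
kept = filter_by_hq_score(docs, scores, keep_fraction=0.5)
```

Because the cutoff is derived from the score distribution of the corpus itself, the fraction of documents kept stays roughly constant even if the classifier's raw scores shift between data sources.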

See our dclm repo for documentation on how we applied this model to filter data in our experiments.

See the fastText documentation for general information on fastText classifiers and how to use them from Python.
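As a rough sketch of scoring a single document with the fastText Python bindings, assuming the model file has been downloaded locally (the helper names and the path argument are hypothetical; only `fasttext.load_model` and `model.predict` are fastText calls):

```python
def load_classifier(model_path):
    """Load a fastText model from a local path (path is an assumption)."""
    import fasttext  # imported lazily so the helpers below work without it

    return fasttext.load_model(model_path)


def hq_score(model, text):
    """Return the probability the model assigns to __label__hq for `text`."""
    # fastText's predict handles one line at a time, so newlines are replaced.
    labels, probs = model.predict(text.replace("\n", " "), k=2)
    for label, prob in zip(labels, probs):
        if label == "__label__hq":
            return float(prob)
    # The label was not among the returned predictions.
    return 0.0
```

Requesting `k=2` returns both labels, so the `__label__hq` score is available even when `__label__cc` is the top prediction. A typical call would be `hq_score(load_classifier(path), document_text)`, where `path` points to wherever the model file was saved.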
