---
pipeline_tag: text-classification
library_name: fasttext
---

📑 [Paper](https://arxiv.org/abs/2503.00808)    |    🔨 fastText Classifier    |    🤗 [Released Dataset](https://huggingface.co/datasets/hkust-nlp/PreSelect-100B)    |    📦 [Repo](https://github.com/hkust-nlp/preselect)

## Model Summary

This is a fastText-based binary classifier for identifying high-quality pretraining data, introduced in the paper [Predictive Data Selection: The Data That Predicts Is the Data That Teaches](https://arxiv.org/abs/2503.00808). It is also the classifier we used to build the [PreSelect-100B](https://huggingface.co/datasets/hkust-nlp/PreSelect-100B) dataset, with a selection threshold of 10% (keeping the top-scoring 10% of documents). The positive and negative labels are `__label__1` and `__label__0`, respectively.

## How to use

You can refer to the paper's code repo to run the filtering with any fastText model, or simply use [datatrove](https://github.com/huggingface/datatrove):

```python
import argparse
from pathlib import Path

from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.filters import FastTextClassifierFilter
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.writers.jsonl import JsonlWriter

parser = argparse.ArgumentParser("Filter")
parser.add_argument("--input_path", type=str, help="directory containing the input .jsonl files")
parser.add_argument("--output_path", type=str, help="directory for the filtered output")
args = parser.parse_args()

Path(args.output_path).mkdir(parents=True, exist_ok=True)

dist_executor = LocalPipelineExecutor(
    skip_completed=False,
    pipeline=[
        # Read JSONL documents, taking the document body from the "text" field.
        JsonlReader(args.input_path, text_key="text", default_metadata={}),
        # Keep documents that the classifier labels "1" (high quality)
        # with probability >= 0.5.
        FastTextClassifierFilter("PreSelect-classifier.bin", keep_labels=[("1", 0.5)]),
        # Write the surviving documents as uncompressed JSONL.
        JsonlWriter(args.output_path, compression=None),
    ],
    tasks=100,
)
dist_executor.run()
```

If you prefer to score documents directly with the fastText package instead of datatrove, see the sketch at the end of this card.

## Training

For more training details, refer to the paper; the training code is available on GitHub: [PreSelect](https://github.com/hkust-nlp/preselect).

## Citation

If you find this work helpful, please kindly cite it as:

```
@article{shum2025predictivedataselectiondata,
  title   = {Predictive Data Selection: The Data That Predicts Is the Data That Teaches},
  author  = {Kashun Shum and Yuzhen Huang and Hongjian Zou and Ding Qi and Yixuan Liao and Xiaoxin Chen and Qian Liu and Junxian He},
  journal = {arXiv preprint arXiv:2503.00808},
  year    = {2025},
  eprint  = {2503.00808},
}
```
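
## Scoring documents directly

If you only want to score individual documents, you can load the classifier with the fastText Python package instead of running a datatrove pipeline. The snippet below is a minimal sketch, not the official pipeline: `docs` and `quality_score` are placeholder names for illustration, and the top-10% cutoff mirrors the selection threshold used for PreSelect-100B.

```python
import fasttext

# Load the released classifier weights.
model = fasttext.load_model("PreSelect-classifier.bin")

def quality_score(text: str) -> float:
    """Probability the classifier assigns to the positive label __label__1."""
    # fastText's predict() handles one line at a time, so strip newlines first.
    labels, probs = model.predict(text.replace("\n", " "), k=2)
    return dict(zip(labels, probs)).get("__label__1", 0.0)

# Placeholder corpus; replace with your own documents.
docs = ["first example document ...", "second example document ..."]

# Rank by classifier score and keep the top 10% of documents.
ranked = sorted(docs, key=quality_score, reverse=True)
kept = ranked[: max(1, len(docs) // 10)]
```

Note that the fixed 0.5 probability cutoff in the datatrove script above and the top-10% ranking here are two different ways of applying the same scores; choose whichever matches your selection budget.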