metadata

license: mit
datasets:
  - kenhktsui/llm-data-quality
language:
  - en
library_name: fasttext
pipeline_tag: text-classification

llm-data-textbook-quality-fasttext-classifier-v1

This model is built on fastText. It is an optimised version of llm-data-textbook-quality-classifier-v1.
Not only does it achieve a higher F1 score, it can also classify more than 2,000 examples per second on a CPU.
The model classifies whether a text is of textbook quality. It can be used as a filter for data curation when training an LLM.
Please note that textbook quality is a subset of high quality.

Model Performance

Dataset	F1 Score
Train	0.8695
Test	0.8485

Usage

from typing import List
import re
from huggingface_hub import hf_hub_download
import fasttext


model = fasttext.load_model(hf_hub_download("kenhktsui/llm-data-textbook-quality-fasttext-classifer-v1", "model.bin"))


def replace_newlines(text: str) -> str:
  return re.sub("\n+", " ", text)


def predict(text_list: List[str]) -> List[dict]:
  text_list = [replace_newlines(text) for text in text_list]
  pred = model.predict(text_list)
  # str.lstrip removes a set of characters, not a prefix, so use removeprefix (Python 3.9+).
  return [{"label": l[0].removeprefix("__label__"), "score": s[0]}
           for l, s in zip(*pred)]


predict(["Hi"])
# Output: [{'label': 'LOW_QUALITY', 'score': 1.00001}]
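
Since the model is meant to be used as a data-curation filter, the list of `{"label", "score"}` dicts returned by `predict` can be turned into a keep/drop decision. The following is a minimal sketch, not the author's pipeline: `filter_textbook_quality`, `min_score`, and the stand-in `fake_predict` are hypothetical names introduced here, and the 0.8 threshold is an arbitrary illustration that should be tuned on your own data.

```python
from typing import Callable, Dict, List


def filter_textbook_quality(
    texts: List[str],
    predict_fn: Callable[[List[str]], List[Dict]],
    min_score: float = 0.5,
) -> List[str]:
    """Keep texts whose top label is HIGH_QUALITY with score >= min_score."""
    preds = predict_fn(texts)
    return [
        text
        for text, pred in zip(texts, preds)
        if pred["label"] == "HIGH_QUALITY" and pred["score"] >= min_score
    ]


# Stand-in predictor (an assumption for illustration) so the sketch runs
# without downloading the model. It mimics the output shape of `predict`
# from the usage snippet; in practice, pass that `predict` function instead.
def fake_predict(texts: List[str]) -> List[Dict]:
    return [
        {"label": "HIGH_QUALITY" if len(t.split()) > 5 else "LOW_QUALITY",
         "score": 0.9}
        for t in texts
    ]


docs = [
    "Hi",
    "Photosynthesis converts light energy into chemical energy in plants.",
]
kept = filter_textbook_quality(docs, fake_predict, min_score=0.8)
print(kept)
```

Passing the predictor as an argument keeps the filtering logic independent of how the model is loaded, which also makes it easy to test with a stub.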

Benchmark

Dataset	Sampling	Average Quality Score
nampdn-ai/tiny-orca-textbooks	Full	0.8350
nampdn-ai/tiny-textbooks	Full	0.7535
SciPhi/textbooks-are-all-you-need-lite	Full	0.7202
vikp/textbook_quality_programming	Full	0.5447
BEE-spoke-data/fineweb-100k_en-med	Full	0.4754
pszemraj/simple_wikipedia_LM	Full	0.4704
mattymchen/refinedweb-3m	Full	0.2963
JeanKaddour/minipile	Full	0.2562

The Average Quality Score is defined as the mean predicted probability of the HIGH_QUALITY label over a dataset. The resulting ranking matches expectations: the textbook datasets score the highest, which reflects the effectiveness of the model; Wikipedia scores lower because it is not written as a textbook; and web data scores the lowest.
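
The definition of the Average Quality Score can be sketched in a few lines. This is an illustration under stated assumptions rather than the script used to produce the table: it assumes predictions were obtained with `model.predict(texts, k=-1)`, so that every label's probability is returned for each example, and that the positive label is spelled `__label__HIGH_QUALITY`. The function name `average_quality_score` and the sample values are hypothetical.

```python
from typing import List, Sequence

# Assumed full label name, inferred from the "__label__" prefix handling
# in the usage snippet and the HIGH_QUALITY label named in the text.
HIGH_LABEL = "__label__HIGH_QUALITY"


def average_quality_score(
    labels: Sequence[Sequence[str]],
    probs: Sequence[Sequence[float]],
) -> float:
    """Mean probability of HIGH_QUALITY over all examples.

    `labels` and `probs` hold one sequence per example, in the shape
    that fastText returns from model.predict(list_of_texts, k=-1).
    """
    if len(labels) == 0:
        raise ValueError("no predictions given")
    total = 0.0
    for example_labels, example_probs in zip(labels, probs):
        by_label = dict(zip(example_labels, example_probs))
        # If HIGH_QUALITY is missing (e.g. k=1 was used), count it as 0.0.
        total += float(by_label.get(HIGH_LABEL, 0.0))
    return total / len(labels)


# Hand-written stand-in for the output of model.predict(texts, k=-1).
sample_labels: List[List[str]] = [
    ["__label__HIGH_QUALITY", "__label__LOW_QUALITY"],
    ["__label__LOW_QUALITY", "__label__HIGH_QUALITY"],
]
sample_probs: List[List[float]] = [[0.9, 0.1], [0.7, 0.3]]

score = average_quality_score(sample_labels, sample_probs)
print(round(score, 4))
```

Looking the probability up by label name, rather than taking the first entry, matters because fastText orders labels by descending probability, so HIGH_QUALITY is not always in the first position.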