Low Latency CPU Based Educational Value Classifier With Generic Educational Value

Community Article Published June 12, 2024

Ken Tsui, Huu Nguyen, Ontocord.AI

1. Motivation

There is an emerging trend of language models such as Phi-3[1], Llama3[2] and Mistral-7B[3] getting smaller while getting smarter. In particular, the Phi-3 Technical Report[1] introduced the "Data Optimal Regime", which focuses on the quality of data, in contrast with the "Compute Optimal Regime", whose focus is the optimal model size and number of training tokens.

Textbooks Are All You Need[4] developed a classifier to predict the educational value of code data and used it for data filtering, which significantly boosted model performance. Inspired by that work, our motivation is to build a lightweight classifier that can predict the educational value of any document from the web.

Our contributions are:

  • ⚡release of a low latency CPU based educational value classifier ("the classifier") that filters pretraining datasets to attain better LLM performance with the same number of training tokens, using a loosely defined, generic notion of educational value;
  • ⚡release of fineweb-edu-fasttext-classifier, trained on HuggingFaceFW/fineweb-edu-llama3-annotations, which uses an explicitly defined notion of educational value;
  • 📊detailed analysis of how educational value annotations differ between the two classifiers because of differences in prompt;
  • 🔎exploration of using an educational value classifier to evaluate pretraining datasets before pretraining and to mine domains with high educational value on the internet, thanks to its low latency.

2. Dataset Construction

To construct a dataset to train the classifier, we have to ensure the diversity of the dataset while staying within a limited compute budget. Phi-3-mini-128k-instruct[1] is used as the annotation model because it is compute efficient and demonstrates great performance in reasoning and language understanding given its small size. MiniPile[5] is used as the training and testing dataset because it was constructed with clustering and human guided exclusion. It demonstrates minimal performance degradation on GLUE despite training with only 1 million documents.

Prompt:

Task: Classify if the provided context has High or Low educational value for a student. Label is either High or Low.


Context: {text}
Label:<|end|>
<|assistant|>

We do not define educational value more explicitly because of the subjectivity involved, and because we are not certain about the capability of small language models.

The logits of the “High” and “Low” continuation tokens are used to frame a binary classification problem:

P(High Educational Value) = Logit("High") / (Logit("High") + Logit("Low"))
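As a small illustration of the formula above, the sketch below normalises the two continuation scores into P(High). The function name `p_high` is hypothetical, and the sketch assumes both scores are non-negative (for example token probabilities); how raw, possibly negative, logits were handled is not specified in this article.

```python
def p_high(score_high: float, score_low: float) -> float:
    """Normalise the "High" and "Low" continuation scores into P(High).

    Assumption: both scores are non-negative, so the ratio is in [0, 1].
    """
    if score_high < 0 or score_low < 0:
        raise ValueError("scores must be non-negative")
    total = score_high + score_low
    if total == 0:
        raise ValueError("at least one score must be positive")
    return score_high / total


# A document whose "High" score is three times its "Low" score.
print(p_high(3.0, 1.0))
```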

Afterwards, the probability is used to create 3 labels, which offer a higher granularity of educational value.

  • High (Top 25% educational value)
  • Mid (Middle 25-75% educational value)
  • Low (Bottom 25% educational value)

During inference, the educational value is calculated as follows:

Educational Value = 2 * P(High) + 1 * P(Mid) + 0 * P(Low)
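The three-way labelling can be sketched as follows. This is only an illustration under stated assumptions: the function name `assign_labels` is hypothetical, and the quartile cut-offs come from Python's `statistics.quantiles` with its default method, which may differ from how the cut-offs were computed for the released model.

```python
from statistics import quantiles


def assign_labels(p_high_values):
    """Label each document Low / Mid / High by the quartile of its P(High)."""
    # quantiles(..., n=4) returns the three cut points between quartiles.
    q1, _, q3 = quantiles(p_high_values, n=4)
    labels = []
    for p in p_high_values:
        if p >= q3:
            labels.append("High")  # top 25%
        elif p <= q1:
            labels.append("Low")   # bottom 25%
        else:
            labels.append("Mid")   # middle 25-75%
    return labels


scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
print(list(zip(scores, assign_labels(scores))))
```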

3. Model Training

fastText[6], in which word representations are averaged and fed to a linear classifier, is chosen as the modeling approach because it is fast enough to handle pretraining data with billions or trillions of tokens.
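For readers unfamiliar with fastText, its supervised mode reads one example per line, with each label prefixed by `__label__`. The sketch below only prepares lines in that format. The helper name `to_fasttext_line` and the example texts are hypothetical, and the hyperparameters used to train the released model are not stated in this article. A file of such lines could then be passed to `fasttext.train_supervised(input=path)`.

```python
import re


def to_fasttext_line(label: str, text: str) -> str:
    """Format one example as `__label__<label> <text>` on a single line."""
    # fastText treats each line as one example, so collapse all whitespace.
    single_line = re.sub(r"\s+", " ", text).strip()
    return f"__label__{label} {single_line}"


# Hypothetical examples, for illustration only.
examples = [
    ("High", "Logic is the study\nof correct reasoning."),
    ("Low", "lol   nice goal"),
]
for label, text in examples:
    print(to_fasttext_line(label, text))
```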

4. Evaluation

4.1 Classifier Evaluation

As the classifier is used to rank text data rather than to classify it, the Spearman rank-order correlation coefficient is measured. The coefficient between the predicted Educational Value and that of the test data is 0.7055, indicating a strong monotonic relationship.
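For reference, the Spearman coefficient is the Pearson correlation of the two rank vectors. The following is a minimal pure-Python sketch (the function names are ours, not from the evaluation code); in practice a library routine such as `scipy.stats.spearmanr` would normally be used.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# A monotonic but non-linear relationship still has perfect rank correlation.
print(spearman([1, 2, 3, 4], [1, 8, 27, 64]))
```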


4.2 Analysis

4.2.1 Manual Inspection

predict_educational_value(['''Logic is the study of correct reasoning. It includes both formal and informal logic. Formal logic is the study of deductively valid inferences or logical truths. It examines how conclusions follow from premises due to the structure of arguments alone, independent of their topic and content. Informal logic is associated with informal fallacies, critical thinking, and argumentation theory. It examines arguments expressed in natural language while formal logic uses formal language. When used as a countable noun, the term "a logic" refers to a logical formal system that articulates a proof system. Logic plays a central role in many fields, such as philosophy, mathematics, computer science, and linguistics.'''])
# Output [1.9266871362924576]
predict_educational_value(['''"Attention Is All You Need" is a landmark[1][2] 2017 research paper authored by eight scientists working at Google, responsible for expanding 2014 attention mechanisms proposed by Bahdanau et al. into a new deep learning architecture known as the transformer. The paper is considered by some to be a founding document for modern artificial intelligence, as transformers became the main architecture of large language models.[3][4] At the time, the focus of the research was on improving Seq2seq techniques for machine translation, but even in their paper the authors saw the potential for other tasks like question answering and for what is now called multimodal Generative AI.[5]'''])
# Output [1.8226698189973831]
predict_educational_value(['''A large language model (LLM) is a computational model notable for its ability to achieve general-purpose language generation and other natural language processing tasks such as classification. Based on language models, LLMs acquire these abilities by learning statistical relationships from text documents during a computationally intensive self-supervised and semi-supervised training process.[1] LLMs can be used for text generation, a form of generative AI, by taking an input text and repeatedly predicting the next token or word.[2]'''])
# Output [1.7609568238258362]
predict_educational_value(['''In Vapnik–Chervonenkis theory, the Vapnik–Chervonenkis (VC) dimension is a measure of the size (capacity, complexity, expressive power, richness, or flexibility) of a class of sets. The notion can be extended to classes of binary functions. It is defined as the cardinality of the largest set of points that the algorithm can shatter, which means the algorithm can always learn a perfect classifier for any labeling of at least one configuration of those data points. It was originally defined by Vladimir Vapnik and Alexey Chervonenkis.[1]'''])
# Output [1.589950144290924]
predict_educational_value(['''The query vector is compared (via dot product) with each word in the keys. This helps the model discover the most relevant word for the query word. In this case "girl" was determined to be the most relevant word for "that". The result (size 4 in this case) is run through the softmax function, producing a vector of size 4 with probabilities summing to 1. Multiplying this against the value matrix effectively amplifies the signal for the most important words in the sentence and diminishes the signal for less important words.[5] The structure of the input data is captured in the Wq and Wk weights, and the Wv weights express that structure in terms of more meaningful features for the task being trained for. For this reason, the attention head components are called Query (Wq), Key (Wk), and Value (Wv)—a loose and possibly misleading analogy with relational database systems.'''])
# Output [1.4657384157180786]
predict_educational_value(['''The Arsenal Football Club (commonly known as simply Arsenal) is an English professional football club based in Holloway, North London. Arsenal compete in the Premier League, the top flight of English football. In domestic football, Arsenal has won 13 league titles (including one unbeaten title), a record 14 FA Cups, two League Cups, 17 FA Community Shields, and a Football League Centenary Trophy. In European football, they have one European Cup Winners' Cup and one Inter-Cities Fairs Cup. In terms of trophies won, it is the third-most successful club in English football.[2]'''])
# Output [1.1015518307685852]
predict_educational_value(['''The 2003–04 season was Arsenal Football Club's 12th season in the Premier League and their 78th consecutive season in the top flight of English football.[3][4] It began on 1 July 2003 and concluded on 30 June 2004, with competitive matches played between August and May. The club ended the Premier League campaign as champions without a single defeat – a record of 26 wins and 12 draws. Arsenal fared less well in the cups, eliminated in the FA Cup and League Cup semi-finals to Manchester United and Middlesbrough respectively, and at the quarter-final stage of the UEFA Champions League to Chelsea.'''])
# Output [1.0146622359752655]
predict_educational_value(['''As both teams' first-choice kits featured a shade of red, Arsenal wore their yellow away strip, while Barcelona wore their traditional blue and maroon striped kit. Arsenal won the coin toss and Barcelona kicked off.[21] Barcelona almost immediately came under pressure when Thierry Henry shot straight at Barcelona goalkeeper Víctor Valdés, who conceded a corner. From the resulting corner Arsenal had another chance again courtesy of Henry, whose shot was again saved by Valdés. The next attack in the seventh minute resulted in Arsenal goalkeeper Jens Lehmann saving from Ludovic Giuly after he shot from a narrow angle. Four minutes later Barcelona were awarded a free-kick 35 yards from goal; Ronaldinho shot wide of the goal.'''])
# Output [0.7897453680634499]

From manual inspection, it can be noted that the model favors scientific knowledge. It also assigns moderate value to Arsenal as a football club; however, it does not consider a summary of a particular match to have good educational value. The fact that a document comes from Wikipedia does not by itself indicate that it has high educational value.
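The `predict_educational_value` function called above is not defined in this article. A minimal sketch of how such a function could be built is shown below, under these assumptions: the model exposes a fastText-style `predict(texts, k=-1)` that returns per-text labels and probabilities, and the labels are named `__label__High`, `__label__Mid` and `__label__Low`. `make_predictor` and `StubModel` are hypothetical names; the stub stands in for a real model (for example one loaded with `fasttext.load_model`) so that the sketch runs without any download.

```python
import re

# Weights from the formula: 2 * P(High) + 1 * P(Mid) + 0 * P(Low).
# The label names are an assumption about the trained model.
LABEL_WEIGHTS = {"__label__High": 2.0, "__label__Mid": 1.0, "__label__Low": 0.0}


def make_predictor(model):
    """Wrap a fastText-style model into a predict_educational_value function."""

    def predict_educational_value(text_list):
        # fastText predicts one line at a time, so newlines are collapsed.
        cleaned = [re.sub(r"\n+", " ", text) for text in text_list]
        labels, probs = model.predict(cleaned, k=-1)
        return [
            float(sum(LABEL_WEIGHTS.get(label, 0.0) * prob
                      for label, prob in zip(doc_labels, doc_probs)))
            for doc_labels, doc_probs in zip(labels, probs)
        ]

    return predict_educational_value


class StubModel:
    """Stand-in for a trained model; always returns the same probabilities."""

    def predict(self, texts, k=-1):
        labels = [["__label__High", "__label__Mid", "__label__Low"] for _ in texts]
        probs = [[0.5, 0.25, 0.25] for _ in texts]
        return labels, probs


predict_educational_value = make_predictor(StubModel())
print(predict_educational_value(["Logic is the study of correct reasoning."]))
```

With the stub's fixed probabilities the score is 2 * 0.5 + 1 * 0.25 + 0 * 0.25 = 1.25; a real model would produce document-dependent values in the range 0 to 2.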

4.2.3 Model Training With and Without Classifier

To validate that filtering with the classifier leads to better performance at the same number of training tokens, two 192M models were trained for 6000 global steps.

| Task | Training on FineWeb With Filtering | Training on FineWeb Without Filtering | Training with Cosmopedia |
|---|---|---|---|
| arc-easy | 37.37 | 34.97 | 37.45 |
| arc-challenge | 23.55 | 22.95 | 23.21 |
| Hellaswag | 28.02 | 27.92 | 27.78 |
| MMLU | 24.71 | 23.94 | 24.65 |
| TruthfulQA | 45.88 | 45.20 | 45.97 |
| Winogrande | 49.49 | 50.59 | 50.67 |

Reasoning and commonsense reasoning performance seems to be better on most tasks when the filter is on (Winogrande is the exception), which aligns with expectation. It is also close to that of Cosmopedia.
MMLU is also better; however, it is close to random because of limitations in compute (both training time and model size).
Larger models will be trained to further validate this claim.

4.2.4 Web Domain Name Analysis

The expectation is that most educational value comes from the websites of universities and schools, research institutes and organizations.
Since HuggingFaceFW/fineweb contains the URL of each crawled page, the average educational value of each domain name can be calculated.
The first 10M records have been analyzed. Full file here.

Below are the top 100 domain names with at least 100 records.

[Figure: top 100 domain names by average educational value]
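The per-domain aggregation described in this section can be sketched as below. The function name `average_by_domain` and the sample records are hypothetical, and using the URL's host (`netloc`) as the "domain name" is a simplification on our part.

```python
from collections import defaultdict
from urllib.parse import urlparse


def average_by_domain(records, min_records=100):
    """Average educational value per domain.

    records: iterable of (url, educational_value) pairs.
    Returns (domain, mean_value, count) tuples for domains with at least
    `min_records` records, sorted by mean value in descending order.
    """
    totals = defaultdict(lambda: [0.0, 0])  # domain -> [sum, count]
    for url, value in records:
        domain = urlparse(url).netloc.lower()
        totals[domain][0] += value
        totals[domain][1] += 1
    rows = [
        (domain, total / count, count)
        for domain, (total, count) in totals.items()
        if count >= min_records
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hypothetical records, for illustration only.
sample = [
    ("https://a.edu/x", 2.0),
    ("https://a.edu/y", 1.0),
    ("http://b.com/z", 0.5),
]
print(average_by_domain(sample, min_records=1))
```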

4.2.5 Existing Pretraining Dataset

The classifier is applied to various datasets, and the results align with expectation.

  • In general, synthetic data has higher educational value because it is created to be educational by design.
  • For real data, HuggingFaceFW/fineweb and Dolma v1_7, which applied the quality filter described here, have the highest educational value among the real web datasets.
  • In general, the later a dataset is released, the higher its educational value, because of the increasing focus on data quality in the research community.
  • The textbook category (mostly synthetic) scores the highest because these datasets are created for educational value, reflecting the effectiveness of this model.
  • The maths/paper category scores the second highest because of its density of knowledge.
  • Wikipedia scores comparatively lower because it also contains information (e.g. the result of a match, the awards of a movie star) that has lower educational value.
  • Web data scores low (if no filtering is applied) because it contains all domains.
  • Memes score the lowest, as expected. Hateful memes score almost zero.

It is not surprising that pretraining data with higher educational value leads to better LLM benchmark performance. With a reasonable number of experiment runs, researchers and practitioners could therefore predict benchmark performance from educational value even before training, by regressing performance against educational value.

There are two compute bottlenecks: model training compute and educational value inference compute. The second bottleneck is removed by the proposed classifier, which can run inference on massive data with a throughput of more than 2000 documents per second.

| Dataset | Sampling | Average Educational Value | Type |
|---|---|---|---|
| SciPhi/textbooks-are-all-you-need-lite | First 100,000 | 1.846 | Synthetic |
| nampdn-ai/tiny-orca-textbooks | First 100,000 | 1.673 | Synthetic |
| HuggingFaceTB/cosmopedia stanford | First 100,000 | 1.673 | Synthetic |
| vikp/textbook_quality_programming | First 100,000 | 1.663 | Synthetic |
| HuggingFaceTB/cosmopedia web_samples_v1 | First 100,000 | 1.618 | Synthetic |
| nampdn-ai/tiny-textbooks | First 100,000 | 1.586 | Synthetic |
| HuggingFaceTB/cosmopedia web_samples_v2 | First 100,000 | 1.562 | Synthetic |
| HuggingFaceTB/cosmopedia openstax | First 100,000 | 1.462 | Synthetic |
| HuggingFaceTB/cosmopedia wikihow | First 100,000 | 1.422 | Synthetic |
| HuggingFaceTB/cosmopedia khanacademy | First 100,000 | 1.419 | Synthetic |
| HuggingFaceTB/cosmopedia auto_math_text | First 100,000 | 1.347 | Synthetic |
| armanc/scientific_papers pubmed | First 100,000 | 1.260 | Real |
| HuggingFaceTB/cosmopedia stories | First 100,000 | 1.154 | Synthetic |
| teknium/OpenHermes-2.5 | First 100,000 | 1.121 | Synthetic |
| timdettmers/openassistant-guanaco | First 100,000 | 1.115 | Real |
| open-web-math/open-web-math | First 100,000 | 1.089 | Real |
| armanc/scientific_papers arxiv | First 100,000 | 1.068 | Real |
| HuggingFaceFW/fineweb | First 100,000 | 1.056 | Real |
| NousResearch/dolma-v1_7-305B* | First 100,000 | 1.037 | Real |
| tatsu-lab/alpaca | First 100,000 | 1.020 | Synthetic |
| BEE-spoke-data/fineweb-100k_en-med | First 100,000 | 1.019 | Real |
| JeanKaddour/minipile | First 100,000 | 0.998 | Real |
| togethercomputer/RedPajama-Data-V2 en 2023-06 | First 100,000 | 0.985 | Real |
| wikipedia en 20220301 | First 100,000 | 0.975 | Real |
| Replete-AI/code_bagel | First 100,000 | 0.950 | Synthetic |
| allenai/c4 en | First 100,000 | 0.934 | Real |
| mattymchen/refinedweb-3m | First 100,000 | 0.857 | Real |
| iamtarun/python_code_instructions_18k_alpaca | First 100,000 | 0.849 | Synthetic |
| tiiuae/falcon-refinedweb | First 100,000 | 0.835 | Real |
| BEE-spoke-data/FineMeme-100k | First 100,000 | 0.716 | Real |
| neuralcatcher/hateful_memes | First 100,000 | 0.070 | Real |

* We encountered an issue that prevented us from processing the original allenai/dolma.

4.2.6. Benchmark with HuggingFaceTB/fineweb-edu-classifier

Our work was performed independently of fineweb-edu-classifier, as our model was released in mid-May 2024. We are glad to see that HuggingFace FineWeb-Edu has validated our original research goal, that training with an educational value classifier leads to better LLM performance at the same number of training tokens, at a larger scale than we could reach with our limited budget.

While the objective of both is to classify the educational value of a document, it is interesting to note the differences, as tabulated below:

| | HuggingFaceTB/fineweb-edu-classifier (fineweb-edu-classifier) | kenhktsui/llm-data-textbook-quality-fasttext-classifier-v2 (the classifier) |
|---|---|---|
| Training Dataset | Sample from FineWeb (450k) | MiniPile (1 million) |
| Granularity of Label in LLM Annotation | 6 classes (explicitly defined) | 2 classes (generic) |
| Label Construction | LLM Annotation | Logits of LLM continuation |
| Annotation Model | Llama-3-70B-Instruct | Phi-3-mini-128k-instruct |
| Modeling Approach | Transformer Based Model with Classification Head | fastText Text Classification |

fineweb-edu-classifier was tested by training larger language models and evaluating them across different benchmarks, so it is helpful to validate our model against fineweb-edu-classifier.

4.2.6.1 MiniPile Test Dataset

On the MiniPile test split, the Spearman Correlation between the classifier and fineweb-edu-classifier is 0.4108. When we calculate the average educational value from the classifier, grouped by the score predicted by fineweb-edu-classifier, it can be seen that the classifier distinguishes class < 2 from class >= 2 quite well, but does not distinguish among class 2, class 3 and class 4 very well.

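[Figure: average educational value from the classifier, grouped by the score predicted by fineweb-edu-classifier]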
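The grouped average used in this comparison can be computed as in the sketch below. The function name `mean_value_by_class` and the toy inputs are hypothetical, and the sketch assumes the fineweb-edu-classifier predictions have already been converted to integer classes.

```python
from collections import defaultdict


def mean_value_by_class(fineweb_edu_classes, educational_values):
    """Average the classifier's educational value within each fineweb-edu class."""
    grouped = defaultdict(list)
    for cls, value in zip(fineweb_edu_classes, educational_values):
        grouped[cls].append(value)
    # Sort by class so the result reads from class 0 upwards.
    return {cls: sum(vals) / len(vals) for cls, vals in sorted(grouped.items())}


# Toy inputs, for illustration only.
print(mean_value_by_class([0, 0, 2, 2, 3], [0.5, 1.0, 1.5, 1.5, 2.0]))
```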

To further validate this claim, we formulate it as a binary classification problem where:

  • Label 0 if fineweb-edu-classifier predicts [0, 1]
  • Label 1 otherwise

The macro average F1 score is 0.67.
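The binary comparison above can be sketched as follows. The helper names are ours, and the threshold of 1.0 used to binarise the classifier's own educational value is an assumption; the article does not state which cut-off was used for this F1 calculation.

```python
def f1_for_class(y_true, y_pred, positive):
    """F1 score treating `positive` as the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


def binarise_reference(fineweb_edu_classes):
    """Label 0 if fineweb-edu-classifier predicts class 0 or 1, else label 1."""
    return [0 if c < 2 else 1 for c in fineweb_edu_classes]


def binarise_classifier(educational_values, threshold=1.0):
    """Assumed cut-off: label 1 when the educational value is >= threshold."""
    return [1 if v >= threshold else 0 for v in educational_values]


# Toy inputs, for illustration only.
y_true = binarise_reference([0, 1, 2, 4])
y_pred = binarise_classifier([0.4, 1.2, 1.5, 1.9])
print(macro_f1(y_true, y_pred))
```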

We released a benchmark dataset in which the predictions of both models are calculated for every document in MiniPile, so that interested readers can compare the differences and similarities of the results.

4.2.6.2 FineWeb-Edu Dataset

We further applied the classifier to the first 100,000 records of FineWeb-Edu. The average educational value is 1.37, which is higher than that of every real dataset listed in Section 4.2.5.

The distribution of educational value is right-skewed, with 87.18% of the records having an educational value >= 1.0, meaning we would keep 87.18% of the data if our classifier were applied with a threshold of 1.0.

[Figure: distribution of educational value on the first 100,000 records of FineWeb-Edu]
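The keep rate quoted above is simply the share of documents at or above the threshold. A minimal sketch, with hypothetical function names and toy values, is:

```python
def keep_fraction(educational_values, threshold=1.0):
    """Fraction of documents whose educational value is >= threshold."""
    if not educational_values:
        raise ValueError("no documents to score")
    kept = sum(1 for value in educational_values if value >= threshold)
    return kept / len(educational_values)


def filter_documents(documents, educational_values, threshold=1.0):
    """Keep only the documents that meet the threshold."""
    return [doc for doc, value in zip(documents, educational_values)
            if value >= threshold]


# Toy values, for illustration only.
values = [0.5, 1.0, 1.5, 2.0]
print(keep_fraction(values))
print(filter_documents(["a", "b", "c", "d"], values))
```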

4.2.6.3 fastText vs Transformer Based Model

To understand how much the methodological difference contributes to the difference in prediction, another fastText classifier ("fineweb-edu-fasttext-classifier") is trained on HuggingFaceFW/fineweb-edu-llama3-annotations.

| Label | kenhktsui/fineweb-edu-fasttext-classifier | HuggingFaceFW/fineweb-edu-classifier |
|---|---|---|
| 0 | 0.55 | 0.59 |
| 1 | 0.80 | 0.81 |
| 2 | 0.50 | 0.59 |
| 3 | 0.39 | 0.53 |
| 4 | 0.06 | 0.44 |
| 5 | 0.00 | 0.02 |

Labels 0, 1 and 2 are comparable to the original model. The performance degradation starts to be noticeable at label 3 and widens further at label 4, which is due to the limited capacity of the fastText model. This aligns with the observation in Section 4.2.6.1.

| Model | Spearman Correlation in MiniPile Test |
|---|---|
| fineweb-edu-fasttext-classifier | 0.5832 |
| llm-data-textbook-quality-fasttext-classifier-v2 | 0.4108 |

The Spearman Correlation between fineweb-edu-fasttext-classifier and HuggingFaceFW/fineweb-edu-classifier on the MiniPile test split is 0.5832, which is not higher despite the same training data. The main reason is that the fastText model does not capture the highest educational value well, given its limited capacity. The rest of the difference can be attributed to the training dataset, label construction and annotation model, as described in Section 4.2.6.

4.2.6.4 Prompt Difference in Education Value Annotation

There are 1,778 records where our classifier predicts an educational value >= 1, while fineweb-edu-classifier predicts [0, 1]. To isolate the annotation model difference, we prompted Phi-3-mini-4k-instruct with both our prompt and fineweb-edu-classifier's prompt.

Out of the records with an extractable score, 45% keep the same rating as Llama-3-70B-Instruct, while 33% are rated 1 point higher and 13% are rated 1 point lower, which reflects the annotation model difference between Phi-3-mini-4k-instruct and Llama-3-70B-Instruct.

The rest of the differences can be attributed to the definition of educational value. By inspection, these are the reasons why fineweb-edu-classifier predicts a lower score, which align with the specificity of its prompt:

  • complexity unsuitable for primary or grade school students
  • does not align closely to educational standards or provide extensive learning material suitable for primary or grade school levels

The classifier used a more generic and implicit prompt, without explicit instructions for educational value annotation: it is not limited to primary or grade school students, and it does not enforce adherence to educational standards.

The different definitions (loosely defined and explicitly defined) of educational value explain most of the difference. Neither classifier is universally better, as the choice is use case specific. In some circumstances, the best option could be a combination of both, or of more classifiers.

For the full dataset, please refer to kenhktsui/edu-value-annotation-difference-hf-edu-score-le2-tbq-v2-score-ge1.

4.2.7 Limitation of the Classifier

It is known that the classifier cannot detect hallucination, and it will not perform well on non-web data, which is not what it is trained on.

5. Discussion and Future Work

In the past, the mainstream approach was to scale up language models, and then data, to achieve SOTA results. It is very welcome to see more and more effort being put into data quality, in addition to scaling up model parameters.

The low latency classifier and fineweb-edu-fasttext-classifier present a promising way to 1) filter datasets in a cheap and scalable way and 2) evaluate pretraining datasets at scale before pretraining, which will help researchers and practitioners with fewer compute resources to train large or small language models more efficiently.

We expect the research community to put even more effort into data quality in the future, and there are several directions worth exploring.

Definition of Educational Value: As seen in Section 4.2.6.4, educational value is a very subjective matter because it varies from person to person. For example, to an accountant, machine learning knowledge might not have as high an educational value as international financial reporting standards do. Our attempt tries to be as implicit as possible so that it captures the “average” educational value for a student. Ideally, it should be highly personalized.

Scaling Law of Educational Value: With more experiment runs available and the educational value known before pretraining, meta-analysis could be used as a proxy to predict LLM performance before training. The educational value not only promotes focus on and standardisation of data quality, but also facilitates LLM personalisation.

Active Crawling and Licensing of Data: Instead of passively relying on snapshots of Common Crawl, which is only a subset of web data, with %url uncorrelated with educational value, active crawling and licensing can be done once domains of high educational value are identified. Section 4.2.4 provides a starting point.

Multilingual and Multimodality: There is no reason not to extend the finding that training data of higher educational value leads to higher model performance to other languages and modalities.

Limit of Small and Large Language Model: How far can a small language model go given a perfectly educational dataset? How far can a large language model go given a perfectly educational dataset?

6. References

[1] Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, Alon Benhaim, Misha Bilenko, Johan Bjorck, Sébastien Bubeck, Martin Cai, Caio César Teodoro Mendes, Weizhu Chen, Vishrav Chaudhary, Parul Chopra, Allie Del Giorno, Gustavo de Rosa, Matthew Dixon, Ronen Eldan, Dan Iter, Amit Garg, Abhishek Goswami, Suriya Gunasekar, Emman Haider, Junheng Hao, Russell J. Hewett, Jamie Huynh, Mojan Javaheripi, Xin Jin, Piero Kauffmann, Nikos Karampatziakis, Dongwoo Kim, Mahoud Khademi, Lev Kurilenko, James R. Lee, Yin Tat Lee, Yuanzhi Li, Chen Liang, Weishung Liu, Eric Lin, Zeqi Lin, Piyush Madan, Arindam Mitra, Hardik Modi, Anh Nguyen, Brandon Norick, Barun Patra, Daniel Perez-Becker, Thomas Portet, Reid Pryzant, Heyang Qin, Marko Radmilac, Corby Rosset, Sambudha Roy, Olatunji Ruwase, Olli Saarikivi, Amin Saied, Adil Salim, Michael Santacroce, Shital Shah, Ning Shang, Hiteshi Sharma, Xia Song, Masahiro Tanaka, Xin Wang, Rachel Ward, Guanhua Wang, Philipp Witte, Michael Wyatt, Can Xu, Jiahang Xu, Sonali Yadav, Fan Yang, Ziyi Yang, Donghan Yu, Chengruidong Zhang, Cyril Zhang, Jianwen Zhang, Li Lyna Zhang, Yi Zhang, Yue Zhang, Yunan Zhang, and Xiren Zhou. Phi-3 technical report: A highly capable language model locally on your phone, 2024.
[2] https://github.com/meta-llama/llama3
[3] Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., Casas, D. d. l., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
[4] Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio Cesar Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, et al. Textbooks are all you need. arXiv preprint arXiv:2306.11644, 2023.
[5] Jean Kaddour. The minipile challenge for data-efficient language models. arXiv preprint arXiv:2304.08442, 2023.
[6] Joulin, A., Grave, E., Bojanowski, P., and Mikolov, T. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759, 2016.

Citation

To cite this blog, please use:

@misc{ktsui2024cpueduvalue,
      title={Low Latency CPU Based Educational Value Classifier With Generic Educational Value}, 
      author={Ken Tsui and Huu Nguyen},
      year={2024},
}