---
license: odc-by
language:
- en
library_name: fasttext
pipeline_tag: text-classification
datasets:
- HuggingFaceFW/fineweb-edu-llama3-annotations
---
# FineWeb-Edu FastText classifier
## Model summary
This is a FastText classifier that judges the educational value of web pages. It is trained on [fineweb-edu-llama3-annotations](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu-llama3-annotations).
It has two objectives:
- ⚡ Throughput optimisation: it can classify more than 2000 examples per second on CPU, so it can be used on the fly during pretraining or to process huge datasets on CPU alone.
- 🧪 FastText vs transformer-based model: how does this lightweight model with limited capacity compare to the original model [HuggingFaceFW/fineweb-edu-classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier)?
The FastText approach is inspired by my independent development of an educational classifier based on a different definition of educational value, which can be found at [kenhktsui/llm-data-textbook-quality-fasttext-classifier-v2](https://huggingface.co/kenhktsui/llm-data-textbook-quality-fasttext-classifier-v2).
## 🛠️Usage
```python
from typing import List
import re

from huggingface_hub import hf_hub_download
import fasttext

# Download the model weights from the Hugging Face Hub and load them.
model_hf = fasttext.load_model(hf_hub_download("kenhktsui/fineweb-edu-fasttext-classifier", "model.bin"))


def replace_newlines(text: str) -> str:
    # fastText predicts one line at a time, so collapse newlines into spaces.
    return re.sub("\n+", " ", text)


def predict(text_list: List[str]) -> List[dict]:
    text_list = [replace_newlines(text) for text in text_list]
    pred = model_hf.predict(text_list)
    # Labels come back as "__label__<n>"; drop the prefix to get the integer rating.
    return [{"label": int(l[0].replace("__label__", "")), "score": s[0]}
            for l, s in zip(*pred)]


predict(["Hi"])
# Output: [{'label': 0, 'score': 1.00001}]
```
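To check the throughput figure on your own hardware, a small timing helper along the following lines may be useful. This is a minimal sketch: `measure_throughput`, `dummy_predict`, and the batch size of 1024 are illustrative assumptions, not part of the model's API, and the stand-in classifier is only there so the snippet runs without downloading the model.

```python
import time
from typing import Callable, List


def measure_throughput(predict_fn: Callable[[List[str]], list],
                       texts: List[str],
                       batch_size: int = 1024) -> float:
    """Return the number of examples classified per second by predict_fn."""
    if not texts:
        raise ValueError("texts must not be empty")
    start = time.perf_counter()
    for i in range(0, len(texts), batch_size):
        predict_fn(texts[i:i + batch_size])
    elapsed = time.perf_counter() - start
    # Guard against a zero reading from very fast stand-in functions.
    return len(texts) / max(elapsed, 1e-9)


# Hypothetical stand-in classifier; swap in the `predict` function
# from the usage example to time the real model.
def dummy_predict(batch: List[str]) -> List[dict]:
    return [{"label": 0, "score": 1.0} for _ in batch]


rate = measure_throughput(dummy_predict, ["some web page text"] * 10_000)
print(f"{rate:,.0f} examples/second")
```

Results will vary with document length and CPU, so it is worth timing on a sample of your actual data.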
## 📊Evaluation
The last 46867 samples are held out as test data. Note that this is not the exact test set used for [HuggingFaceFW/fineweb-edu-classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier).
### Classification Report
```
              precision    recall  f1-score   support

           0       0.72      0.44      0.55      5704
           1       0.73      0.87      0.80     26595
           2       0.52      0.49      0.50     10350
           3       0.48      0.33      0.39      3397
           4       0.69      0.03      0.06       819
           5       0.00      0.00      0.00         2

    accuracy                           0.68     46867
   macro avg       0.52      0.36      0.38     46867
weighted avg       0.67      0.68      0.66     46867
```
The table below compares the per-label F1 score of this FastText model with that of the transformer-based model.
Label|This Model| [HuggingFaceFW/fineweb-edu-classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier)
-----|-----|----
0|0.55 | 0.59
1|0.80 | 0.81
2|0.50 | 0.59
3|0.39 | 0.53
4|0.06 | 0.44
5|0.00 | 0.02
Labels 0, 1 and 2 are comparable to the original model.
The performance degradation becomes noticeable at label 3 and widens further at label 4, which is due to the limited capacity of the FastText model.
So this classifier performs reasonably well on labels 0, 1 and 2, and also on label 3 with some degradation.
### Confusion Matrix
```
        [ 2537  3098    65     4     0     0]
        [  944 23037  2491   123     0     0]
y_true  [   26  4742  5048   533     1     0]
        [    4   434  1846  1105     8     0]
        [    0    38   213   544    24     0]
        [    0     0     0     0     2     0]
                       y_pred
```
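The per-label figures in the classification report can be recomputed from this confusion matrix. The sketch below does so with plain Python. The helper names `safe_div` and `per_label_metrics` are illustrative, and precision is taken as 0 when a label is never predicted, which is assumed to match how the report treats label 5.

```python
# Confusion matrix from this section: rows are true labels, columns are predictions.
CM = [
    [2537,  3098,   65,    4,  0, 0],
    [ 944, 23037, 2491,  123,  0, 0],
    [  26,  4742, 5048,  533,  1, 0],
    [   4,   434, 1846, 1105,  8, 0],
    [   0,    38,  213,  544, 24, 0],
    [   0,     0,    0,    0,  2, 0],
]


def safe_div(num: float, den: float) -> float:
    """Divide, returning 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def per_label_metrics(cm):
    """Return one (precision, recall, f1, support) tuple per label."""
    n = len(cm)
    rows = []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum
        support = sum(cm[k])                           # row sum
        precision = safe_div(tp, predicted_k)
        recall = safe_div(tp, support)
        f1 = safe_div(2 * precision * recall, precision + recall)
        rows.append((precision, recall, f1, support))
    return rows


metrics = per_label_metrics(CM)
total = sum(map(sum, CM))
accuracy = sum(CM[k][k] for k in range(len(CM))) / total

for label, (p, r, f1, support) in enumerate(metrics):
    print(f"{label}  {p:.2f}  {r:.2f}  {f1:.2f}  {support}")
print(f"accuracy  {accuracy:.2f}  {total}")
```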
The model has an accuracy of 68%, and it is more likely to underpredict educational value than to overpredict it. This conservatism is useful when filtering large amounts of data. The table below breaks down the difference between the predicted and the actual rating.
Predicted - Actual Rating | Frequency | %
-----|-----|----
0|31751 | 67.7%
-1|8078 | 17.2%
+1| 6130 | 13.1%
-2|673 | 1.4%
+2|189 | 0.4%
-3|42 | 0.1%
+3|4 | 0.0%
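The frequencies in this table follow directly from the confusion matrix: each cell contributes its count to the offset `predicted - actual`. A short sketch of that arithmetic, with illustrative variable names:

```python
from collections import Counter

# Confusion matrix from the section above: rows are true labels, columns are predictions.
CM = [
    [2537,  3098,   65,    4,  0, 0],
    [ 944, 23037, 2491,  123,  0, 0],
    [  26,  4742, 5048,  533,  1, 0],
    [   4,   434, 1846, 1105,  8, 0],
    [   0,    38,  213,  544, 24, 0],
    [   0,     0,    0,    0,  2, 0],
]

# Accumulate counts by (predicted label - actual label).
offsets = Counter()
for true_label, row in enumerate(CM):
    for pred_label, count in enumerate(row):
        if count:
            offsets[pred_label - true_label] += count

total = sum(offsets.values())
under = sum(c for d, c in offsets.items() if d < 0)  # predicted below actual
over = sum(c for d, c in offsets.items() if d > 0)   # predicted above actual

for diff in sorted(offsets, key=lambda d: (abs(d), d)):
    print(f"{diff:+d}  {offsets[diff]:>6}  {offsets[diff] / total:.1%}")
print(f"underpredicted: {under / total:.1%}, overpredicted: {over / total:.1%}")
```

Summing the negative and positive offsets separately makes the asymmetry explicit: more test samples are rated too low than too high.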
### Alignment with [HuggingFaceFW/fineweb-edu-classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier)
The Spearman rank-order correlation coefficient with the original classifier is 0.5881 on the MiniPile train split and 0.5832 on the test split, indicating a moderately strong monotonic relationship across more than 1 million representative web documents.
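For readers who want to run a similar alignment check on their own data, the sketch below computes the Spearman coefficient as the Pearson correlation of average ranks, using only the standard library. It is not the script used for the figures above, and the five example scores are hypothetical. Average ranks matter here because discrete labels produce many ties.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y):
        raise ValueError("inputs must have the same length")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores for five documents (not taken from the evaluation).
fasttext_labels = [0, 1, 1, 2, 3]
transformer_scores = [0.2, 1.4, 0.9, 1.1, 2.7]
rho = spearman(fasttext_labels, transformer_scores)
print(f"Spearman rho: {rho:.4f}")
```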