---
license: odc-by
language:
- en
library_name: fasttext
pipeline_tag: text-classification
datasets:
- HuggingFaceFW/fineweb-edu-llama3-annotations
---

# FineWeb-Edu FastText classifier

## Model summary

This is a FastText classifier for judging the educational value of web pages, trained on [fineweb-edu-llama3-annotations](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu-llama3-annotations).

There are two objectives:

- ⚡ Throughput optimisation: it can classify more than 2,000 examples per second on CPU, so it can be used on the fly during pretraining to process huge amounts of data with CPU only (see the timing sketch after this list).
- 🧪 FastText vs transformer-based model: how does this lightweight, limited-capacity model compare to the original model [HuggingFaceFW/fineweb-edu-classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier)?
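
A minimal sketch of how the throughput claim could be checked, assuming the model has been downloaded as shown in the Usage section below; the repeated sample document and batch size are illustrative, and actual throughput depends on document length and hardware:

```python
import time

import fasttext
from huggingface_hub import hf_hub_download

model = fasttext.load_model(hf_hub_download("kenhktsui/fineweb-edu-fasttext-classifier", "model.bin"))

# Illustrative batch: 10,000 copies of a short single-line document.
docs = ["This tutorial explains photosynthesis to middle school students."] * 10_000

start = time.perf_counter()
model.predict(docs)  # returns (labels, probabilities) for every document
elapsed = time.perf_counter() - start
print(f"{len(docs) / elapsed:.0f} documents/second on CPU")
```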

The FastText approach is inspired by my independent development of an educational classifier based on a different definition of educational value, which can be found at [kenhktsui/llm-data-textbook-quality-fasttext-classifier-v2](https://huggingface.co/kenhktsui/llm-data-textbook-quality-fasttext-classifier-v2).

## 🛠️Usage

```python
from typing import List
import re

from huggingface_hub import hf_hub_download
import fasttext


# Download the classifier from the Hub and load it.
model_hf = fasttext.load_model(hf_hub_download("kenhktsui/fineweb-edu-fasttext-classifier", "model.bin"))


def replace_newlines(text: str) -> str:
    # FastText expects single-line input, so collapse newlines into spaces.
    return re.sub("\n+", " ", text)


def predict(text_list: List[str]) -> List[dict]:
    text_list = [replace_newlines(text) for text in text_list]
    # predict returns a tuple of (labels, probabilities), one entry per input text.
    pred = model_hf.predict(text_list)
    return [{"label": int(l[0].removeprefix("__label__")), "score": s[0]}
            for l, s in zip(*pred)]


predict(["Hi"])
# Output: [{'label': 0, 'score': 1.00001}]
```
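
As an illustration of the on-the-fly use case, below is a sketch of how the `predict` helper above could filter a streamed corpus during preprocessing; the dataset name, the score threshold of 3, and the batch size are assumptions for the example, not recommendations from this card:

```python
from datasets import load_dataset  # assumes the `datasets` library is installed

# Stream an illustrative web corpus; any dataset with a "text" column would do.
stream = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT", split="train", streaming=True)

kept, batch = [], []
for example in stream:
    batch.append(example["text"])
    if len(batch) == 1024:  # illustrative batch size
        for text, pred in zip(batch, predict(batch)):
            if pred["label"] >= 3:  # keep only documents judged highly educational (assumed threshold)
                kept.append(text)
        batch = []
    if len(kept) >= 100:  # stop early for the demo
        break
```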

## 📊Evaluation

The last 46,867 samples of the annotation dataset are used as test data; note that this is not exactly the same test split as the one used for [HuggingFaceFW/fineweb-edu-classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier).

### Classification Report

```
              precision    recall  f1-score   support

           0       0.72      0.44      0.55      5704
           1       0.73      0.87      0.80     26595
           2       0.52      0.49      0.50     10350
           3       0.48      0.33      0.39      3397
           4       0.69      0.03      0.06       819
           5       0.00      0.00      0.00         2

    accuracy                           0.68     46867
   macro avg       0.52      0.36      0.38     46867
weighted avg       0.67      0.68      0.66     46867
```
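
For reference, a sketch of how such a report could be reproduced with scikit-learn, assuming `texts` holds the held-out documents and `y_true` their annotated scores (both variable names are illustrative):

```python
from sklearn.metrics import classification_report, confusion_matrix

# `predict` is the helper from the Usage section; `texts` and `y_true` are the
# held-out documents and their annotated educational scores (0-5).
y_pred = [p["label"] for p in predict(texts)]

print(classification_report(y_true, y_pred, digits=2))
print(confusion_matrix(y_true, y_pred))  # rows = true label, columns = predicted label
```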

The table below compares the per-label F1 score of the FastText model against the transformer-based model.

Label|This Model|[HuggingFaceFW/fineweb-edu-classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier)
-----|-----|----
0|0.55|0.59
1|0.80|0.81
2|0.50|0.59
3|0.39|0.53
4|0.06|0.44
5|0.00|0.02

Labels 0, 1, and 2 are comparable to the original model. The performance degradation becomes noticeable at label 3 and widens further at label 4, owing to the limited capacity of the FastText model. So this classifier performs reasonably well for labels 0, 1, and 2, and for label 3 with some degradation.

### Confusion Matrix

```
                        y_pred
               0      1      2      3      4      5
y_true 0 [  2537   3098     65      4      0      0]
y_true 1 [   944  23037   2491    123      0      0]
y_true 2 [    26   4742   5048    533      1      0]
y_true 3 [     4    434   1846   1105      8      0]
y_true 4 [     0     38    213    544     24      0]
y_true 5 [     0      0      0      0      2      0]
```

The model has an accuracy of 68%, and it is more likely to underpredict educational value than to overpredict it. This conservatism is desirable when filtering large amounts of data.

Predicted - Actual Rating|Frequency|%
-----|-----|----
0|31751|67.7%
-1|8078|17.2%
+1|6130|13.1%
-2|673|1.4%
+2|189|0.4%
-3|42|0.1%
+3|4|0.0%
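
A sketch of how this error distribution could be derived from the same `y_true` and `y_pred` arrays used above (names are illustrative):

```python
from collections import Counter

# Difference between the predicted and annotated score for every held-out sample.
diff_counts = Counter(p - t for p, t in zip(y_pred, y_true))
total = sum(diff_counts.values())
for diff, freq in sorted(diff_counts.items(), key=lambda kv: -kv[1]):
    print(f"{diff:+d}|{freq}|{freq / total:.1%}")
```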

### Alignment with [HuggingFaceFW/fineweb-edu-classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier)

The Spearman rank-order correlation coefficient is 0.5881 on the MiniPile train split and 0.5832 on the test split, indicating a moderately strong monotonic relationship across over 1 million representative web documents.
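
A sketch of how this alignment could be measured, assuming `fasttext_scores` and `transformer_scores` hold the integer scores that this model and [HuggingFaceFW/fineweb-edu-classifier](https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier) assign to the same MiniPile documents (both names are illustrative):

```python
from scipy.stats import spearmanr

# Assumed inputs: per-document scores from the two classifiers over the same corpus.
corr, p_value = spearmanr(fasttext_scores, transformer_scores)
print(f"Spearman rank-order correlation: {corr:.4f} (p = {p_value:.3g})")
```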