NB Education Quality Regressor
Introduction
This model rates the quality of Norwegian training corpora by their educational content. It predicts a continuous score (a float from 0 to 5) for the educational value of a Norwegian text. The model is inspired by the classifiers used in the FineWeb project and is trained mainly on Norwegian content.
Model Architecture
It is trained on top of the nb-bert-base model and utilizes code from Cosmopedia.
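In transformers terms this corresponds to a single-output regression head on top of the encoder. The sketch below shows one way to set that up; it is an illustration, not a copy of the training script's configuration:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NbAiLab/nb-bert-base")
# num_labels=1 with problem_type="regression" gives a single continuous
# output trained with MSE loss instead of a softmax classifier.
model = AutoModelForSequenceClassification.from_pretrained(
    "NbAiLab/nb-bert-base",
    num_labels=1,
    problem_type="regression",
)
```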
Training Data
The dataset used for training is derived from GlotCC and has been annotated using Gemini 1.5 Flash.
Purpose
The performance of large language models (LLMs) heavily depends on the quality and size of their pretraining datasets. This regressor aims to assess and enhance the educational value of Norwegian textual data, contributing to better-performing Norwegian LLMs.
This model is part of a pair; the other is the NB Linguistic Quality Regressor, which focuses on linguistic quality.
Using the Model
For convenience we also provide the run_regressor_bert.py script, which is based on run_edu_bert.py from Cosmopedia. You can modify this script to annotate HuggingFace datasets directly. Cosmopedia also provides Slurm scripts here; we have not included these since we have not had the opportunity to test them.
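If you only want to score a handful of texts, the regressor can also be called directly with transformers. This is a minimal sketch, assuming a checkpoint saved by the training command below; the example sentence is our own:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Checkpoint path taken from the evaluation command below; point this
# at your own copy of the model.
checkpoint = "/home/user/checkpoints/scandinavian_bert/final/"

# We assume the tokenizer is the base model's.
tokenizer = AutoTokenizer.from_pretrained("NbAiLab/nb-bert-base")
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
model.eval()

text = "Fotosyntesen er prosessen der planter omdanner lysenergi til kjemisk energi."
inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    score = model(**inputs).logits.squeeze().item()

# Clamp to the annotated 0-5 range; raw regression output can fall slightly outside it.
print(f"Educational score: {min(max(score, 0.0), 5.0):.2f}")
```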
Training and Evaluation Procedure
The following command was used for training. Please note that train_regressor_bert.py has a few minor changes compared to the original train_edu_bert.py:

```bash
python train_regressor_bert.py --base_model_name="NbAiLab/nb-bert-base" --dataset_name="user/educational-annotations" --target_column="score" --checkpoint_dir="/home/user/checkpoints/scandinavian_bert/"
```
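For orientation, the command roughly corresponds to the Trainer setup sketched below. The "text" column name, "train" split, sequence length, and batch size are assumptions on our part, not values taken from train_regressor_bert.py:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

# Dataset name and target column come from the command above; the "text"
# column and the "train" split are assumptions about its layout.
dataset = load_dataset("user/educational-annotations")
tokenizer = AutoTokenizer.from_pretrained("NbAiLab/nb-bert-base")

def tokenize(batch):
    enc = tokenizer(batch["text"], truncation=True, max_length=512)
    enc["labels"] = [float(s) for s in batch["score"]]  # continuous regression targets
    return enc

tokenized = dataset.map(tokenize, batched=True)

# Same regression head as in the Model Architecture sketch.
model = AutoModelForSequenceClassification.from_pretrained(
    "NbAiLab/nb-bert-base", num_labels=1, problem_type="regression"
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="/home/user/checkpoints/scandinavian_bert/",
        num_train_epochs=20,             # matches the final epoch reported below
        per_device_train_batch_size=32,  # assumed; adjust to your hardware
    ),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorWithPadding(tokenizer),
)
trainer.train()
```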
The following command was used for evaluation:

```bash
python eval_regressor_bert.py --checkpoint_dir="/home/user/checkpoints/scandinavian_bert/final/" --dataset_name="user/educational-annotations"
```
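Although the model predicts a continuous score, the report below evaluates it as a classifier. We assume the evaluation script discretizes predictions by rounding to the nearest integer and clipping to [0, 5], in the style of Cosmopedia's evaluation code; a sketch of that assumed post-processing:

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

def discretize_and_report(continuous_preds, int_labels):
    # Assumed step: round continuous scores to the nearest integer class
    # and clip to the annotated 0-5 range before computing class metrics.
    pred_classes = np.clip(np.round(continuous_preds), 0, 5).astype(int)
    print(classification_report(int_labels, pred_classes))
    print(confusion_matrix(int_labels, pred_classes))
```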
Classification Report
Class | Precision | Recall | F1-Score | Support |
---|---|---|---|---|
0 | 0.78 | 0.70 | 0.74 | 18274 |
1 | 0.67 | 0.75 | 0.71 | 23348 |
2 | 0.49 | 0.47 | 0.48 | 6621 |
3 | 0.47 | 0.26 | 0.33 | 1314 |
4 | 0.60 | 0.07 | 0.12 | 433 |
5 | 0.00 | 0.00 | 0.00 | 10 |
Metric | Precision | Recall | F1-Score | Support |
---|---|---|---|---|
Accuracy | | | 0.68 | 50000 |
Macro Avg | 0.50 | 0.38 | 0.40 | 50000 |
Weighted Avg | 0.68 | 0.68 | 0.67 | 50000 |
Confusion Matrix
Actual \ Predicted | Class 0 | Class 1 | Class 2 | Class 3 | Class 4 | Class 5 |
---|---|---|---|---|---|---|
Class 0 | 12873 | 5327 | 74 | 0 | 0 | 0 |
Class 1 | 3486 | 17582 | 2238 | 41 | 1 | 0 |
Class 2 | 75 | 3244 | 3105 | 197 | 0 | 0 |
Class 3 | 5 | 206 | 746 | 338 | 19 | 0 |
Class 4 | 0 | 45 | 217 | 140 | 30 | 1 |
Class 5 | 0 | 1 | 8 | 1 | 0 | 0 |
Evaluation Metrics
Metric | Value |
---|---|
Eval Loss | 0.2926119863986969 |
Eval Precision | 0.5010686403845288 |
Eval Recall | 0.37549345115259253 |
Eval F1 Macro | 0.39714660593426115 |
Eval Accuracy | 0.67856 |
Eval Runtime (s) | 86.0674
Eval Samples Per Second | 580.94 |
Eval Steps Per Second | 4.543 |
Epoch | 19.91 |
Training Metrics
Metric | Value |
---|---|
Loss | 0.2803 |
Grad Norm | 0.5055287480354309 |
Learning Rate | 5.119453924914675e-07 |
Epoch | 19.97 |
Training Runtime
Metric | Value |
---|---|
Train Runtime (s) | 19555.3448
Train Samples Per Second | 460.232 |
Train Steps Per Second | 1.798 |
Train Loss | 0.29856721191276053 |
Epoch | 20.0 |