NB Education Quality Regressor
Introduction
This model rates the quality of Norwegian training corpora by their educational content. It predicts a continuous score (a float from 0 to 5) for the educational value of a Norwegian text. The model is inspired by the classifiers used in the FineWeb project and is trained mainly on Norwegian content.
Model Architecture
It is trained on top of the nb-bert-base model and utilizes code from Cosmopedia.
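In transformers terms this corresponds to a single-output regression head on top of the encoder. The sketch below shows one way to set that up; it is an illustration, not a copy of the training script's configuration:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NbAiLab/nb-bert-base")
# num_labels=1 with problem_type="regression" gives a single continuous
# output trained with MSE loss instead of a softmax classifier.
model = AutoModelForSequenceClassification.from_pretrained(
    "NbAiLab/nb-bert-base",
    num_labels=1,
    problem_type="regression",
)
```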
Training Data
The dataset used for training is derived from GlotCC and has been annotated using Gemini 1.5 Flash.
Purpose
The performance of large language models (LLMs) heavily depends on the quality and size of their pretraining datasets. This regressor aims to assess and enhance the educational value of Norwegian textual data, contributing to better-performing Norwegian LLMs.
This model is part of a pair; the other is the NB Linguistic Quality Regressor, which focuses on linguistic quality.
Using the Model
For convenience we also provide the run_regressor_bert.py script, which is based on run_edu_bert.py from Cosmopedia. You can modify this script to annotate HuggingFace datasets directly. Cosmopedia also provides Slurm scripts here; we have not included these since we have not had the opportunity to test them.
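If you only want to score a handful of texts, the regressor can also be called directly with transformers. This is a minimal sketch, assuming a checkpoint saved by the training command below; the example sentence is our own:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Checkpoint path taken from the evaluation command below; point this
# at your own copy of the model.
checkpoint = "/home/user/checkpoints/scandinavian_bert/final/"

# We assume the tokenizer is the base model's.
tokenizer = AutoTokenizer.from_pretrained("NbAiLab/nb-bert-base")
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
model.eval()

text = "Fotosyntesen er prosessen der planter omdanner lysenergi til kjemisk energi."
inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    score = model(**inputs).logits.squeeze().item()

# Clamp to the annotated 0-5 range; raw regression output can fall slightly outside it.
print(f"Educational score: {min(max(score, 0.0), 5.0):.2f}")
```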
Training and Evaluation Procedure
The following command was used for training. Please note that train_regressor_bert.py has a few minor changes compared to the original train_edu_bert.py:

```bash
python train_regressor_bert.py --base_model_name="NbAiLab/nb-bert-base" --dataset_name="user/educational-annotations" --target_column="score" --checkpoint_dir="/home/user/checkpoints/scandinavian_bert/"
```
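For orientation, the command roughly corresponds to the Trainer setup sketched below. The "text" column name, "train" split, sequence length, and batch size are assumptions on our part, not values taken from train_regressor_bert.py:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

# Dataset name and target column come from the command above; the "text"
# column and the "train" split are assumptions about its layout.
dataset = load_dataset("user/educational-annotations")
tokenizer = AutoTokenizer.from_pretrained("NbAiLab/nb-bert-base")

def tokenize(batch):
    enc = tokenizer(batch["text"], truncation=True, max_length=512)
    enc["labels"] = [float(s) for s in batch["score"]]  # continuous regression targets
    return enc

tokenized = dataset.map(tokenize, batched=True)

# Same regression head as in the Model Architecture sketch.
model = AutoModelForSequenceClassification.from_pretrained(
    "NbAiLab/nb-bert-base", num_labels=1, problem_type="regression"
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="/home/user/checkpoints/scandinavian_bert/",
        num_train_epochs=20,             # matches the final epoch reported below
        per_device_train_batch_size=32,  # assumed; adjust to your hardware
    ),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorWithPadding(tokenizer),
)
trainer.train()
```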
The following command was used for evaluation:

```bash
python eval_regressor_bert.py --checkpoint_dir="/home/user/checkpoints/scandinavian_bert/final/" --dataset_name="user/educational-annotations"
```
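Although the model predicts a continuous score, the report below evaluates it as a classifier. We assume the evaluation script discretizes predictions by rounding to the nearest integer and clipping to [0, 5], in the style of Cosmopedia's evaluation code; a sketch of that assumed post-processing:

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

def discretize_and_report(continuous_preds, int_labels):
    # Assumed step: round continuous scores to the nearest integer class
    # and clip to the annotated 0-5 range before computing class metrics.
    pred_classes = np.clip(np.round(continuous_preds), 0, 5).astype(int)
    print(classification_report(int_labels, pred_classes))
    print(confusion_matrix(int_labels, pred_classes))
```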
Classification Report
Class | Precision | Recall | F1-Score | Support |
---|---|---|---|---|
0 | 0.78 | 0.70 | 0.74 | 18274 |
1 | 0.67 | 0.75 | 0.71 | 23348 |
2 | 0.49 | 0.47 | 0.48 | 6621 |
3 | 0.47 | 0.26 | 0.33 | 1314 |
4 | 0.60 | 0.07 | 0.12 | 433 |
5 | 0.00 | 0.00 | 0.00 | 10 |
Metric | Precision | Recall | F1-Score | Support |
---|---|---|---|---|
Accuracy | | | 0.68 | 50000 |
Macro Avg | 0.50 | 0.38 | 0.40 | 50000 |
Weighted Avg | 0.68 | 0.68 | 0.67 | 50000 |
Confusion Matrix
Actual \ Predicted | Class 0 | Class 1 | Class 2 | Class 3 | Class 4 | Class 5 |
---|---|---|---|---|---|---|
Class 0 | 12873 | 5327 | 74 | 0 | 0 | 0 |
Class 1 | 3486 | 17582 | 2238 | 41 | 1 | 0 |
Class 2 | 75 | 3244 | 3105 | 197 | 0 | 0 |
Class 3 | 5 | 206 | 746 | 338 | 19 | 0 |
Class 4 | 0 | 45 | 217 | 140 | 30 | 1 |
Class 5 | 0 | 1 | 8 | 1 | 0 | 0 |
Evaluation Metrics
Metric | Value |
---|---|
Eval Loss | 0.2926119863986969 |
Eval Precision | 0.5010686403845288 |
Eval Recall | 0.37549345115259253 |
Eval F1 Macro | 0.39714660593426115 |
Eval Accuracy | 0.67856 |
Eval Runtime (s) | 86.0674
Eval Samples Per Second | 580.94 |
Eval Steps Per Second | 4.543 |
Epoch | 19.91 |
Training Metrics
Metric | Value |
---|---|
Loss | 0.2803 |
Grad Norm | 0.5055287480354309 |
Learning Rate | 5.119453924914675e-07 |
Epoch | 19.97 |
Training Runtime
Metric | Value |
---|---|
Train Runtime (s) | 19555.3448
Train Samples Per Second | 460.232 |
Train Steps Per Second | 1.798 |
Train Loss | 0.29856721191276053 |
Epoch | 20.0 |