# CNEC_xlm-roberta-large
This model is a fine-tuned version of FacebookAI/xlm-roberta-large on the cnec dataset. It achieves the following results on the evaluation set:
- Loss: 0.1471
- Precision: 0.8567
- Recall: 0.9047
- F1: 0.8800
- Accuracy: 0.9772
## Model description
The entity labels are defined as follows (a label-mapping sketch is given after this list):
- 'O' = Outside of a named entity
- 'B-A' = Beginning of a complex address number (postal codes, street numbers, even phone numbers)
- 'I-A' = Inside of a number in the address
- 'B-G' = Beginning of a geographical name
- 'I-G' = Inside of a geographical name
- 'B-I' = Beginning of an institution name
- 'I-I' = Inside of an institution name
- 'B-M' = Beginning of a media name (email, server, website, tv series, etc.)
- 'I-M' = Inside of a media name
- 'B-O' = Beginning of an artifact name (books, old movies, etc.)
- 'I-O' = Inside of an artifact name
- 'B-P' = Beginning of a person's name
- 'I-P' = Inside of a person's name
- 'B-T' = Beginning of a time expression
- 'I-T' = Inside of a time expression
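
For convenience, the 15 labels (7 supertypes as B-/I- pairs plus 'O') can be written out as an explicit mapping. The sketch below is illustrative only; the label order is an assumption and may differ from the id2label stored in this model's config.

```python
# Illustrative label mapping for the 7 CNEC supertypes plus 'O'.
# The ordering is an assumption; check the model's config.json for the actual id2label.
label_list = [
    "O",
    "B-A", "I-A",
    "B-G", "I-G",
    "B-I", "I-I",
    "B-M", "I-M",
    "B-O", "I-O",
    "B-P", "I-P",
    "B-T", "I-T",
]
id2label = dict(enumerate(label_list))
label2id = {label: idx for idx, label in enumerate(label_list)}
```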
## Intended uses & limitations
CNEC (the Czech Named Entity Corpus) is a named entity recognition dataset for the Czech language. This model was trained on an edited version of the corpus that keeps only the 7 entity supertypes plus one non-entity ('O') label.
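
A minimal inference sketch using the transformers pipeline API is shown below. The repository id is an assumption (substitute the actual Hub id of this checkpoint), and the example sentence is only illustrative.

```python
from transformers import pipeline

# Minimal usage sketch; the model id below is an assumption,
# replace it with the actual Hub repository id of this checkpoint.
ner = pipeline(
    "token-classification",
    model="stulcrad/CNEC_xlm-roberta-large",
    aggregation_strategy="simple",  # merge B-/I- word pieces into whole entities
)

print(ner("Václav Havel se narodil 5. října 1936 v Praze."))
# Expected entity groups (per the labels above): P (person), T (time), G (geographical name)
```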
## Training and evaluation data
The model was trained with increased dropout rates: hidden_dropout_prob = 0.2 and attention_probs_dropout_prob = 0.15 (a configuration sketch follows).
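
The sketch below shows one way these dropout values could be applied when fine-tuning the base model; num_labels=15 follows from the 7 B-/I- pairs plus 'O'. This is standard transformers usage, not the author's original training script.

```python
from transformers import AutoConfig, AutoModelForTokenClassification

# Sketch only: raise the dropout rates on the base config before fine-tuning.
config = AutoConfig.from_pretrained(
    "FacebookAI/xlm-roberta-large",
    num_labels=15,                       # 7 supertypes as B-/I- pairs + 'O'
    hidden_dropout_prob=0.2,             # increased from the default 0.1
    attention_probs_dropout_prob=0.15,   # increased from the default 0.1
)
model = AutoModelForTokenClassification.from_pretrained(
    "FacebookAI/xlm-roberta-large",
    config=config,
)
```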
## Training procedure
### Training hyperparameters
The following hyperparameters were used during training (a TrainingArguments sketch follows the list):
- learning_rate: 2e-05
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- weight_decay: 0.01
- num_epochs: 10
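
These hyperparameters map directly onto transformers TrainingArguments. The sketch below is a reconstruction rather than the original training script: output_dir, evaluation_strategy, and eval_steps are assumptions (eval_steps=500 is inferred from the step column in the results table), and the Adam betas/epsilon listed above are the TrainingArguments defaults.

```python
from transformers import TrainingArguments

# Reconstruction of the listed hyperparameters; output_dir and the
# evaluation settings are assumptions, not taken from the original run.
training_args = TrainingArguments(
    output_dir="CNEC_xlm-roberta-large",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=10,
    weight_decay=0.01,
    lr_scheduler_type="linear",
    seed=42,
    evaluation_strategy="steps",
    eval_steps=500,   # inferred from the 500-step intervals in the results table
)
```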
### Training results
| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1     | Accuracy |
|:-------------:|:-----:|:----:|:---------------:|:---------:|:------:|:------:|:--------:|
| 0.2836        | 1.12  | 500  | 0.1341          | 0.7486    | 0.8467 | 0.7946 | 0.9649   |
| 0.1160        | 2.24  | 1000 | 0.1048          | 0.7866    | 0.8655 | 0.8242 | 0.9734   |
| 0.0832        | 3.36  | 1500 | 0.1066          | 0.7967    | 0.8734 | 0.8333 | 0.9746   |
| 0.0577        | 4.47  | 2000 | 0.1112          | 0.8408    | 0.8834 | 0.8616 | 0.9753   |
| 0.0445        | 5.59  | 2500 | 0.1378          | 0.8384    | 0.8883 | 0.8627 | 0.9751   |
| 0.0337        | 6.71  | 3000 | 0.1272          | 0.8505    | 0.8978 | 0.8735 | 0.9770   |
| 0.0250        | 7.83  | 3500 | 0.1447          | 0.8462    | 0.9007 | 0.8726 | 0.9760   |
| 0.0191        | 8.95  | 4000 | 0.1471          | 0.8567    | 0.9047 | 0.8800 | 0.9772   |
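
Precision, recall, and F1 in the table are entity-level scores, while accuracy is token-level, which is the usual output of a seqeval-based compute_metrics function. The sketch below shows that standard setup; it assumes the evaluate and seqeval packages and the label_list defined earlier, and is not taken from the original training code.

```python
import numpy as np
import evaluate

seqeval = evaluate.load("seqeval")  # assumes the evaluate + seqeval packages

def compute_metrics(eval_preds):
    """Entity-level precision/recall/F1 plus token accuracy (standard sketch)."""
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    # Ignore positions labeled -100 (special tokens / padding from label alignment).
    true_labels = [
        [label_list[l] for l in label_row if l != -100]
        for label_row in labels
    ]
    true_predictions = [
        [label_list[p] for p, l in zip(pred_row, label_row) if l != -100]
        for pred_row, label_row in zip(predictions, labels)
    ]
    results = seqeval.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
```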
### Framework versions
- Transformers 4.36.2
- Pytorch 2.1.2+cu121
- Datasets 2.16.1
- Tokenizers 0.15.0