# XLM-R-BERTić
This model was produced by pre-training XLM-Roberta-large for 48k steps on South Slavic languages using the XLM-R-BERTić dataset.
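A minimal usage sketch with the Hugging Face `transformers` library is shown below; the repository id `classla/xlm-r-bertic` and the example sentence are assumptions for illustration.

```python
# A minimal sketch: loading the model for fill-mask inference.
# The repository id "classla/xlm-r-bertic" is an assumption here.
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

tokenizer = AutoTokenizer.from_pretrained("classla/xlm-r-bertic")
model = AutoModelForMaskedLM.from_pretrained("classla/xlm-r-bertic")

# XLM-R models use "<mask>" as the mask token.
fill = pipeline("fill-mask", model=model, tokenizer=tokenizer)
for prediction in fill("Zagreb je glavni grad <mask>."):
    print(prediction["token_str"], round(prediction["score"], 3))
```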
## Benchmarking
Three tasks were chosen for model evaluation:
- Named Entity Recognition (NER)
- Sentiment regression
- COPA (Choice of Plausible Alternatives)
In all cases, the model was fine-tuned for the specific downstream task.
### NER
Average macro-F1 scores from three runs were used to evaluate performance. Datasets used: hr500k, ReLDI-sr, ReLDI-hr, and SETimes.SR.
system | dataset | F1 score |
---|---|---|
XLM-R-BERTić | hr500k | 0.927 |
BERTić | hr500k | 0.925 |
XLM-R-SloBERTić | hr500k | 0.923 |
XLM-Roberta-Large | hr500k | 0.919 |
crosloengual-bert | hr500k | 0.918 |
XLM-Roberta-Base | hr500k | 0.903 |
system | dataset | F1 score |
---|---|---|
XLM-R-SloBERTić | ReLDI-hr | 0.812 |
XLM-R-BERTić | ReLDI-hr | 0.809 |
crosloengual-bert | ReLDI-hr | 0.794 |
BERTić | ReLDI-hr | 0.792 |
XLM-Roberta-Large | ReLDI-hr | 0.791 |
XLM-Roberta-Base | ReLDI-hr | 0.763 |
system | dataset | F1 score |
---|---|---|
XLM-R-SloBERTić | SETimes.SR | 0.949 |
XLM-R-BERTić | SETimes.SR | 0.940 |
BERTić | SETimes.SR | 0.936 |
XLM-Roberta-Large | SETimes.SR | 0.933 |
crosloengual-bert | SETimes.SR | 0.922 |
XLM-Roberta-Base | SETimes.SR | 0.914 |
system | dataset | F1 score |
---|---|---|
XLM-R-BERTić | ReLDI-sr | 0.841 |
XLM-R-SloBERTić | ReLDI-sr | 0.824 |
BERTić | ReLDI-sr | 0.798 |
XLM-Roberta-Large | ReLDI-sr | 0.774 |
crosloengual-bert | ReLDI-sr | 0.751 |
XLM-Roberta-Base | ReLDI-sr | 0.734 |
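As a rough illustration of the NER setup evaluated above, the sketch below puts a token-classification head on the encoder; the tag set is an assumption, and the head is randomly initialized until fine-tuned on one of the datasets listed.

```python
# A hedged sketch of the NER setup: encoder + token-classification head.
# The tag set below is an assumption; predictions are meaningless until the
# model is fine-tuned on hr500k, ReLDI-hr, ReLDI-sr, or SETimes.SR.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]
tokenizer = AutoTokenizer.from_pretrained("classla/xlm-r-bertic")
model = AutoModelForTokenClassification.from_pretrained(
    "classla/xlm-r-bertic",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={l: i for i, l in enumerate(labels)},
)

inputs = tokenizer("Ivana živi u Zagrebu.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits              # (1, seq_len, num_labels)
for tok, pred in zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]),
                     logits.argmax(-1)[0].tolist()):
    print(tok, model.config.id2label[pred])
```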
### Sentiment regression
The ParlaSent dataset was used to evaluate sentiment regression for Bosnian, Croatian, and Serbian. The procedure is explained in greater detail in the dedicated benchmarking repository.
system | train | test | r^2 |
---|---|---|---|
xlm-r-parlasent | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.615 |
BERTić | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.612 |
XLM-R-SloBERTić | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.607 |
XLM-Roberta-Large | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.605 |
XLM-R-BERTić | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.601 |
crosloengual-bert | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.537 |
XLM-Roberta-Base | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.500 |
dummy (mean) | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | -0.12 |
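A minimal sketch of the regression setup follows, assuming a single-output head trained with MSE loss (`problem_type="regression"` in `transformers`); the example sentence is illustrative.

```python
# A minimal sketch of sentiment regression: one regression output on top of
# the encoder. The head is untrained until fine-tuned on ParlaSent.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("classla/xlm-r-bertic")
model = AutoModelForSequenceClassification.from_pretrained(
    "classla/xlm-r-bertic", num_labels=1, problem_type="regression"
)

inputs = tokenizer("Ovaj prijedlog zakona je izvrstan.", return_tensors="pt")
with torch.no_grad():
    score = model(**inputs).logits.squeeze().item()
print(score)
```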
### COPA
system | dataset | Accuracy score |
---|---|---|
BERTić | Copa-SR | 0.689 |
XLM-R-SloBERTić | Copa-SR | 0.665 |
XLM-R-BERTić | Copa-SR | 0.637 |
crosloengual-bert | Copa-SR | 0.607 |
XLM-Roberta-Base | Copa-SR | 0.573 |
XLM-Roberta-Large | Copa-SR | 0.570 |
system | dataset | Accuracy score |
---|---|---|
BERTić | Copa-HR | 0.669 |
crosloengual-bert | Copa-HR | 0.669 |
XLM-R-BERTić | Copa-HR | 0.635 |
XLM-R-SloBERTić | Copa-HR | 0.628 |
XLM-Roberta-Base | Copa-HR | 0.585 |
XLM-Roberta-Large | Copa-HR | 0.571 |
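COPA is commonly cast as a multiple-choice task: each alternative is paired with the premise, and the pair with the higher score wins. The sketch below follows that assumption; the exact pairing used in the benchmark may differ.

```python
# A hedged sketch of scoring one COPA item as multiple choice; the
# premise/alternative pairing shown is an assumption, not the exact
# benchmark configuration.
import torch
from transformers import AutoTokenizer, AutoModelForMultipleChoice

tokenizer = AutoTokenizer.from_pretrained("classla/xlm-r-bertic")
model = AutoModelForMultipleChoice.from_pretrained("classla/xlm-r-bertic")

premise = "Čovjek je otvorio kišobran."             # "The man opened an umbrella."
choices = ["Počela je kiša.", "Izašlo je sunce."]   # two alternatives

# Encode (premise, choice) pairs, then add a batch axis:
# the model expects input of shape (batch, num_choices, seq_len).
enc = tokenizer([premise] * len(choices), choices,
                return_tensors="pt", padding=True)
inputs = {k: v.unsqueeze(0) for k, v in enc.items()}
with torch.no_grad():
    logits = model(**inputs).logits                 # (1, num_choices)
print(choices[logits.argmax(-1).item()])            # untrained head: fine-tune first
```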
## Citation
Please cite the following paper:
```bibtex
@inproceedings{ljubesic-etal-2024-language,
    title = "Language Models on a Diet: Cost-Efficient Development of Encoders for Closely-Related Languages via Additional Pretraining",
    author = "Ljube{\v{s}}i{\'c}, Nikola and
      Suchomel, V{\'\i}t and
      Rupnik, Peter and
      Kuzman, Taja and
      van Noord, Rik",
    editor = "Melero, Maite and
      Sakti, Sakriani and
      Soria, Claudia",
    booktitle = "Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.sigul-1.23",
    pages = "189--203",
}
```