---
license: cc-by-sa-4.0
language:
- hr
- bs
- sr
datasets:
- classla/xlm-r-bertic-data
---
# XLM-R-BERTić
This model was produced by further pre-training [XLM-Roberta-large](https://huggingface.co/xlm-roberta-large) for 48k steps on South Slavic languages, using the [XLM-R-BERTić dataset](https://huggingface.co/datasets/classla/xlm-r-bertic-data).
# Benchmarking
Three tasks were chosen for model evaluation:
* Named Entity Recognition (NER)
* Sentiment regression
* COPA (Choice of Plausible Alternatives)

In all cases, the model was fine-tuned on the specific downstream task.
## NER
Average macro-F1 scores from three runs were used to evaluate performance. Datasets used: [hr500k](https://huggingface.co/datasets/classla/hr500k), [ReLDI-sr](https://huggingface.co/datasets/classla/reldi_sr), [ReLDI-hr](https://huggingface.co/datasets/classla/reldi_hr), and [SETimes.SR](https://huggingface.co/datasets/classla/setimes_sr).

| system | dataset | F1 score |
|:-----------------------------------------------------------------------|:--------|---------:|
| [**XLM-R-BERTić**](https://huggingface.co/classla/xlm-r-bertic) | hr500k | 0.927 |
| [BERTić](https://huggingface.co/classla/bcms-bertic) | hr500k | 0.925 |
| [XLM-R-SloBERTić](https://huggingface.co/classla/xlm-r-slobertic) | hr500k | 0.923 |
| [XLM-Roberta-Large](https://huggingface.co/xlm-roberta-large) | hr500k | 0.919 |
| [crosloengual-bert](https://huggingface.co/EMBEDDIA/crosloengual-bert) | hr500k | 0.918 |
| [XLM-Roberta-Base](https://huggingface.co/xlm-roberta-base) | hr500k | 0.903 |

| system | dataset | F1 score |
|:-----------------------------------------------------------------------|:---------|---------:|
| [XLM-R-SloBERTić](https://huggingface.co/classla/xlm-r-slobertic) | ReLDI-hr | 0.812 |
| [**XLM-R-BERTić**](https://huggingface.co/classla/xlm-r-bertic) | ReLDI-hr | 0.809 |
| [crosloengual-bert](https://huggingface.co/EMBEDDIA/crosloengual-bert) | ReLDI-hr | 0.794 |
| [BERTić](https://huggingface.co/classla/bcms-bertic) | ReLDI-hr | 0.792 |
| [XLM-Roberta-Large](https://huggingface.co/xlm-roberta-large) | ReLDI-hr | 0.791 |
| [XLM-Roberta-Base](https://huggingface.co/xlm-roberta-base) | ReLDI-hr | 0.763 |

| system | dataset | F1 score |
|:-----------------------------------------------------------------------|:-----------|---------:|
| [XLM-R-SloBERTić](https://huggingface.co/classla/xlm-r-slobertic) | SETimes.SR | 0.949 |
| [**XLM-R-BERTić**](https://huggingface.co/classla/xlm-r-bertic) | SETimes.SR | 0.940 |
| [BERTić](https://huggingface.co/classla/bcms-bertic) | SETimes.SR | 0.936 |
| [XLM-Roberta-Large](https://huggingface.co/xlm-roberta-large) | SETimes.SR | 0.933 |
| [crosloengual-bert](https://huggingface.co/EMBEDDIA/crosloengual-bert) | SETimes.SR | 0.922 |
| [XLM-Roberta-Base](https://huggingface.co/xlm-roberta-base) | SETimes.SR | 0.914 |

| system | dataset | F1 score |
|:-----------------------------------------------------------------------|:---------|---------:|
| [**XLM-R-BERTić**](https://huggingface.co/classla/xlm-r-bertic) | ReLDI-sr | 0.841 |
| [XLM-R-SloBERTić](https://huggingface.co/classla/xlm-r-slobertic) | ReLDI-sr | 0.824 |
| [BERTić](https://huggingface.co/classla/bcms-bertic) | ReLDI-sr | 0.798 |
| [XLM-Roberta-Large](https://huggingface.co/xlm-roberta-large) | ReLDI-sr | 0.774 |
| [crosloengual-bert](https://huggingface.co/EMBEDDIA/crosloengual-bert) | ReLDI-sr | 0.751 |
| [XLM-Roberta-Base](https://huggingface.co/xlm-roberta-base) | ReLDI-sr | 0.734 |
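The metric used above (macro-F1, averaged over three fine-tuning runs) can be sketched with scikit-learn. The gold labels and per-run predictions below are illustrative placeholders in the usual IOB2 NER scheme, not actual benchmark outputs:

```python
from statistics import mean
from sklearn.metrics import f1_score

# Placeholder gold labels and predictions from three hypothetical runs
# (not actual benchmark outputs).
gold = ["O", "B-PER", "I-PER", "O", "B-LOC", "O"]
runs = [
    ["O", "B-PER", "I-PER", "O", "B-LOC", "O"],
    ["O", "B-PER", "O",     "O", "B-LOC", "O"],
    ["O", "B-PER", "I-PER", "O", "O",     "O"],
]

# Macro-F1 weights every label class equally; averaging over runs
# smooths out fine-tuning variance, as in the tables above.
scores = [f1_score(gold, pred, average="macro", zero_division=0) for pred in runs]
print(round(mean(scores), 3))
```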
## Sentiment regression
[ParlaSent dataset](https://huggingface.co/datasets/classla/ParlaSent) was used to evaluate sentiment regression for Bosnian, Croatian, and Serbian languages.
The procedure is explained in greater detail in the dedicated [benchmarking repository](https://github.com/clarinsi/benchich/tree/main/sentiment).

| system | train | test | r^2 |
|:-----------------------------------------------------------------------|:--------------------|:-------------------------|------:|
| [xlm-r-parlasent](https://huggingface.co/classla/xlm-r-parlasent) | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.615 |
| [BERTić](https://huggingface.co/classla/bcms-bertic) | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.612 |
| [XLM-R-SloBERTić](https://huggingface.co/classla/xlm-r-slobertic) | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.607 |
| [XLM-Roberta-Large](https://huggingface.co/xlm-roberta-large) | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.605 |
| [**XLM-R-BERTić**](https://huggingface.co/classla/xlm-r-bertic) | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.601 |
| [crosloengual-bert](https://huggingface.co/EMBEDDIA/crosloengual-bert) | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.537 |
| [XLM-Roberta-Base](https://huggingface.co/xlm-roberta-base) | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.500 |
| dummy (mean) | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | -0.12 |
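The r^2 metric and the dummy (mean) baseline row can be illustrated with scikit-learn. The sentiment scores below are placeholder values, not actual ParlaSent annotations; the point is that a constant predictor fitted on the training mean can score below zero on a held-out test set, which is why the dummy row is negative:

```python
from sklearn.metrics import r2_score

# Placeholder sentiment scores (not actual ParlaSent data).
train_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
test_gold = [2.0, 2.5, 4.5, 5.5]
model_preds = [2.2, 2.4, 4.3, 5.2]

# r^2 of a (hypothetical) regressor's predictions.
print(r2_score(test_gold, model_preds))

# Dummy baseline: always predict the training-set mean.
train_mean = sum(train_scores) / len(train_scores)
dummy_preds = [train_mean] * len(test_gold)
# When the training mean differs from the test mean, r^2 dips
# below zero, as in the "dummy (mean)" row above.
print(r2_score(test_gold, dummy_preds))
```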
## COPA
| system | dataset | Accuracy score |
|:-----------------------------------------------------------------------|:--------|---------------:|
| [BERTić](https://huggingface.co/classla/bcms-bertic) | Copa-SR | 0.689 |
| [XLM-R-SloBERTić](https://huggingface.co/classla/xlm-r-slobertic) | Copa-SR | 0.665 |
| [**XLM-R-BERTić**](https://huggingface.co/classla/xlm-r-bertic) | Copa-SR | 0.637 |
| [crosloengual-bert](https://huggingface.co/EMBEDDIA/crosloengual-bert) | Copa-SR | 0.607 |
| [XLM-Roberta-Base](https://huggingface.co/xlm-roberta-base) | Copa-SR | 0.573 |
| [XLM-Roberta-Large](https://huggingface.co/xlm-roberta-large) | Copa-SR | 0.570 |

| system | dataset | Accuracy score |
|:-----------------------------------------------------------------------|:--------|---------------:|
| [BERTić](https://huggingface.co/classla/bcms-bertic) | Copa-HR | 0.669 |
| [crosloengual-bert](https://huggingface.co/EMBEDDIA/crosloengual-bert) | Copa-HR | 0.669 |
| [**XLM-R-BERTić**](https://huggingface.co/classla/xlm-r-bertic) | Copa-HR | 0.635 |
| [XLM-R-SloBERTić](https://huggingface.co/classla/xlm-r-slobertic) | Copa-HR | 0.628 |
| [XLM-Roberta-Base](https://huggingface.co/xlm-roberta-base) | Copa-HR | 0.585 |
| [XLM-Roberta-Large](https://huggingface.co/xlm-roberta-large) | Copa-HR | 0.571 |
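COPA evaluation reduces to a binary choice: for each premise the fine-tuned model scores two candidate alternatives, the higher-scoring one is chosen, and accuracy is the fraction of premises where that choice matches the gold alternative. A minimal sketch, with placeholder scores standing in for model outputs:

```python
# Illustrative plausibility scores for the two alternatives per premise
# (placeholder values; a fine-tuned model would produce these).
examples = [
    {"scores": (0.8, 0.3), "gold": 0},  # alternative 0 is correct
    {"scores": (0.2, 0.9), "gold": 1},
    {"scores": (0.6, 0.7), "gold": 0},  # model picks 1 -> wrong
    {"scores": (0.1, 0.4), "gold": 1},
]

# Pick the higher-scoring alternative and compare with gold.
correct = sum(
    max(range(2), key=lambda i: ex["scores"][i]) == ex["gold"]
    for ex in examples
)
accuracy = correct / len(examples)
print(accuracy)  # 3 of 4 choices correct
```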
# Citation
Please cite the following paper:
```
@inproceedings{ljubesic-etal-2024-language,
title = "Language Models on a Diet: Cost-Efficient Development of Encoders for Closely-Related Languages via Additional Pretraining",
author = "Ljube{\v{s}}i{\'c}, Nikola and
Suchomel, V{\'\i}t and
Rupnik, Peter and
Kuzman, Taja and
van Noord, Rik",
editor = "Melero, Maite and
Sakti, Sakriani and
Soria, Claudia",
booktitle = "Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024",
month = may,
year = "2024",
address = "Torino, Italia",
publisher = "ELRA and ICCL",
url = "https://aclanthology.org/2024.sigul-1.23",
pages = "189--203",
}
```