---
license: cc-by-sa-4.0
language:
- hr
- bs
- sr
datasets:
- classla/xlm-r-bertic-data
---
# XLM-R-BERTić

This model was produced by pre-training [XLM-Roberta-large](https://huggingface.co/xlm-roberta-large) for an additional 48k steps on South Slavic language data using the [XLM-R-BERTić dataset](https://huggingface.co/datasets/classla/xlm-r-bertic-data).
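As a masked language model, it can be loaded directly with the `transformers` library. A minimal sketch, assuming the checkpoint id `classla/xlm-r-bertic` linked throughout this card (XLM-R-based tokenizers use `<mask>` as the mask token):

```python
from transformers import pipeline

# Fill-mask sketch; "classla/xlm-r-bertic" is the Hub id linked in this card.
# XLM-R-based models use "<mask>" as their mask token.
unmasker = pipeline("fill-mask", model="classla/xlm-r-bertic")
predictions = unmasker("Zagreb je glavni grad <mask>.")
print(predictions[0]["token_str"])
```

For the downstream tasks below, the same checkpoint was fine-tuned with task-specific heads.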

# Benchmarking
Three tasks were chosen for model evaluation:
* Named Entity Recognition (NER)
* Sentiment regression
* COPA (Choice of plausible alternatives)

In all cases, this model was fine-tuned for the specific downstream task.

## NER

Performance was evaluated as the average macro-F1 score over three fine-tuning runs. Datasets used: [hr500k](https://huggingface.co/datasets/classla/hr500k), [ReLDI-sr](https://huggingface.co/datasets/classla/reldi_sr), [ReLDI-hr](https://huggingface.co/datasets/classla/reldi_hr), and [SETimes.SR](https://huggingface.co/datasets/classla/setimes_sr).
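Macro-F1 weights every entity class equally, regardless of frequency. A minimal sketch of the metric and the three-run averaging (the label set and scores are illustrative, not taken from the benchmark):

```python
from statistics import mean

def macro_f1(gold, pred, labels):
    """Per-class F1, averaged with equal weight per class (macro)."""
    f1s = []
    for c in labels:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return mean(f1s)

# Reported scores are the mean over three fine-tuning runs:
run_scores = [0.926, 0.928, 0.927]  # hypothetical per-run macro-F1 values
print(round(mean(run_scores), 3))   # 0.927
```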

| system                                                                 | dataset | F1 score |
|:-----------------------------------------------------------------------|:--------|---------:|
| [**XLM-R-BERTić**](https://huggingface.co/classla/xlm-r-bertic)            | hr500k  |    0.927 |
| [BERTić](https://huggingface.co/classla/bcms-bertic)                   | hr500k  |    0.925 |
| [XLM-R-SloBERTić](https://huggingface.co/classla/xlm-r-slobertic)      | hr500k  |    0.923 |
| [XLM-Roberta-Large](https://huggingface.co/xlm-roberta-large)          | hr500k  |    0.919 |
| [crosloengual-bert](https://huggingface.co/EMBEDDIA/crosloengual-bert) | hr500k  |    0.918 |
| [XLM-Roberta-Base](https://huggingface.co/xlm-roberta-base)            | hr500k  |    0.903 |

| system                                                                 | dataset  | F1 score |
|:-----------------------------------------------------------------------|:---------|---------:|
| [XLM-R-SloBERTić](https://huggingface.co/classla/xlm-r-slobertic)      | ReLDI-hr |    0.812 |
| [**XLM-R-BERTić**](https://huggingface.co/classla/xlm-r-bertic)            | ReLDI-hr |    0.809 |
| [crosloengual-bert](https://huggingface.co/EMBEDDIA/crosloengual-bert) | ReLDI-hr |    0.794 |
| [BERTić](https://huggingface.co/classla/bcms-bertic)                   | ReLDI-hr |    0.792 |
| [XLM-Roberta-Large](https://huggingface.co/xlm-roberta-large)          | ReLDI-hr |    0.791 |
| [XLM-Roberta-Base](https://huggingface.co/xlm-roberta-base)            | ReLDI-hr |    0.763 |

| system                                                                 | dataset    | F1 score |
|:-----------------------------------------------------------------------|:-----------|---------:|
| [XLM-R-SloBERTić](https://huggingface.co/classla/xlm-r-slobertic)      | SETimes.SR |    0.949 |
| [**XLM-R-BERTić**](https://huggingface.co/classla/xlm-r-bertic)            | SETimes.SR |    0.940 |
| [BERTić](https://huggingface.co/classla/bcms-bertic)                   | SETimes.SR |    0.936 |
| [XLM-Roberta-Large](https://huggingface.co/xlm-roberta-large)          | SETimes.SR |    0.933 |
| [crosloengual-bert](https://huggingface.co/EMBEDDIA/crosloengual-bert) | SETimes.SR |    0.922 |
| [XLM-Roberta-Base](https://huggingface.co/xlm-roberta-base)            | SETimes.SR |    0.914 |

| system                                                                 | dataset  | F1 score |
|:-----------------------------------------------------------------------|:---------|---------:|
| [**XLM-R-BERTić**](https://huggingface.co/classla/xlm-r-bertic)            | ReLDI-sr |    0.841 |
| [XLM-R-SloBERTić](https://huggingface.co/classla/xlm-r-slobertic)      | ReLDI-sr |    0.824 |
| [BERTić](https://huggingface.co/classla/bcms-bertic)                   | ReLDI-sr |    0.798 |
| [XLM-Roberta-Large](https://huggingface.co/xlm-roberta-large)          | ReLDI-sr |    0.774 |
| [crosloengual-bert](https://huggingface.co/EMBEDDIA/crosloengual-bert) | ReLDI-sr |    0.751 |
| [XLM-Roberta-Base](https://huggingface.co/xlm-roberta-base)            | ReLDI-sr |    0.734 |

## Sentiment regression

The [ParlaSent dataset](https://huggingface.co/datasets/classla/ParlaSent) was used to evaluate sentiment regression for Bosnian, Croatian, and Serbian.
The procedure is explained in greater detail in the dedicated [benchmarking repository](https://github.com/clarinsi/benchich/tree/main/sentiment).

| system                                                                 | train               | test                     |   r^2 |
|:-----------------------------------------------------------------------|:--------------------|:-------------------------|------:|
| [xlm-r-parlasent](https://huggingface.co/classla/xlm-r-parlasent)      | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.615 |
| [BERTić](https://huggingface.co/classla/bcms-bertic)                   | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.612 |
| [XLM-R-SloBERTić](https://huggingface.co/classla/xlm-r-slobertic)      | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.607 |
| [XLM-Roberta-Large](https://huggingface.co/xlm-roberta-large)          | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.605 |
| [**XLM-R-BERTić**](https://huggingface.co/classla/xlm-r-bertic)            | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.601 |
| [crosloengual-bert](https://huggingface.co/EMBEDDIA/crosloengual-bert) | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.537 |
| [XLM-Roberta-Base](https://huggingface.co/xlm-roberta-base)            | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.500 |
| dummy (mean)                                                           | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | -0.12 |


## COPA

Accuracy was used to evaluate performance on the Serbian (Copa-SR) and Croatian (Copa-HR) COPA datasets.
| system                                                                 | dataset | Accuracy score |
|:-----------------------------------------------------------------------|:--------|---------------:|
| [BERTić](https://huggingface.co/classla/bcms-bertic)                   | Copa-SR |          0.689 |
| [XLM-R-SloBERTić](https://huggingface.co/classla/xlm-r-slobertic)      | Copa-SR |          0.665 |
| [**XLM-R-BERTić**](https://huggingface.co/classla/xlm-r-bertic)            | Copa-SR |          0.637 |
| [crosloengual-bert](https://huggingface.co/EMBEDDIA/crosloengual-bert) | Copa-SR |          0.607 |
| [XLM-Roberta-Base](https://huggingface.co/xlm-roberta-base)            | Copa-SR |          0.573 |
| [XLM-Roberta-Large](https://huggingface.co/xlm-roberta-large)          | Copa-SR |          0.570 |


| system                                                                 | dataset | Accuracy score |
|:-----------------------------------------------------------------------|:--------|---------------:|
| [BERTić](https://huggingface.co/classla/bcms-bertic)                   | Copa-HR |          0.669 |
| [crosloengual-bert](https://huggingface.co/EMBEDDIA/crosloengual-bert) | Copa-HR |          0.669 |
| [**XLM-R-BERTić**](https://huggingface.co/classla/xlm-r-bertic)            | Copa-HR |          0.635 |
| [XLM-R-SloBERTić](https://huggingface.co/classla/xlm-r-slobertic)      | Copa-HR |          0.628 |
| [XLM-Roberta-Base](https://huggingface.co/xlm-roberta-base)            | Copa-HR |          0.585 |
| [XLM-Roberta-Large](https://huggingface.co/xlm-roberta-large)          | Copa-HR |          0.571 |
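COPA is a two-alternative choice task, so random guessing scores around 0.5 and the accuracies above should be read against that baseline. A minimal sketch of the metric (labels are illustrative):

```python
def accuracy(gold, pred):
    """Fraction of items where the chosen alternative matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

# Binary-choice task: chance level is ~0.5.
print(accuracy([0, 1, 1, 0], [0, 1, 0, 0]))  # 0.75
```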



# Citation

Please cite the following paper:
```
@inproceedings{ljubesic-etal-2024-language,
    title = "Language Models on a Diet: Cost-Efficient Development of Encoders for Closely-Related Languages via Additional Pretraining",
    author = "Ljube{\v{s}}i{\'c}, Nikola  and
      Suchomel, V{\'\i}t  and
      Rupnik, Peter  and
      Kuzman, Taja  and
      van Noord, Rik",
    editor = "Melero, Maite  and
      Sakti, Sakriani  and
      Soria, Claudia",
    booktitle = "Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.sigul-1.23",
    pages = "189--203",
}
```