---
license: cc-by-sa-4.0
language:
- hr
- bs
- sr
---
# XLM-R-BERTić

This model was produced by pre-training [XLM-Roberta-large](https://huggingface.co/xlm-roberta-large) for 48k steps on South Slavic languages.

# Benchmarking
Three tasks were chosen for model evaluation:
* Named Entity Recognition (NER)
* Sentiment regression
* Choice of Plausible Alternatives (COPA)

  
In all cases, this model was fine-tuned for the specific downstream task.
## NER
Mean F1 scores were used to evaluate performance.

| system                                                                 | dataset | F1 score |
|:-----------------------------------------------------------------------|:--------|---------:|
| **XLM-R-BERTić**                                                       | hr500k  |    0.927 |
| [BERTić](https://huggingface.co/classla/bcms-bertic)                   | hr500k  |    0.925 |
| XLM-R-SloBERTić                                                        | hr500k  |    0.923 |
| XLM-Roberta-Large                                                      | hr500k  |    0.919 |
| [crosloengual-bert](https://huggingface.co/EMBEDDIA/crosloengual-bert) | hr500k  |    0.918 |
| XLM-Roberta-Base                                                       | hr500k  |    0.903 |
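
The card does not say what the mean is taken over. As a minimal sketch, assuming each system was fine-tuned several times and the F1 scores of the individual runs were averaged, the reported number could be computed as follows. The counts below are hypothetical and are not taken from hr500k.

```python
from statistics import mean


def f1_score(true_positives, false_positives, false_negatives):
    """F1 as the harmonic mean of precision and recall, written in count form."""
    denominator = 2 * true_positives + false_positives + false_negatives
    if denominator == 0:
        # No predicted and no gold entities: define F1 as 0.0 to avoid division by zero.
        return 0.0
    return 2 * true_positives / denominator


# Hypothetical (TP, FP, FN) counts for three fine-tuning runs (assumption).
runs = [(90, 10, 10), (80, 20, 20), (85, 15, 15)]

per_run_f1 = [f1_score(tp, fp, fn) for tp, fp, fn in runs]
mean_f1 = mean(per_run_f1)
print(f"per-run F1: {per_run_f1}, mean F1: {mean_f1:.3f}")
```

Whether the F1 is computed per entity span or per token is also not stated here; the sketch only illustrates the averaging step.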


## Sentiment regression

The [ParlaSent dataset](https://huggingface.co/datasets/classla/ParlaSent) was used to evaluate sentiment regression for Bosnian, Croatian, and Serbian.
The procedure is explained in greater detail in the dedicated [benchmarking repository](https://github.com/clarinsi/benchich/tree/main/sentiment).

| system                                                                 | train               | test                     |    R² |
|:-----------------------------------------------------------------------|:--------------------|:-------------------------|------:|
| [xlm-r-parlasent](https://huggingface.co/classla/xlm-r-parlasent)      | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.615 |
| [BERTić](https://huggingface.co/classla/bcms-bertic)                   | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.612 |
|   XLM-R-SloBERTić                                                      | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.607 |
| XLM-Roberta-Large                                                      | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.605 |
| **XLM-R-BERTić**                                                     | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.601 |
| [crosloengual-bert](https://huggingface.co/EMBEDDIA/crosloengual-bert) | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.537 |
| XLM-Roberta-Base                                                       | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.500 |
| dummy (mean)                                                           | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | -0.12 |
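
The negative score in the dummy row may look surprising. A minimal sketch of the coefficient of determination shows how it can arise, assuming the dummy baseline predicts the training-set mean for every test instance (the card does not spell this out). The labels below are hypothetical and are not taken from ParlaSent.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical sentiment labels (assumption, for illustration only).
train_labels = [1.0, 2.0, 2.0, 3.0]
test_labels = [3.0, 4.0, 5.0]

# The dummy system ignores the input and always predicts the training mean.
train_mean = sum(train_labels) / len(train_labels)
dummy_predictions = [train_mean] * len(test_labels)

dummy_r2 = r2_score(test_labels, dummy_predictions)
print(f"dummy R^2 on the test split: {dummy_r2:.2f}")
```

Because the score is measured against the mean of the test labels, a constant prediction taken from the training split scores below zero whenever the two means differ.
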
## COPA


| system                                                                 | dataset | Accuracy score |
|:-----------------------------------------------------------------------|:--------|---------------:|
| [BERTić](https://huggingface.co/classla/bcms-bertic)                   | Copa-SR |          0.689 |
| XLM-R-SloBERTić                                                        | Copa-SR |          0.665 |
| **XLM-R-BERTić**                                                       | Copa-SR |          0.637 |
| [crosloengual-bert](https://huggingface.co/EMBEDDIA/crosloengual-bert) | Copa-SR |          0.607 |
| XLM-Roberta-Base                                                       | Copa-SR |          0.573 |
| XLM-Roberta-Large                                                      | Copa-SR |          0.570 |
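
To help read the accuracy column: each COPA instance offers two alternatives, so random guessing yields an accuracy of about 0.5. The sketch below shows one way accuracy could be computed from per-alternative scores. The function name, the scores, and the gold labels are hypothetical and do not come from the benchmark.

```python
def copa_accuracy(alternative_scores, gold_choices):
    """Share of instances where the higher-scoring alternative is the gold one.

    alternative_scores: list of (score_for_alternative_0, score_for_alternative_1)
    gold_choices: list of gold indices, each 0 or 1
    """
    predictions = [0 if first >= second else 1 for first, second in alternative_scores]
    correct = sum(pred == gold for pred, gold in zip(predictions, gold_choices))
    return correct / len(gold_choices)


# Hypothetical model scores and gold labels (assumption, for illustration only).
example_scores = [(0.9, 0.1), (0.3, 0.7), (0.6, 0.4), (0.2, 0.8)]
example_gold = [0, 1, 1, 1]

accuracy = copa_accuracy(example_scores, example_gold)
print(f"accuracy: {accuracy:.3f}")
```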



# Citation
(to be added soon)
# Authors
* [Nikola Ljubešić](https://huggingface.co/nljubesi)