---
license: cc-by-sa-4.0
language:
- hr
- bs
- sr
---

# XLM-R-BERTić

This model was produced by pre-training [XLM-Roberta-large](https://huggingface.co/xlm-roberta-large) for 48k steps on South Slavic languages.

# Benchmarking

Three tasks were chosen for model evaluation:

* Named Entity Recognition (NER)
* Sentiment regression
* COPA (Choice of Plausible Alternatives)

In all cases, the model was fine-tuned for the specific downstream task.

## NER

Mean F1 scores were used to evaluate performance.

| system | dataset | F1 score |
|:-----------------------------------------------------------------------|:--------|---------:|
| **XLM-R-BERTić** (this model) | hr500k | 0.927 |
| [BERTić](https://huggingface.co/classla/bcms-bertic) | hr500k | 0.925 |
| XLM-R-SloBERTić | hr500k | 0.923 |
| XLM-Roberta-Large | hr500k | 0.919 |
| [crosloengual-bert](https://huggingface.co/EMBEDDIA/crosloengual-bert) | hr500k | 0.918 |
| XLM-Roberta-Base | hr500k | 0.903 |

| system | dataset | F1 score |
|:-----------------------------------------------------------------------|:---------|---------:|
| XLM-R-SloBERTić | ReLDI-hr | 0.812 |
| **XLM-R-BERTić** (this model) | ReLDI-hr | 0.809 |
| [crosloengual-bert](https://huggingface.co/EMBEDDIA/crosloengual-bert) | ReLDI-hr | 0.794 |
| [BERTić](https://huggingface.co/classla/bcms-bertic) | ReLDI-hr | 0.792 |
| XLM-Roberta-Large | ReLDI-hr | 0.791 |
| XLM-Roberta-Base | ReLDI-hr | 0.763 |

| system | dataset | F1 score |
|:-----------------------------------------------------------------------|:-----------|---------:|
| XLM-R-SloBERTić | SETimes.SR | 0.949 |
| **XLM-R-BERTić** (this model) | SETimes.SR | 0.940 |
| [BERTić](https://huggingface.co/classla/bcms-bertic) | SETimes.SR | 0.936 |
| XLM-Roberta-Large | SETimes.SR | 0.933 |
| [crosloengual-bert](https://huggingface.co/EMBEDDIA/crosloengual-bert) | SETimes.SR | 0.922 |
| XLM-Roberta-Base | SETimes.SR | 0.914 |

| system | dataset | F1 score |
|:-----------------------------------------------------------------------|:---------|---------:|
| **XLM-R-BERTić** (this model) | ReLDI-sr | 0.841 |
| XLM-R-SloBERTić | ReLDI-sr | 0.824 |
| [BERTić](https://huggingface.co/classla/bcms-bertic) | ReLDI-sr | 0.798 |
| XLM-Roberta-Large | ReLDI-sr | 0.774 |
| [crosloengual-bert](https://huggingface.co/EMBEDDIA/crosloengual-bert) | ReLDI-sr | 0.751 |
| XLM-Roberta-Base | ReLDI-sr | 0.734 |

## Sentiment regression

The [ParlaSent dataset](https://huggingface.co/datasets/classla/ParlaSent) was used to evaluate sentiment regression for Bosnian, Croatian, and Serbian. The procedure is explained in greater detail in the dedicated [benchmarking repository](https://github.com/clarinsi/benchich/tree/main/sentiment).

| system | train | test | R² |
|:-----------------------------------------------------------------------|:--------------------|:-------------------------|------:|
| [xlm-r-parlasent](https://huggingface.co/classla/xlm-r-parlasent) | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.615 |
| [BERTić](https://huggingface.co/classla/bcms-bertic) | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.612 |
| XLM-R-SloBERTić | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.607 |
| XLM-Roberta-Large | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.605 |
| **XLM-R-BERTić** (this model) | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.601 |
| [crosloengual-bert](https://huggingface.co/EMBEDDIA/crosloengual-bert) | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.537 |
| XLM-Roberta-Base | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.500 |
| dummy (mean) | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | -0.12 |

## COPA

| system | dataset | Accuracy score |
|:-----------------------------------------------------------------------|:--------|---------------:|
| [BERTić](https://huggingface.co/classla/bcms-bertic) | Copa-SR | 0.689 |
| XLM-R-SloBERTić | Copa-SR | 0.665 |
| **XLM-R-BERTić** (this model) | Copa-SR | 0.637 |
| [crosloengual-bert](https://huggingface.co/EMBEDDIA/crosloengual-bert) | Copa-SR | 0.607 |
| XLM-Roberta-Base | Copa-SR | 0.573 |
| XLM-Roberta-Large | Copa-SR | 0.570 |

| system | dataset | Accuracy score |
|:-----------------------------------------------------------------------|:--------|---------------:|
| [BERTić](https://huggingface.co/classla/bcms-bertic) | Copa-HR | 0.669 |
| XLM-R-SloBERTić | Copa-HR | 0.628 |
| **XLM-R-BERTić** (this model) | Copa-HR | 0.635 |
| [crosloengual-bert](https://huggingface.co/EMBEDDIA/crosloengual-bert) | Copa-HR | 0.669 |
| XLM-Roberta-Base | Copa-HR | 0.585 |
| XLM-Roberta-Large | Copa-HR | 0.571 |

# Citation

(to be added soon)

# Authors

* [Nikola Ljubešić](https://huggingface.co/nljubesi)
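
# Note on the sentiment regression metric

The sentiment regression table reports a negative score for the dummy (mean) baseline. The sketch below illustrates how that can happen, assuming the score is the usual coefficient of determination and that the dummy predicts the mean of the training labels for every test instance (the exact procedure is in the benchmarking repository and may differ). The label values and the helper name `r2_score` are hypothetical and chosen only for illustration; they are not taken from ParlaSent.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Squared error of the predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared error of always predicting the mean of y_true itself.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical toy labels (assumption: not real ParlaSent data).
train_labels = [1.0, 2.0, 3.0, 4.0]
test_labels = [2.0, 3.0, 4.0, 5.0]

# Dummy baseline (assumption): predict the training mean everywhere.
train_mean = sum(train_labels) / len(train_labels)
dummy_preds = [train_mean] * len(test_labels)

dummy_r2 = r2_score(test_labels, dummy_preds)
perfect_r2 = r2_score(test_labels, test_labels)

print(f"dummy R^2:   {dummy_r2:.3f}")
print(f"perfect R^2: {perfect_r2:.3f}")
```

Under these assumptions, a constant prediction can only reach zero when it equals the mean of the test labels exactly. Whenever the training and test means differ, the score drops below zero, so a slightly negative baseline is expected and does not indicate an error.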