---
license: cc-by-sa-4.0
language:
- hr
- bs
- sr
---

# XLM-R-BERTić

This model was produced by pre-training [XLM-Roberta-large](https://huggingface.co/xlm-roberta-large) for 48k steps on South Slavic languages.

# Benchmarking

Three tasks were chosen for model evaluation:

* Named Entity Recognition (NER)
* Sentiment regression
* COPA (Choice of plausible alternatives)

In all cases, this model was fine-tuned for the specific downstream task.

## NER

Mean F1 scores were used to evaluate performance.

| system | dataset | F1 score |
|:-----------------------------------------------------------------------|:-------|------:|
| **XLM-R-BERTić** | hr500k | 0.927 |
| [BERTić](https://huggingface.co/classla/bcms-bertic) | hr500k | 0.925 |
| XLM-R-SloBERTić | hr500k | 0.923 |
| XLM-Roberta-Large | hr500k | 0.919 |
| [crosloengual-bert](https://huggingface.co/EMBEDDIA/crosloengual-bert) | hr500k | 0.918 |
| XLM-Roberta-Base | hr500k | 0.903 |

## Sentiment regression

The [ParlaSent dataset](https://huggingface.co/datasets/classla/ParlaSent) was used to evaluate sentiment regression for Bosnian, Croatian, and Serbian. The procedure is explained in greater detail in the dedicated [benchmarking repository](https://github.com/clarinsi/benchich/tree/main/sentiment).

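
For readers unfamiliar with the r^2 metric reported for this task, the following is a minimal sketch of how the coefficient of determination can be computed, and why a baseline such as the `dummy (mean)` row can score below zero. It assumes the dummy system predicts the mean of the training labels for every test item, which the card does not state explicitly. The toy labels and the helper name `r2_score` are hypothetical and are not taken from ParlaSent or the benchmarking repository.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: spread of the true labels around their own mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical toy labels, NOT ParlaSent data.
train_labels = [1.0, 2.0, 3.0]
test_labels = [3.0, 4.0, 5.0]

# Assumed behaviour of a "dummy (mean)" baseline:
# predict the training-set mean for every test item.
train_mean = sum(train_labels) / len(train_labels)
dummy_preds = [train_mean] * len(test_labels)

dummy_r2 = r2_score(test_labels, dummy_preds)      # negative: worse than the test mean
perfect_r2 = r2_score(test_labels, test_labels)    # 1.0 for perfect predictions
print(dummy_r2, perfect_r2)
```

Because the training mean differs from the test mean, the constant prediction does worse than predicting the test mean itself, so r^2 drops below zero. Under the stated assumption, a slightly negative score for a mean baseline is expected rather than a sign of an error.
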
| system | train | test | r^2 |
|:-----------------------------------------------------------------------|:--------------------|:-------------------------|------:|
| [xlm-r-parlasent](https://huggingface.co/classla/xlm-r-parlasent) | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.615 |
| [BERTić](https://huggingface.co/classla/bcms-bertic) | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.612 |
| XLM-R-SloBERTić | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.607 |
| XLM-Roberta-Large | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.605 |
| **XLM-R-BERTić** | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.601 |
| [crosloengual-bert](https://huggingface.co/EMBEDDIA/crosloengual-bert) | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.537 |
| XLM-Roberta-Base | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.500 |
| dummy (mean) | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | -0.12 |

## COPA

| system | dataset | Accuracy score |
|:-----------------------------------------------------------------------|:--------|------:|
| [BERTić](https://huggingface.co/classla/bcms-bertic) | Copa-SR | 0.689 |
| XLM-R-SloBERTić | Copa-SR | 0.665 |
| **XLM-R-BERTić** | Copa-SR | 0.637 |
| [crosloengual-bert](https://huggingface.co/EMBEDDIA/crosloengual-bert) | Copa-SR | 0.607 |
| XLM-Roberta-Base | Copa-SR | 0.573 |
| XLM-Roberta-Large | Copa-SR | 0.570 |

# Citation

(to be added soon)

# Authors

* [Nikola Ljubešić](https://huggingface.co/nljubesi)