---
library_name: transformers
license: cc-by-4.0
datasets:
- uonlp/CulturaX
language:
- uk
pipeline_tag: fill-mask
---

# LiBERTa-V2

LiBERTa Large is a BERT-like model pre-trained from scratch exclusively for Ukrainian. It was presented at the [UNLP](https://unlp.org.ua/) workshop @ [LREC-COLING 2024](https://lrec-coling-2024.org/). Further details are available in the paper [LiBERTa: Advancing Ukrainian Language Modeling through Pre-training from Scratch](https://aclanthology.org/2024.unlp-1.14/).

The second version follows the same pre-training procedure; the only changes are whole-word masking, a new tokenizer with a larger vocabulary, and a longer training run.

All the code is available in the [Goader/ukr-lm](https://github.com/Goader/ukr-lm) repository.

## Evaluation

Read the [paper](https://aclanthology.org/2024.unlp-1.14/) for more detailed task descriptions.

| Model                                                                                                                    | NER-UK (Micro F1)        | WikiANN (Micro F1) | UD POS (Accuracy)              | News (Macro F1)                          |
|:------------------------------------------------------------------------------------------------------------------------|:------------------------:|:------------------:|:------------------------------:|:----------------------------------------:|
| <tr><td colspan="5" style="text-align: center;"><strong>Base Models</strong></td></tr>
| [xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base)                                                  | 90.86 (0.81)             | 92.27 (0.09)       | 98.45 (0.07)                   | -                                        |
| [roberta-base-wechsel-ukrainian](https://huggingface.co/benjamin/roberta-base-wechsel-ukrainian)                        | 90.81 (1.51)             | 92.98 (0.12)       | 98.57 (0.03)                   | -                                        |
| [electra-base-ukrainian-cased-discriminator](https://huggingface.co/lang-uk/electra-base-ukrainian-cased-discriminator) | 90.43 (1.29)             | 92.99 (0.11)       | 98.59 (0.06)                   | -                                        |
| <tr><td colspan="5" style="text-align: center;"><strong>Large Models</strong></td></tr>
| [xlm-roberta-large](https://huggingface.co/FacebookAI/xlm-roberta-large)                                                | 90.16 (2.98)             | 92.92 (0.19)       | 98.71 (0.04)                   | 95.13 (0.49)                             |
| [roberta-large-wechsel-ukrainian](https://huggingface.co/benjamin/roberta-large-wechsel-ukrainian)                      | 91.24 (1.16)             | __93.22 (0.17)__   | 98.74 (0.06)                   | __96.48 (0.09)__                         |
| [liberta-large](https://huggingface.co/Goader/liberta-large)                                                            | 91.27 (1.22)             | 92.50 (0.07)       | 98.62 (0.08)                   | 95.44 (0.04)                             |
| [liberta-large-v2](https://huggingface.co/Goader/liberta-large-v2)                                                      | __91.73 (1.81)__         | __93.22 (0.14)__   | __98.79 (0.06)__               | 95.67 (0.12)                             |


## Fine-Tuning Hyperparameters

| Hyperparameter | Value |
|:---------------|:-----:|
| Peak Learning Rate  | 3e-5   |
| Warm-up Ratio       | 0.05   |
| Learning Rate Decay | Linear |
| Batch Size          | 16     |
| Epochs              | 10     |
| Weight Decay        | 0.05   |


## How to Get Started with the Model

Use the code below to get started with the model. Note that the repository contains custom tokenization code, so `trust_remote_code=True` is required when loading the tokenizer or the pipeline:

Pipeline usage:

```python
>>> from transformers import pipeline

>>> fill_mask = pipeline("fill-mask", "Goader/liberta-large-v2", trust_remote_code=True)
>>> fill_mask("Тарас Шевченко - один з найвизначніших <mask> України.")

[{'score': 0.37743982672691345,
  'token': 23179,
  'token_str': 'поетів',
  'sequence': 'Тарас Шевченко - один з найвизначніших поетів України.'},
 {'score': 0.3221002519130707,
  'token': 12095,
  'token_str': 'письменників',
  'sequence': 'Тарас Шевченко - один з найвизначніших письменників України.'},
 {'score': 0.05367676541209221,
  'token': 17491,
  'token_str': 'художників',
  'sequence': 'Тарас Шевченко - один з найвизначніших художників України.'},
 {'score': 0.04778451472520828,
  'token': 17124,
  'token_str': 'синів',
  'sequence': 'Тарас Шевченко - один з найвизначніших синів України.'},
 {'score': 0.04660917446017265,
  'token': 1354,
  'token_str': 'людей',
  'sequence': 'Тарас Шевченко - один з найвизначніших людей України.'}]
```

Extracting embeddings:

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Goader/liberta-large-v2", trust_remote_code=True)
model = AutoModel.from_pretrained("Goader/liberta-large-v2")

encoded = tokenizer('Тарас Шевченко - один з найвизначніших поетів України.', return_tensors='pt')

output = model(**encoded)
```
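
The snippet above returns per-token hidden states in `output.last_hidden_state`. If a single sentence-level vector is needed, one common option (a sketch, not part of the original recipe) is mean pooling over non-padding tokens:

```python
# Continues from the snippet above: `encoded` and `output` are reused.
# Average the final hidden states over non-padding tokens to get one
# fixed-size vector per input sentence.
mask = encoded["attention_mask"].unsqueeze(-1).type_as(output.last_hidden_state)
sentence_embedding = (output.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)  # e.g. torch.Size([1, 1024]) for the large model
```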

## Citation

```bibtex
@inproceedings{haltiuk-smywinski-pohl-2024-liberta,
    title = "{L}i{BERT}a: Advancing {U}krainian Language Modeling through Pre-training from Scratch",
    author = "Haltiuk, Mykola  and
      Smywi{\'n}ski-Pohl, Aleksander",
    editor = "Romanyshyn, Mariana  and
      Romanyshyn, Nataliia  and
      Hlybovets, Andrii  and
      Ignatenko, Oleksii",
    booktitle = "Proceedings of the Third Ukrainian Natural Language Processing Workshop (UNLP) @ LREC-COLING 2024",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.unlp-1.14",
    pages = "120--128",
    abstract = "Recent advancements in Natural Language Processing (NLP) have spurred remarkable progress in language modeling, predominantly benefiting English. While Ukrainian NLP has long grappled with significant challenges due to limited data and computational resources, recent years have seen a shift with the emergence of new corpora, marking a pivotal moment in addressing these obstacles. This paper introduces LiBERTa Large, the inaugural BERT Large model pre-trained entirely from scratch only on Ukrainian texts. Leveraging extensive multilingual text corpora, including a substantial Ukrainian subset, LiBERTa Large establishes a foundational resource for Ukrainian NLU tasks. Our model outperforms existing multilingual and monolingual models pre-trained from scratch for Ukrainian, demonstrating competitive performance against those relying on cross-lingual transfer from English. This achievement underscores our ability to achieve superior performance through pre-training from scratch with additional enhancements, obviating the need to rely on decisions made for English models to efficiently transfer weights. We establish LiBERTa Large as a robust baseline, paving the way for future advancements in Ukrainian language modeling.",
}
```

## Licence

CC-BY 4.0

## Authors

The model was trained by Mykola Haltiuk as part of his Master's thesis under the supervision of Aleksander Smywiński-Pohl, PhD, at AGH University of Krakow.