File size: 2,126 Bytes
40810e6
 
6209f90
 
 
3fdd67d
 
 
 
40810e6
6209f90
b4337ba
3fdd67d
 
 
5a725d3
3fdd67d
 
5a725d3
 
 
3fdd67d
6209f90
 
 
5a725d3
d076f6c
6209f90
5a725d3
1149dad
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
eb2ba6d
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
---
license: mit
tags:
- feature-extraction
library_name: generic
datasets:
- ubertext2.0
widget:
- text: "доброго вечора ми з україни"
---

**skipgram.uk.300.bin** is pre-trained word vectors for the Ukrainian language, trained with fastText on (yet unreleased) UberText2.0 dataset, collected and processed by the [lang-uk](https://lang.org.ua/en/). This model was trained using skipgram in dimension 300, with character n-grams range of 2-5, and 15 negative samples. 

Our model increases Accuracy by 6.3% compared to the [Facebook Ukrainian word vectors](https://fasttext.cc/docs/en/crawl-vectors.html) on the word analogy task. The dataset for Ukrainian word analogy is available [here](https://github.com/lang-uk/vecs/).

Extrinsic evaluations were performed on two sequence labeling tasks: NER and POS tagging. [NER-UK dataset](https://github.com/lang-uk/ner-uk) was released by the [lang-uk](https://lang.org.ua/), and [Ukrainian (UD) corpus](https://github.com/brown-uk/corpus) was developed by a [non-profit organization Institute for Ukrainian](https://mova.institute/).

Results:
1) spaCy NER F-score: **0.818**
2) POS Flair Accuracy: **0.824**
3) POS spaCy Accuracy: **0.911**

Usage
```
import fasttext.util
ft = fasttext.load_model('skipgram.uk.300.bin')
ft.get_word_vector('привіт')
```

```bibtex
@inproceedings{romanyshyn-etal-2023-learning,
    title = "Learning Word Embeddings for {U}krainian: A Comparative Study of Fast{T}ext Hyperparameters",
    author = "Romanyshyn, Nataliia  and
      Chaplynskyi, Dmytro  and
      Zakharov, Kyrylo",
    booktitle = "Proceedings of the Second Ukrainian Natural Language Processing Workshop",
    month = may,
    year = "2023",
    address = "Dubrovnik, Croatia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.unlp-1.3",
    pages = "20--31",
}
```

Copyright: [Dmytro Chaplynskyi](https://twitter.com/dchaplinsky), [lang-uk](https://lang.org.ua) project, [Nataliia Romanyshyn](https://github.com/romanyshyn-natalia), [Ukrainian Catholic University](https://ucu.edu.ua/en/), 2022