fasttext_uk / README.md
metadata
license: mit
tags:
  - feature-extraction
library_name: generic
datasets:
  - ubertext2.0
widget:
  - text: доброго вечора ми з україни

skipgram.uk.300.bin contains pre-trained word vectors for the Ukrainian language, trained with fastText on the (as yet unreleased) UberText2.0 dataset, which was collected and processed by the lang-uk project. The model was trained with the skipgram architecture in dimension 300, with character n-grams of length 2 to 5 and 15 negative samples.
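
The configuration described above maps onto the fastText Python API roughly as follows. This is a minimal sketch, not the authors' training script: the helper name `train_vectors` and the corpus path are hypothetical, and every hyperparameter not stated above (learning rate, epochs, window size, minimum word count) is assumed to stay at the library default.

```python
# Hyperparameters stated in this README; everything else is left at the
# fastText defaults, which is an assumption rather than a documented fact.
TRAINING_PARAMS = {
    "model": "skipgram",  # skipgram rather than cbow
    "dim": 300,           # vector dimension
    "minn": 2,            # shortest character n-gram
    "maxn": 5,            # longest character n-gram
    "neg": 15,            # number of negative samples
}


def train_vectors(corpus_path, output_path="skipgram.uk.300.bin"):
    """Train skipgram vectors on a plain-text corpus and save the binary model."""
    # Imported lazily so the parameters can be inspected without fastText installed.
    import fasttext

    model = fasttext.train_unsupervised(corpus_path, **TRAINING_PARAMS)
    model.save_model(output_path)
    return model


# Example call ("corpus.uk.txt" is a hypothetical plain-text corpus file):
# train_vectors("corpus.uk.txt")
```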

On the word analogy task, our model improves accuracy by 6.3% over the Facebook Ukrainian word vectors. The dataset for Ukrainian word analogy is available here.
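
For readers unfamiliar with the word analogy task, the sketch below shows one common way to score it, the 3CosAdd rule: for a question a : b :: c : ?, the predicted word is the vocabulary item whose vector is closest (by cosine similarity) to b - a + c, excluding the three query words. The toy two-dimensional vectors are invented purely for illustration, and this is not necessarily the exact protocol behind the figure above.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def solve_analogy(vectors, a, b, c):
    """Answer 'a is to b as c is to ?' using the 3CosAdd rule."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three query words are excluded from the candidate answers.
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


def analogy_accuracy(vectors, questions):
    """Fraction of (a, b, c, expected) questions answered correctly."""
    correct = sum(solve_analogy(vectors, a, b, c) == d for a, b, c, d in questions)
    return correct / len(questions)


# Toy vectors, made up for illustration only.
TOY_VECTORS = {
    "король": [1.0, 1.0],     # king
    "королева": [1.0, -1.0],  # queen
    "чоловік": [0.2, 1.0],    # man
    "жінка": [0.2, -1.0],     # woman
    "місто": [-1.0, 0.1],     # city (distractor)
}
QUESTIONS = [("чоловік", "жінка", "король", "королева")]

print(analogy_accuracy(TOY_VECTORS, QUESTIONS))
```

With the real model, the `vectors` mapping could be filled by calling `get_word_vector` for each word in the analogy dataset's vocabulary.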

Extrinsic evaluations were performed on two sequence labeling tasks: NER and POS tagging. The NER-UK dataset was released by the lang-uk project, and the Ukrainian (UD) corpus was developed by the non-profit organization Institute for Ukrainian.

Results:

  1. spaCy NER F-score: 0.818
  2. POS Flair Accuracy: 0.824
  3. POS spaCy Accuracy: 0.911

Usage

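import fasttext

# Load the binary model (the file must be present in the working directory).
ft = fasttext.load_model('skipgram.uk.300.bin')
# Returns the word's vector as a NumPy array.
ft.get_word_vector('привіт')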
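
Beyond looking up a single vector, a loaded model can also report its dimension and list a word's nearest neighbours. The sketch below wraps the fastText calls `get_dimension` and `get_nearest_neighbors`. The helper name `describe_word` is hypothetical, and the commented usage assumes the model file has already been downloaded.

```python
def describe_word(model, word, k=5):
    """Return the vector dimension and the k nearest neighbours of `word`."""
    # fastText returns a list of (similarity, neighbour) pairs.
    neighbours = model.get_nearest_neighbors(word, k=k)
    return {
        "dimension": model.get_dimension(),
        "neighbours": [(w, round(score, 3)) for score, w in neighbours],
    }


# Usage with the real model (requires the model file on disk):
# import fasttext
# ft = fasttext.load_model('skipgram.uk.300.bin')
# print(describe_word(ft, 'привіт'))
```

Because the model was trained with character n-grams, `get_word_vector` also returns a vector for words that did not occur in the training data, composed from their subword vectors.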

Citation

@inproceedings{romanyshyn-etal-2023-learning,
    title = "Learning Word Embeddings for {U}krainian: A Comparative Study of Fast{T}ext Hyperparameters",
    author = "Romanyshyn, Nataliia  and
      Chaplynskyi, Dmytro  and
      Zakharov, Kyrylo",
    booktitle = "Proceedings of the Second Ukrainian Natural Language Processing Workshop",
    month = may,
    year = "2023",
    address = "Dubrovnik, Croatia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.unlp-1.3",
    pages = "20--31",
}

Copyright: Dmytro Chaplynskyi, lang-uk project, Nataliia Romanyshyn, Ukrainian Catholic University, 2022