
metadata

license: cc
language:
  - ga
  - mga
pipeline_tag: feature-extraction

Training Data

The Middle Irish FastText models were trained on Middle Irish texts from CELT. A text was included in the training dataset if its CELT metadata explicitly mentioned "Middle Irish" or the dates "900-1200"; this includes texts marked as "Old and Middle Irish" or "Old, Middle and Early Modern Irish". As a result, the Middle Irish models may contain some Old and Early Modern Irish words.

Available Models

There are three models in this family:

Cased: middle_irish_cased_ft_100_5_2.txt
Lowercase: middle_irish_lower_ft_100_5_2.txt
Lowercase with initial mutations removed: middle_irish_lower_demutated_ft_100_5_2.txt

All models were trained with the same hyperparameters (emb_size=100, window=5, min_count=2, n_epochs=100) and are saved as KeyedVectors in plain-text word2vec format (see Gensim Documentation).
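Choosing between the three files can be sketched with a small lookup. This is only an illustration: the variant keys (`cased`, `lower`, `lower_demutated`) and the helpers `model_filename` and `normalize_query` are hypothetical names, not part of the repository. The sketch only lowercases queries; it does not reproduce the mutation-removal step used for the demutated model, since that procedure is not described here.

```python
# Hypothetical mapping from a variant name to the file names listed above.
MODEL_FILES = {
    "cased": "middle_irish_cased_ft_100_5_2.txt",
    "lower": "middle_irish_lower_ft_100_5_2.txt",
    "lower_demutated": "middle_irish_lower_demutated_ft_100_5_2.txt",
}


def model_filename(variant: str) -> str:
    """Return the file name for a variant, or raise ValueError if it is unknown."""
    try:
        return MODEL_FILES[variant]
    except KeyError:
        known = ", ".join(sorted(MODEL_FILES))
        raise ValueError(f"unknown variant {variant!r}; expected one of: {known}") from None


def normalize_query(word: str, variant: str) -> str:
    """Match the query's casing to the chosen model.

    Assumption: lowercasing is enough for the 'lower' model. For the
    'lower_demutated' model the caller must still strip initial mutations,
    which this sketch does not attempt.
    """
    model_filename(variant)  # validates the variant name
    return word if variant == "cased" else word.lower()


print(model_filename("lower"))
print(normalize_query("Temra", "lower"))
```

The returned file name can be passed as the `filename` argument of `hf_hub_download`.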

Usage

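from gensim.models import KeyedVectors
from huggingface_hub import hf_hub_download

# Download the cased model file from the Hugging Face Hub (cached locally after the first call)
model_path = hf_hub_download(repo_id="ancatmara/middle-irish-ft-vectors", filename="middle_irish_cased_ft_100_5_2.txt")
# The file is in plain-text word2vec format, hence binary=False
model = KeyedVectors.load_word2vec_format(model_path, binary=False)

# Ten nearest neighbours of 'Temra' by cosine similarity
model.similar_by_word('Temra')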

Out:

>>> [('Temrach', 0.6949042677879333),
     ('Temraig', 0.6130734086036682),
     ('Temraich', 0.5354859828948975),
     ('Mide', 0.49614325165748596),
     ('Mumam', 0.49278897047042847),
     ('aenach', 0.4891957640647888),
     ('Midi', 0.4783679246902466),
     ('Muman', 0.47727957367897034),
     ('Lagen', 0.4697839319705963),
     ('Erenn', 0.4670616388320923)]
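
The scores in this output are cosine similarities between word vectors. As a rough illustration of what happens under the hood, the sketch below parses the word2vec text format (a `count dim` header followed by one `word v1 ... v_dim` line per word) and ranks neighbours in pure Python. The three toy vectors and the helper names `read_word2vec_text`, `cosine` and `most_similar` are invented for the example and are not taken from the models; Gensim's own implementation is vectorised and much faster.

```python
import io
import math


def read_word2vec_text(stream):
    """Parse word2vec text format into a dict of word -> list of floats."""
    header = stream.readline().split()
    dim = int(header[1])  # header is "<vocab_size> <dimension>"
    vectors = {}
    for line in stream:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(vectors, word, topn=10):
    """Rank every other word by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scored = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy data (made up for illustration): 3 words, 2 dimensions.
TOY = "3 2\nalpha 1.0 0.0\nbeta 0.8 0.6\ngamma 0.0 1.0\n"
vectors = read_word2vec_text(io.StringIO(TOY))
ranked = most_similar(vectors, "alpha")
print(ranked)  # 'beta' ranks above 'gamma'
```

To try this on a real model file, pass an open file handle for the downloaded `.txt` file instead of the `io.StringIO` object.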