ancatmara/middle-irish-ft-vectors

Feature Extraction

Middle Irish (900-1200)

ancatmara committed on Aug 14

Commit 6669f98 • 1 Parent(s): 5c15de6

Create README.md

Files changed (1)

README.md +47 -0

	@@ -0,0 +1,47 @@

+---
+license: cc
+language:
+- ga
+- mga
+pipeline_tag: feature-extraction
+---
+### Training Data
+The models were trained on Middle Irish texts from [CELT](https://celt.ucc.ie/publishd.html). A text was included in the training dataset if "Middle Irish" or the dates "900-1200" were explicitly mentioned in its CELT metadata; this includes texts marked as "Old and Middle Irish" or "Old, Middle and Early Modern Irish". As a result, the Middle Irish models may contain some Old and Early Modern Irish words.
+### Available Models
+There are 3 models in this family:
+- **Cased**: `middle_irish_cased_ft_100_5_2.txt`
+- **Lowercase**: `middle_irish_lower_ft_100_5_2.txt`
+- **Lowercase with initial mutations removed**: `middle_irish_lower_demutated_ft_100_5_2.txt`
+All models are trained with the same hyperparameters (`emb_size=100, window=5, min_count=2, n_epochs=100`) and saved as `KeyedVectors` (see [Gensim Documentation](https://radimrehurek.com/gensim/models/keyedvectors.html)).
+### Usage
+```python
+from gensim.models import KeyedVectors
+from huggingface_hub import hf_hub_download
+model_path = hf_hub_download(repo_id="ancatmara/middle-irish-ft-vectors", filename="middle_irish_cased_ft_100_5_2.txt")
+model = KeyedVectors.load_word2vec_format(model_path, binary=False)
+model.similar_by_word('Temra')
+```
+Out:
+```python
+>>> [('Temrach', 0.6949042677879333),
+     ('Temraig', 0.6130734086036682),
+     ('Temraich', 0.5354859828948975),
+     ('Mide', 0.49614325165748596),
+     ('Mumam', 0.49278897047042847),
+     ('aenach', 0.4891957640647888),
+     ('Midi', 0.4783679246902466),
+     ('Muman', 0.47727957367897034),
+     ('Lagen', 0.4697839319705963),
+     ('Erenn', 0.4670616388320923)]
+```
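
A note for anyone querying these vectors (not part of the committed README): `KeyedVectors.load_word2vec_format` returns a fixed vocabulary, and with `min_count=2` the rarest forms are presumably absent from it, so looking up a missing word raises `KeyError`. A minimal sketch of a guard, assuming Gensim 4.x; `safe_similar` is a hypothetical helper name and exists neither in the repository nor in Gensim:

```python
def safe_similar(model, word, topn=10):
    """Return the nearest neighbours of `word`, or [] if it is out of vocabulary.

    `model` is expected to be a gensim KeyedVectors instance (gensim >= 4.0,
    where the vocabulary is exposed as the `key_to_index` mapping).
    `safe_similar` is an illustrative helper, not part of gensim.
    """
    # Plain KeyedVectors cannot build vectors for unseen words,
    # so check membership before querying.
    if word not in model.key_to_index:
        return []
    return model.similar_by_word(word, topn=topn)
```

With the model loaded as in the Usage section, `safe_similar(model, 'Temra')` should return the same list as `model.similar_by_word('Temra')`, while an unseen spelling yields an empty list. For the lowercase variants, lowercasing the query first (`word.lower()`) is probably needed as well.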