---
license: cc-by-nc-sa-4.0
language:
- ga
- mga
- la
pipeline_tag: feature-extraction
library_name: gensim
---

### Training Data

**Middle Irish FastText models** were trained on Middle Irish texts from [CELT](https://celt.ucc.ie/publishd.html). A text was included in the training dataset if "Middle Irish" or the dates "900-1200" were explicitly mentioned in its metadata on CELT. This includes texts marked as "Old and Middle Irish" or "Old, Middle and Early Modern Irish". As a result, the vocabulary of the Middle Irish models can contain some Old and Early Modern Irish words, as well as some Latin due to code-switching.

### Available Models

There are three models in this family:

- **Cased**, 70 402 words: `middle_irish_cased_ft_100_5_2.txt`
- **Lowercase**, 66 213 words: `middle_irish_lower_ft_100_5_2.txt`
- **Lowercase with initial mutations removed**, 60 094 words: `middle_irish_lower_demutated_ft_100_5_2.txt`

All models are trained with the same hyperparameters (`emb_size=100, window=5, min_count=2, n_epochs=100`) and saved as `KeyedVectors` (see [Gensim Documentation](https://radimrehurek.com/gensim/models/keyedvectors.html)).
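
Since the files are loaded with `load_word2vec_format(..., binary=False)`, they should follow the word2vec text layout, whose first line holds the vocabulary size and the vector dimensionality. The sketch below reads only that header, which gives a cheap way to check a downloaded file against the word counts and `emb_size` listed above. The helper name `read_word2vec_header` is hypothetical and not part of Gensim.

```python
from pathlib import Path


def read_word2vec_header(path):
    """Return (vocab_size, vector_size) from the first line of a word2vec text file.

    Assumes the standard word2vec text layout: a header line with two
    integers, followed by one "word v1 v2 ... vN" line per vocabulary entry.
    """
    with Path(path).open(encoding="utf-8") as handle:
        first_line = handle.readline()
    vocab_size, vector_size = (int(field) for field in first_line.split())
    return vocab_size, vector_size
```

Called on a local copy of one of the `.txt` files, it returns a tuple that can be compared with the figures in the list above without loading the vectors into memory.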

### Usage

```python
from gensim.models import KeyedVectors
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(repo_id="ancatmara/middle-irish-ft-vectors", filename="middle_irish_cased_ft_100_5_2.txt")
model = KeyedVectors.load_word2vec_format(model_path, binary=False)

model.similar_by_word('Temra')
```

Out:
```python
>>> [('Temrach', 0.6949042677879333),
     ('Temraig', 0.6130734086036682),
     ('Temraich', 0.5354859828948975),
     ('Mide', 0.49614325165748596),
     ('Mumam', 0.49278897047042847),
     ('aenach', 0.4891957640647888),
     ('Midi', 0.4783679246902466),
     ('Muman', 0.47727957367897034),
     ('Lagen', 0.4697839319705963),
     ('Erenn', 0.4670616388320923)]
```
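
Vectors loaded with `load_word2vec_format` are plain `KeyedVectors`, so a query for a word that is not in the vocabulary raises a `KeyError` instead of falling back to subword information. The following is a minimal sketch of a guarded lookup. The helper name `lookup_similar` and its `lowercase` flag are hypothetical, and it only lowercases the query for the lowercase models. It does not remove initial mutations, so queries for the demutated model would need that preprocessing done separately.

```python
def lookup_similar(model, word, lowercase=False, topn=10):
    """Return nearest neighbours of `word`, or an empty list if it is unknown.

    `model` is expected to behave like gensim's KeyedVectors, i.e. to expose
    `key_to_index` and `similar_by_word`. Set `lowercase=True` when querying
    one of the lowercase models.
    """
    query = word.lower() if lowercase else word
    if query not in model.key_to_index:
        # Plain KeyedVectors cannot build vectors for unseen words.
        return []
    return model.similar_by_word(query, topn=topn)
```

With the cased model loaded as above, `lookup_similar(model, 'Temra')` gives the same neighbours as the direct call, while an unknown word yields `[]` rather than an exception.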