ancatmara committed on
Commit
6669f98
1 Parent(s): 5c15de6

Create README.md

Files changed (1)
  1. README.md +47 -0
README.md ADDED
@@ -0,0 +1,47 @@
1
+ ---
2
+ license: cc
3
+ language:
4
+ - ga
5
+ - mga
6
+ pipeline_tag: feature-extraction
7
+ ---
8
+
9
+ ### Training Data
10
+
11
+ The models were trained on Middle Irish texts from [CELT](https://celt.ucc.ie/publishd.html). A text was included in the training dataset if "Middle Irish" or the dates "900-1200" were explicitly mentioned in its CELT metadata. This includes texts marked as "Old and Middle Irish" or "Old, Middle and Early Modern Irish", so the Middle Irish models may contain some Old and Early Modern Irish words.
12
+
13
+ ### Available Models
14
+
15
+ There are three models in this family:
16
+
17
+ - **Cased**: `middle_irish_cased_ft_100_5_2.txt`
18
+ - **Lowercase**: `middle_irish_lower_ft_100_5_2.txt`
19
+ - **Lowercase with initial mutations removed**: `middle_irish_lower_demutated_ft_100_5_2.txt`
20
+
21
+ All models were trained with the same hyperparameters (`emb_size=100, window=5, min_count=2, n_epochs=100`) and are saved as `KeyedVectors` (see the [Gensim documentation](https://radimrehurek.com/gensim/models/keyedvectors.html)).
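Because the three files differ only in how the text was preprocessed, a query word probably needs the same preprocessing as the model it is looked up in. The sketch below maps a variant label to the filenames listed above and lowercases queries for the two lowercase models. The variant labels (`cased`, `lower`, `lower_demutated`) and both helper functions are hypothetical names introduced for illustration. It also assumes that the lowercase models store lowercased keys, and it does not attempt to strip initial mutations, since the exact demutation rules are not described here.

```python
# Filenames come from the model list above; the dictionary keys are
# hypothetical labels chosen for this sketch.
MODEL_FILES = {
    "cased": "middle_irish_cased_ft_100_5_2.txt",
    "lower": "middle_irish_lower_ft_100_5_2.txt",
    "lower_demutated": "middle_irish_lower_demutated_ft_100_5_2.txt",
}


def model_filename(variant: str) -> str:
    """Return the vector file for a variant label, or raise ValueError."""
    try:
        return MODEL_FILES[variant]
    except KeyError:
        expected = ", ".join(sorted(MODEL_FILES))
        raise ValueError(f"unknown variant {variant!r}; expected one of: {expected}") from None


def normalize_query(word: str, variant: str) -> str:
    """Match the casing of a query word to the chosen model variant.

    Assumption: the lowercase models contain lowercased keys only.
    Initial mutations are NOT removed here, even for 'lower_demutated'.
    """
    model_filename(variant)  # validates the variant label
    return word if variant == "cased" else word.lower()


if __name__ == "__main__":
    print(model_filename("lower"))
    print(normalize_query("Temra", "lower"))
```

The value returned by `model_filename` can be passed as the `filename` argument of `hf_hub_download`, and the normalized word can then be used for the lookup.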
22
+
23
+ ### Usage
24
+
25
+ ```python
26
+ from gensim.models import KeyedVectors
27
+ from huggingface_hub import hf_hub_download
28
+
29
+ model_path = hf_hub_download(repo_id="ancatmara/middle-irish-ft-vectors", filename="middle_irish_cased_ft_100_5_2.txt")
30
+ model = KeyedVectors.load_word2vec_format(model_path, binary=False)
31
+
32
+ model.similar_by_word('Temra')
33
+ ```
34
+
35
+ Output:
36
+ ```python
37
+ >>> [('Temrach', 0.6949042677879333),
38
+ ('Temraig', 0.6130734086036682),
39
+ ('Temraich', 0.5354859828948975),
40
+ ('Mide', 0.49614325165748596),
41
+ ('Mumam', 0.49278897047042847),
42
+ ('aenach', 0.4891957640647888),
43
+ ('Midi', 0.4783679246902466),
44
+ ('Muman', 0.47727957367897034),
45
+ ('Lagen', 0.4697839319705963),
46
+ ('Erenn', 0.4670616388320923)]
47
+ ```
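The scores in the output are cosine similarities between the query vector and every other vector in the vocabulary. To make the file format and the ranking more concrete, here is a minimal pure-Python sketch that reads the word2vec text format (a header line with vocabulary size and dimension, then one word and its values per line) and ranks neighbours by cosine similarity. The toy vectors and the function names are invented for illustration and are not taken from the released models. In practice Gensim does this far more efficiently.

```python
import io
import math

# Toy data invented for this sketch; NOT values from the released models.
TOY_VECTORS = """3 2
a 1.0 0.0
b 0.8 0.6
c 0.0 1.0
"""


def read_word2vec_text(handle):
    """Parse word2vec text format: 'vocab_size dim', then 'word v1 ... v_dim'."""
    header = handle.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in handle:
        parts = line.rstrip("\n").split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    if len(vectors) != vocab_size:
        raise ValueError(f"header promised {vocab_size} vectors, found {len(vectors)}")
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(vectors, word, topn=10):
    """Rank all other words by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scored = [(other, cosine(query, vec)) for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


vectors = read_word2vec_text(io.StringIO(TOY_VECTORS))
ranking = most_similar(vectors, "a")

if __name__ == "__main__":
    for neighbour, score in ranking:
        print(neighbour, round(score, 3))
```

Since the files hold a fixed vocabulary of full-word vectors, a word that is absent from the file has no entry to look up, so it is worth checking membership before querying.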