---
license: cc-by-4.0
---

# Clean ConceptNet Data for All Languages

## Data Details

For our project on [Retrofitting GloVe embeddings for Low Resource Languages](https://github.com/pyRis/retrofitting-embeddings-lrls/tree/main?tab=readme-ov-file), we extracted all data from the [ConceptNet](https://github.com/commonsense/conceptnet5/wiki/Downloads) database for 304 languages. The extraction involved several steps to clean and analyze the data from the official ConceptNet dump, available [here](https://s3.amazonaws.com/conceptnet/downloads/2019/edges/conceptnet-assertions-5.7.0.csv.gz).

The final extracted dataset, available in a separate [HuggingFace repo](https://huggingface.co/datasets/DGurgurov/conceptnet_all), was used to train the graph embeddings: we compute PPMI over the co-occurrence statistics between words and then apply SVD to the resulting PPMI matrix.
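
The PPMI-then-SVD idea can be illustrated on a toy example. This is a minimal sketch, not the pipeline used to produce these embeddings: the helper names `ppmi_matrix` and `svd_embeddings`, the toy co-occurrence counts, and the choice to scale the singular vectors by the square root of the singular values are all assumptions made for illustration.

```python
import numpy as np

def ppmi_matrix(counts):
    """Positive pointwise mutual information of a co-occurrence count matrix."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    row = counts.sum(axis=1, keepdims=True)
    col = counts.sum(axis=0, keepdims=True)
    expected = row * col / total  # counts expected under independence
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(counts / expected)
    pmi[~np.isfinite(pmi)] = 0.0  # zero counts carry no association
    return np.maximum(pmi, 0.0)   # clip negative PMI to zero

def svd_embeddings(ppmi, dim):
    """Keep the top `dim` singular directions as dense word vectors."""
    u, s, _ = np.linalg.svd(ppmi, full_matrices=False)
    # Scaling by sqrt(s) is one common convention (an assumption here).
    return u[:, :dim] * np.sqrt(s[:dim])

# Toy symmetric co-occurrence counts (hypothetical data).
words = ["cat", "dog", "animal", "car"]
counts = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
])

ppmi = ppmi_matrix(counts)
vectors = svd_embeddings(ppmi, dim=2)
for word, vec in zip(words, vectors):
    print(word, vec)
```

In practice the co-occurrence matrix for a full vocabulary is large and sparse, so a sparse truncated SVD would likely be preferable to the dense `np.linalg.svd` call used in this sketch.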

We generate graph embeddings for 72 languages present in both CC100 and ConceptNet.

### Dataset Structure

Each file is a plain-text (`.txt`) file in which every line contains a word or phrase followed by its embedding, with all fields separated by single spaces.

Use the following function to read in the embeddings:

```python
import numpy as np

def read_embeddings_from_text(file_path, embedding_size=300):
    """Read embeddings from a txt file into a {phrase: vector} dict."""
    embeddings = {}
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            parts = line.strip().split(' ')
            # The last `embedding_size` fields are the vector; everything
            # before them is the (possibly multi-word) phrase.
            embedding_start_index = len(parts) - embedding_size
            phrase = ' '.join(parts[:embedding_start_index])
            embedding = np.array([float(val) for val in parts[embedding_start_index:]])
            embeddings[phrase] = embedding
    return embeddings
```
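
Once loaded, the dictionary of embeddings can be queried directly, for example to rank entries by cosine similarity. The sketch below is illustrative only: `nearest_neighbors` is a hypothetical helper, and the three-dimensional toy vectors stand in for the dictionary that `read_embeddings_from_text` would return.

```python
import numpy as np

def nearest_neighbors(embeddings, query, top_k=3):
    """Rank all other entries by cosine similarity to `query`."""
    q = embeddings[query]
    q_norm = np.linalg.norm(q)
    scores = []
    for phrase, vec in embeddings.items():
        if phrase == query:
            continue
        denom = q_norm * np.linalg.norm(vec)
        # Guard against zero vectors to avoid division by zero.
        sim = float(q @ vec / denom) if denom > 0 else 0.0
        scores.append((phrase, sim))
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:top_k]

# Toy stand-in for a loaded embedding dictionary (hypothetical values).
toy = {
    "cat": np.array([1.0, 0.0, 0.0]),
    "kitten": np.array([0.9, 0.1, 0.0]),
    "car": np.array([0.0, 0.0, 1.0]),
}

neighbors = nearest_neighbors(toy, "cat", top_k=2)
print(neighbors)
```

For the larger vocabularies in the table below, stacking the vectors into a single matrix and computing all similarities with one matrix product would be considerably faster than this per-entry loop.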

### Language Details

| Language Code | Language Name     | Vocabulary Size |
| ---           | ---               | ---             |
| af            | Afrikaans         | 12973          |
| sc            | Sardinian         | 573            |
| yo            | Yoruba            | 2283           |
| gn            | Guarani           | 131            |
| qu            | Quechua           | 5156           |
| li            | Limburgish        | 485            |
| ln            | Lingala           | 4109           |
| wo            | Wolof             | 1196           |
| zu            | Zulu              | 2758           |
| rm            | Romansh           | 3919           |
| ht            | Haitian Creole    | 2699           |
| su            | Sundanese         | 2514           |
| br            | Breton            | 11665          |
| gd            | Scottish Gaelic   | 14418          |
| xh            | Xhosa             | 2504           |
| mg            | Malagasy          | 26575          |
| jv            | Javanese          | 4919           |
| fy            | Frisian           | 7608           |
| sa            | Sanskrit          | 5789           |
| my            | Burmese           | 4875           |
| ug            | Uyghur            | 998            |
| yi            | Yiddish           | 8054           |
| or            | Oriya             | 109            |
| ha            | Hausa             | 802            |
| la            | Latin             | 848943         |
| sd            | Sindhi            | 143            |
| so            | Somali            | 593            |
| ku            | Kurdish           | 9737           |
| pa            | Punjabi           | 4488           |
| ps            | Pashto            | 1087           |
| ga            | Irish             | 29459          |
| am            | Amharic           | 1909           |
| km            | Khmer             | 3466           |
| uz            | Uzbek             | 5224           |
| ky            | Kyrgyz            | 3574           |
| cy            | Welsh             | 13243          |
| gu            | Gujarati          | 4427           |
| eo            | Esperanto         | 91074          |
| sw            | Swahili           | 9131           |
| mr            | Marathi           | 5545           |
| kn            | Kannada           | 3415           |
| ne            | Nepali            | 4224           |
| mn            | Mongolian         | 6740           |
| si            | Sinhala           | 2062           |
| te            | Telugu            | 18707          |
| be            | Belarusian        | 14871          |
| mk            | Macedonian        | 28935          |
| gl            | Galician          | 52824          |
| hy            | Armenian          | 23434          |
| is            | Icelandic         | 40287          |
| ml            | Malayalam         | 6750           |
| bn            | Bengali           | 7306           |
| ur            | Urdu              | 8476           |
| kk            | Kazakh            | 13700          |
| ka            | Georgian          | 25014          |
| az            | Azerbaijani       | 13277          |
| sq            | Albanian          | 16262          |
| ta            | Tamil             | 9064           |
| et            | Estonian          | 20088          |
| lv            | Latvian           | 30059          |
| ms            | Malay             | 88416          |
| sl            | Slovenian         | 89210          |
| lt            | Lithuanian        | 21184          |
| he            | Hebrew            | 27283          |
| sk            | Slovak            | 21657          |
| el            | Greek             | 39667          |
| th            | Thai              | 94281          |
| bg            | Bulgarian         | 171740         |
| da            | Danish            | 46600          |
| uk            | Ukrainian         | 27682          |
| ro            | Romanian          | 36206          |


### Licensing Information

This work includes data from ConceptNet 5, which was compiled by the
Commonsense Computing Initiative. ConceptNet 5 is freely available under
the Creative Commons Attribution-ShareAlike license (CC BY SA 3.0) from
http://conceptnet.io.

### Citation Information

```
@misc{gurgurov2024gremlinrepositorygreenbaseline,
      title={GrEmLIn: A Repository of Green Baseline Embeddings for 87 Low-Resource Languages Injected with Multilingual Graph Knowledge}, 
      author={Daniil Gurgurov and Rishu Kumar and Simon Ostermann},
      year={2024},
      eprint={2409.18193},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2409.18193}, 
}

@inproceedings{speer2017conceptnet,
    author = {Robyn Speer and Joshua Chin and Catherine Havasi},
    title = {ConceptNet 5.5: An Open Multilingual Graph of General Knowledge},
    booktitle = {AAAI Conference on Artificial Intelligence},
    year = {2017},
    pages = {4444--4451},
    keywords = {ConceptNet; knowledge graph; word embeddings},
    url = {http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14972}
}
```