Clean ConceptNet Data for All Languages
Data Details
For our project on Retrofitting GloVe embeddings for Low Resource Languages, we extracted all data from the ConceptNet database for 304 languages. The extraction process involved several steps to clean and analyze the data from the official ConceptNet dump available here.
The final extracted dataset, available in another Hugging Face repo, was used to train the graph embeddings: we compute PPMI (positive pointwise mutual information) over the co-occurrence statistics between words and then apply SVD to the resulting PPMI matrix.
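To make the PPMI and SVD steps concrete, here is a minimal sketch with NumPy on a toy co-occurrence matrix. It is not the project's training code: the function names `ppmi_matrix` and `svd_embeddings`, the toy counts, and the choice to scale the singular vectors by the square root of the singular values are all assumptions made for illustration.

```python
import numpy as np


def ppmi_matrix(counts):
    """Positive PMI for a co-occurrence matrix (rows = words, columns = contexts)."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    row_sums = counts.sum(axis=1, keepdims=True)
    col_sums = counts.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        # Count expected under independence: row_sum * col_sum / total.
        expected = row_sums @ col_sums / total
        pmi = np.log(counts / expected)
    # Zero counts give -inf or nan; treat those cells as "no association".
    pmi[~np.isfinite(pmi)] = 0.0
    # Clip negative PMI values to zero.
    return np.maximum(pmi, 0.0)


def svd_embeddings(ppmi, dim):
    """Keep the top `dim` singular directions of the PPMI matrix."""
    u, s, _ = np.linalg.svd(ppmi, full_matrices=False)
    # Assumption: weight each direction by sqrt(singular value).
    return u[:, :dim] * np.sqrt(s[:dim])


# Hypothetical symmetric counts for four words (e.g. weighted graph edges).
toy_counts = np.array([
    [0, 2, 1, 0],
    [2, 0, 0, 0],
    [1, 0, 0, 3],
    [0, 0, 3, 0],
])
toy_ppmi = ppmi_matrix(toy_counts)
toy_vectors = svd_embeddings(toy_ppmi, dim=2)  # one 2-dimensional row per word
```

For a real vocabulary the co-occurrence matrix is large and sparse, so a sparse truncated SVD would likely be a better fit than the dense `np.linalg.svd` call used here.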
We generate graph embeddings for 72 languages present in both CC100 and ConceptNet.
Dataset Structure
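Each file is a plain-text (.txt) file with one entry per line: a word or phrase followed by its embedding values, all separated by single spaces. Because a phrase can itself contain spaces, the embedding is taken to be the last embedding_size fields on the line (300 by default), and everything before them is the phrase.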
Use the following function to read in the embeddings:
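import numpy as np

def read_embeddings_from_text(file_path, embedding_size=300):
    """Read embeddings from a txt file into a {phrase: np.ndarray} dict."""
    embeddings = {}
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            parts = line.strip().split(' ')
            # Skip blank or truncated lines that cannot hold a phrase plus a full vector.
            if len(parts) <= embedding_size:
                continue
            # The last `embedding_size` fields are the vector; the rest is the phrase.
            embedding_start_index = len(parts) - embedding_size
            phrase = ' '.join(parts[:embedding_start_index])
            embedding = np.array([float(val) for val in parts[embedding_start_index:]])
            embeddings[phrase] = embedding
    return embeddings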
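Once the embeddings are loaded, a common sanity check is to look up the nearest neighbours of a word by cosine similarity. The sketch below works on any dictionary of the shape returned by `read_embeddings_from_text`; the helper name `nearest_neighbors` and the toy two-dimensional vectors are hypothetical and only illustrate the idea.

```python
import numpy as np


def nearest_neighbors(embeddings, query, k=5):
    """Return up to k (phrase, cosine similarity) pairs closest to `query`."""
    if query not in embeddings:
        raise KeyError(f"{query!r} is not in the vocabulary")
    phrases = [p for p in embeddings if p != query]
    if not phrases:
        return []
    matrix = np.stack([embeddings[p] for p in phrases])
    query_vec = embeddings[query]
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query_vec)
    norms[norms == 0] = 1.0  # zero vectors get similarity 0 instead of nan
    sims = matrix @ query_vec / norms
    order = np.argsort(-sims)[:k]  # indices sorted by descending similarity
    return [(phrases[i], float(sims[i])) for i in order]


# Toy vectors standing in for a loaded embedding file (values are made up).
toy = {
    "kat": np.array([1.0, 0.0]),
    "hond": np.array([0.9, 0.1]),
    "motor": np.array([0.0, 1.0]),
}
closest = nearest_neighbors(toy, "kat", k=1)  # "hond" ranks first
```

With a real file, pass the dictionary returned by `read_embeddings_from_text` in place of `toy`. For the larger vocabularies (Latin, Bulgarian), building the matrix once and reusing it across queries would avoid repeated stacking.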
Language Details
Language Code | Language Name | Vocabulary Size |
---|---|---|
af | Afrikaans | 12973 |
sc | Sardinian | 573 |
yo | Yoruba | 2283 |
gn | Guarani | 131 |
qu | Quechua | 5156 |
li | Limburgish | 485 |
ln | Lingala | 4109 |
wo | Wolof | 1196 |
zu | Zulu | 2758 |
rm | Romansh | 3919 |
ht | Haitian Creole | 2699 |
su | Sundanese | 2514 |
br | Breton | 11665 |
gd | Scottish Gaelic | 14418 |
xh | Xhosa | 2504 |
mg | Malagasy | 26575 |
jv | Javanese | 4919 |
fy | Frisian | 7608 |
sa | Sanskrit | 5789 |
my | Burmese | 4875 |
ug | Uyghur | 998 |
yi | Yiddish | 8054 |
or | Oriya | 109 |
ha | Hausa | 802 |
la | Latin | 848943 |
sd | Sindhi | 143 |
so | Somali | 593 |
ku | Kurdish | 9737 |
pa | Punjabi | 4488 |
ps | Pashto | 1087 |
ga | Irish | 29459 |
am | Amharic | 1909 |
km | Khmer | 3466 |
uz | Uzbek | 5224 |
ky | Kyrgyz | 3574 |
cy | Welsh | 13243 |
gu | Gujarati | 4427 |
eo | Esperanto | 91074 |
sw | Swahili | 9131 |
mr | Marathi | 5545 |
kn | Kannada | 3415 |
ne | Nepali | 4224 |
mn | Mongolian | 6740 |
si | Sinhala | 2062 |
te | Telugu | 18707 |
be | Belarusian | 14871 |
mk | Macedonian | 28935 |
gl | Galician | 52824 |
hy | Armenian | 23434 |
is | Icelandic | 40287 |
ml | Malayalam | 6750 |
bn | Bengali | 7306 |
ur | Urdu | 8476 |
kk | Kazakh | 13700 |
ka | Georgian | 25014 |
az | Azerbaijani | 13277 |
sq | Albanian | 16262 |
ta | Tamil | 9064 |
et | Estonian | 20088 |
lv | Latvian | 30059 |
ms | Malay | 88416 |
sl | Slovenian | 89210 |
lt | Lithuanian | 21184 |
he | Hebrew | 27283 |
sk | Slovak | 21657 |
el | Greek | 39667 |
th | Thai | 94281 |
bg | Bulgarian | 171740 |
da | Danish | 46600 |
uk | Ukrainian | 27682 |
ro | Romanian | 36206 |
Licensing Information
This work includes data from ConceptNet 5, which was compiled by the Commonsense Computing Initiative. ConceptNet 5 is freely available under the Creative Commons Attribution-ShareAlike license (CC BY SA 3.0) from http://conceptnet.io.
Citation Information
@misc{gurgurov2024gremlinrepositorygreenbaseline,
title={GrEmLIn: A Repository of Green Baseline Embeddings for 87 Low-Resource Languages Injected with Multilingual Graph Knowledge},
author={Daniil Gurgurov and Rishu Kumar and Simon Ostermann},
year={2024},
eprint={2409.18193},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2409.18193},
}
@inproceedings{speer2017conceptnet,
author = {Robyn Speer and Joshua Chin and Catherine Havasi},
title = {ConceptNet 5.5: An Open Multilingual Graph of General Knowledge},
booktitle = {AAAI Conference on Artificial Intelligence},
year = {2017},
pages = {4444--4451},
keywords = {ConceptNet; knowledge graph; word embeddings},
url = {http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14972}
}