license: cc-by-4.0

Clean ConceptNet Data for All Languages

Data Details

For our project on Retrofitting GloVe Embeddings for Low-Resource Languages, we extracted all data from the ConceptNet database for 304 languages. The extraction process involved several steps to clean and analyze the data from the official ConceptNet dump available here.
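As a rough illustration of what one extraction step might look like, the sketch below groups edges from ConceptNet assertion lines by language. It assumes the dump's tab-separated layout (edge URI, relation, start node, end node, JSON metadata) and concept URIs of the form /c/<lang>/<term>. The function names `parse_concept` and `edges_by_language` are hypothetical, and keeping only same-language edges is an assumption of this sketch, not necessarily the pipeline used for this dataset.

```python
from collections import defaultdict


def parse_concept(uri):
    """Return (language, term) for a URI like '/c/en/ice_cream/n', else None."""
    parts = uri.split('/')
    # A concept URI splits into ['', 'c', lang, term, ...].
    if len(parts) < 4 or parts[1] != 'c':
        return None
    return parts[2], parts[3].replace('_', ' ')


def edges_by_language(lines):
    """Group same-language edges as {lang: [(start_term, relation, end_term), ...]}."""
    grouped = defaultdict(list)
    for line in lines:
        fields = line.rstrip('\n').split('\t')
        if len(fields) < 4:
            continue  # skip malformed rows
        relation, start, end = fields[1], fields[2], fields[3]
        start_c, end_c = parse_concept(start), parse_concept(end)
        if start_c is None or end_c is None:
            continue  # one endpoint is not a concept node
        if start_c[0] != end_c[0]:
            continue  # keep monolingual edges only (assumption of this sketch)
        grouped[start_c[0]].append((start_c[1], relation, end_c[1]))
    return dict(grouped)


# Two made-up assertion lines: one Afrikaans-only edge, one cross-lingual edge.
sample = [
    "/a/[/r/IsA/,/c/af/hond/,/c/af/dier/]\t/r/IsA\t/c/af/hond\t/c/af/dier\t{}",
    "/a/[/r/Synonym/,/c/af/hond/,/c/en/dog/]\t/r/Synonym\t/c/af/hond\t/c/en/dog\t{}",
]
print(edges_by_language(sample))
```

In this toy input only the first line survives, because the second edge links an Afrikaans term to an English one.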

The final extracted dataset, available in another HuggingFace repo, was used to train the graph embeddings: we computed PPMI over the co-occurrence statistics between the words and then applied SVD to the resulting PPMI matrix.
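The PPMI-then-SVD idea can be sketched on a toy matrix as follows. This is a minimal illustration, not the training code behind these embeddings: the helper names `ppmi_matrix` and `svd_embeddings`, the toy counts, and the choice to scale the left singular vectors by the square root of the singular values are all assumptions.

```python
import numpy as np


def ppmi_matrix(counts):
    """Positive pointwise mutual information of a co-occurrence count matrix."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    row = counts.sum(axis=1, keepdims=True)
    col = counts.sum(axis=0, keepdims=True)
    with np.errstate(divide='ignore', invalid='ignore'):
        # PMI = log( P(w, c) / (P(w) * P(c)) ) = log( count * total / (row * col) )
        pmi = np.log(counts * total / (row * col))
    # Zero counts give -inf (or nan for empty rows/columns); treat them as 0.
    pmi[~np.isfinite(pmi)] = 0.0
    return np.maximum(pmi, 0.0)


def svd_embeddings(ppmi, dim):
    """Rank-`dim` embeddings U * sqrt(S) from a truncated SVD (one common weighting)."""
    u, s, _ = np.linalg.svd(ppmi, full_matrices=False)
    return u[:, :dim] * np.sqrt(s[:dim])


# Toy symmetric co-occurrence counts for a four-word vocabulary.
counts = np.array([
    [0, 4, 1, 0],
    [4, 0, 2, 0],
    [1, 2, 0, 3],
    [0, 0, 3, 0],
])
ppmi = ppmi_matrix(counts)
vectors = svd_embeddings(ppmi, dim=2)
print(vectors.shape)
```

For a real vocabulary the matrix would be large and sparse, so a sparse truncated SVD would likely be preferable to the dense `np.linalg.svd` call used here.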

We generate graph embeddings for 72 languages present in both CC100 and ConceptNet.

Dataset Structure

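Each file is a plain-text (.txt) file with one entry per line: a word or phrase followed by its embedding values, all separated by single spaces. Because a phrase may itself contain spaces, the embedding is taken to be the last embedding_size fields of the line.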

Use the following function to read in the embeddings:

import numpy as np


def read_embeddings_from_text(file_path, embedding_size=300):
    """Read embeddings from a txt file into a {phrase: vector} dict."""
    embeddings = {}
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            parts = line.strip().split(' ')
            if len(parts) <= embedding_size:
                continue  # skip blank or malformed lines
            # The last `embedding_size` fields are the vector; the rest is the phrase.
            embedding_start_index = len(parts) - embedding_size
            phrase = ' '.join(parts[:embedding_start_index])
            embedding = np.array([float(val) for val in parts[embedding_start_index:]])
            embeddings[phrase] = embedding
    return embeddings
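Once loaded, the vectors can be queried, for example by cosine similarity. The sketch below is self-contained and uses its own hypothetical helpers (`load_matrix`, `nearest`) plus a tiny made-up file with 3-dimensional vectors; the real files are assumed to use the 300-dimensional default shown above.

```python
import os
import tempfile

import numpy as np


def load_matrix(file_path, embedding_size=300):
    """Load phrases and a (n_phrases, embedding_size) matrix from a txt file."""
    phrases, rows = [], []
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            # Split from the right so spaces inside a phrase are preserved.
            parts = line.rstrip('\n').rsplit(' ', embedding_size)
            if len(parts) != embedding_size + 1:
                continue  # skip blank or malformed lines
            phrases.append(parts[0])
            rows.append([float(val) for val in parts[1:]])
    return phrases, np.array(rows)


def nearest(query, phrases, matrix, top_k=3):
    """Return the top_k phrases most cosine-similar to `query`, excluding itself."""
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    scores = unit @ unit[phrases.index(query)]
    order = [i for i in np.argsort(-scores) if phrases[i] != query]
    return [phrases[i] for i in order[:top_k]]


# Tiny made-up file with 3-dimensional vectors, for illustration only.
toy = "hond 1.0 0.0 0.0\nwilde hond 0.9 0.1 0.0\nboom 0.0 1.0 0.0\n"
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write(toy)
phrases, matrix = load_matrix(tmp.name, embedding_size=3)
os.remove(tmp.name)
print(nearest('hond', phrases, matrix, top_k=1))
```

Keeping the vectors in a single matrix makes similarity queries one matrix-vector product; the sketch does not guard against all-zero vectors, which would produce undefined similarities.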

Language Details

| Language Code | Language Name | Vocabulary Size |
|---|---|---|
| af | Afrikaans | 12973 |
| sc | Sardinian | 573 |
| yo | Yoruba | 2283 |
| gn | Guarani | 131 |
| qu | Quechua | 5156 |
| li | Limburgish | 485 |
| ln | Lingala | 4109 |
| wo | Wolof | 1196 |
| zu | Zulu | 2758 |
| rm | Romansh | 3919 |
| ht | Haitian Creole | 2699 |
| su | Sundanese | 2514 |
| br | Breton | 11665 |
| gd | Scottish Gaelic | 14418 |
| xh | Xhosa | 2504 |
| mg | Malagasy | 26575 |
| jv | Javanese | 4919 |
| fy | Frisian | 7608 |
| sa | Sanskrit | 5789 |
| my | Burmese | 4875 |
| ug | Uyghur | 998 |
| yi | Yiddish | 8054 |
| or | Oriya | 109 |
| ha | Hausa | 802 |
| la | Latin | 848943 |
| sd | Sindhi | 143 |
| so | Somali | 593 |
| ku | Kurdish | 9737 |
| pa | Punjabi | 4488 |
| ps | Pashto | 1087 |
| ga | Irish | 29459 |
| am | Amharic | 1909 |
| km | Khmer | 3466 |
| uz | Uzbek | 5224 |
| ky | Kyrgyz | 3574 |
| cy | Welsh | 13243 |
| gu | Gujarati | 4427 |
| eo | Esperanto | 91074 |
| sw | Swahili | 9131 |
| mr | Marathi | 5545 |
| kn | Kannada | 3415 |
| ne | Nepali | 4224 |
| mn | Mongolian | 6740 |
| si | Sinhala | 2062 |
| te | Telugu | 18707 |
| be | Belarusian | 14871 |
| mk | Macedonian | 28935 |
| gl | Galician | 52824 |
| hy | Armenian | 23434 |
| is | Icelandic | 40287 |
| ml | Malayalam | 6750 |
| bn | Bengali | 7306 |
| ur | Urdu | 8476 |
| kk | Kazakh | 13700 |
| ka | Georgian | 25014 |
| az | Azerbaijani | 13277 |
| sq | Albanian | 16262 |
| ta | Tamil | 9064 |
| et | Estonian | 20088 |
| lv | Latvian | 30059 |
| ms | Malay | 88416 |
| sl | Slovenian | 89210 |
| lt | Lithuanian | 21184 |
| he | Hebrew | 27283 |
| sk | Slovak | 21657 |
| el | Greek | 39667 |
| th | Thai | 94281 |
| bg | Bulgarian | 171740 |
| da | Danish | 46600 |
| uk | Ukrainian | 27682 |
| ro | Romanian | 36206 |

Licensing Information

This work includes data from ConceptNet 5, which was compiled by the Commonsense Computing Initiative. ConceptNet 5 is freely available under the Creative Commons Attribution-ShareAlike license (CC BY SA 3.0) from http://conceptnet.io.

Citation Information

@misc{gurgurov2024gremlinrepositorygreenbaseline,
      title={GrEmLIn: A Repository of Green Baseline Embeddings for 87 Low-Resource Languages Injected with Multilingual Graph Knowledge}, 
      author={Daniil Gurgurov and Rishu Kumar and Simon Ostermann},
      year={2024},
      eprint={2409.18193},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2409.18193}, 
}

@inproceedings{speer2017conceptnet,
      title={ConceptNet 5.5: An Open Multilingual Graph of General Knowledge},
      author={Robyn Speer and Joshua Chin and Catherine Havasi},
      booktitle={AAAI Conference on Artificial Intelligence},
      year={2017},
      pages={4444--4451},
      keywords={ConceptNet; knowledge graph; word embeddings},
      url={http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14972}
}