metadata

license: cc-by-4.0

Clean ConceptNet Data for All Languages

Data Details

For our project on retrofitting GloVe embeddings for low-resource languages, we extracted all data from the ConceptNet database for 304 languages. The extraction process involved several steps to clean and analyze the data from the official ConceptNet dump available here.

The final extracted dataset, available in another Hugging Face repo, was used to train the graph embeddings: we computed PPMI over the co-occurrence statistics between words and then applied SVD to the resulting PPMI matrix.
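
As a rough illustration of the PPMI-then-SVD idea, the sketch below builds embeddings from a tiny dense co-occurrence matrix. It is not the pipeline used to produce these files: the function names `ppmi_matrix` and `svd_embeddings`, the toy counts, and the choice to scale the singular vectors by the square root of the singular values are all assumptions made for the example.

```python
import numpy as np

def ppmi_matrix(cooc):
    """Positive PMI of a dense co-occurrence count matrix (hypothetical helper)."""
    cooc = np.asarray(cooc, dtype=np.float64)
    total = cooc.sum()
    row = cooc.sum(axis=1, keepdims=True)
    col = cooc.sum(axis=0, keepdims=True)
    # Expected counts if rows and columns were independent.
    expected = row @ col / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(cooc / expected)
    # Zero counts produce -inf or nan; treat them as no association.
    pmi[~np.isfinite(pmi)] = 0.0
    # Keep only positive associations.
    return np.maximum(pmi, 0.0)

def svd_embeddings(ppmi, dim):
    """Truncated SVD of the PPMI matrix; the sqrt(s) weighting is an assumption."""
    u, s, _ = np.linalg.svd(ppmi, full_matrices=False)
    return u[:, :dim] * np.sqrt(s[:dim])

# Toy symmetric co-occurrence counts between four hypothetical graph nodes.
cooc = np.array([[0, 2, 1, 0],
                 [2, 0, 1, 0],
                 [1, 1, 0, 1],
                 [0, 0, 1, 0]])
ppmi = ppmi_matrix(cooc)
vectors = svd_embeddings(ppmi, dim=2)  # one 2-dimensional vector per node
```

A dense SVD is only practical for small vocabularies; for the larger languages a sparse matrix with a truncated solver such as `scipy.sparse.linalg.svds` would likely be needed.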

We generate graph embeddings for 72 languages present in both CC100 and ConceptNet.

Dataset Structure

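Each file is a plain-text (.txt) file with one entry per line: a word or phrase followed by its embedding values, all separated by single spaces. Because a phrase can itself contain spaces, the embedding is taken to be the last embedding_size values on the line.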

Use the following function to read in the embeddings:

import numpy as np

def read_embeddings_from_text(file_path, embedding_size=300):
    """Read embeddings from a txt file into a {phrase: np.ndarray} dict."""
    embeddings = {}
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            parts = line.strip().split(' ')
            # The last `embedding_size` tokens are the vector; everything
            # before them is the (possibly multi-word) phrase.
            embedding_start_index = len(parts) - embedding_size
            phrase = ' '.join(parts[:embedding_start_index])
            embedding = np.array([float(val) for val in parts[embedding_start_index:]])
            embeddings[phrase] = embedding
    return embeddings
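
Once the embeddings are loaded into a dictionary, a common next step is a nearest-neighbour lookup by cosine similarity. The sketch below is one way to do that; `nearest_neighbors` is a hypothetical helper, not part of this repository, and the toy dictionary stands in for the output of the reader function with made-up 2-dimensional vectors.

```python
import numpy as np

def nearest_neighbors(embeddings, query, k=5):
    """Return the k phrases most cosine-similar to `query` (hypothetical helper)."""
    phrases = [p for p in embeddings if p != query]
    matrix = np.stack([embeddings[p] for p in phrases])
    q = embeddings[query]
    # Cosine similarity; the small floor guards against all-zero vectors.
    denom = np.linalg.norm(matrix, axis=1) * np.linalg.norm(q)
    sims = matrix @ q / np.maximum(denom, 1e-12)
    order = np.argsort(-sims)[:k]
    return [(phrases[i], float(sims[i])) for i in order]

# Toy stand-in for a loaded embedding dictionary (values are invented).
toy = {
    "cat": np.array([1.0, 0.0]),
    "kitten": np.array([0.9, 0.1]),
    "car": np.array([0.0, 1.0]),
}
neighbors = nearest_neighbors(toy, "cat", k=2)
```

With real files, the dictionary would come from the reader function, and the query must be a key that actually occurs in that language's vocabulary.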

Language Details

Language Code	Language Name	Vocabulary Size
af	Afrikaans	12973
sc	Sardinian	573
yo	Yoruba	2283
gn	Guarani	131
qu	Quechua	5156
li	Limburgish	485
ln	Lingala	4109
wo	Wolof	1196
zu	Zulu	2758
rm	Romansh	3919
ht	Haitian Creole	2699
su	Sundanese	2514
br	Breton	11665
gd	Scottish Gaelic	14418
xh	Xhosa	2504
mg	Malagasy	26575
jv	Javanese	4919
fy	Frisian	7608
sa	Sanskrit	5789
my	Burmese	4875
ug	Uyghur	998
yi	Yiddish	8054
or	Oriya	109
ha	Hausa	802
la	Latin	848943
sd	Sindhi	143
so	Somali	593
ku	Kurdish	9737
pa	Punjabi	4488
ps	Pashto	1087
ga	Irish	29459
am	Amharic	1909
km	Khmer	3466
uz	Uzbek	5224
ky	Kyrgyz	3574
cy	Welsh	13243
gu	Gujarati	4427
eo	Esperanto	91074
sw	Swahili	9131
mr	Marathi	5545
kn	Kannada	3415
ne	Nepali	4224
mn	Mongolian	6740
si	Sinhala	2062
te	Telugu	18707
be	Belarusian	14871
mk	Macedonian	28935
gl	Galician	52824
hy	Armenian	23434
is	Icelandic	40287
ml	Malayalam	6750
bn	Bengali	7306
ur	Urdu	8476
kk	Kazakh	13700
ka	Georgian	25014
az	Azerbaijani	13277
sq	Albanian	16262
ta	Tamil	9064
et	Estonian	20088
lv	Latvian	30059
ms	Malay	88416
sl	Slovenian	89210
lt	Lithuanian	21184
he	Hebrew	27283
sk	Slovak	21657
el	Greek	39667
th	Thai	94281
bg	Bulgarian	171740
da	Danish	46600
uk	Ukrainian	27682
ro	Romanian	36206

Licensing Information

This work includes data from ConceptNet 5, which was compiled by the Commonsense Computing Initiative. ConceptNet 5 is freely available under the Creative Commons Attribution-ShareAlike license (CC BY SA 3.0) from http://conceptnet.io.

Citation Information

@misc{gurgurov2024gremlinrepositorygreenbaseline,
      title={GrEmLIn: A Repository of Green Baseline Embeddings for 87 Low-Resource Languages Injected with Multilingual Graph Knowledge}, 
      author={Daniil Gurgurov and Rishu Kumar and Simon Ostermann},
      year={2024},
      eprint={2409.18193},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2409.18193}, 
}

@inproceedings{speer2017conceptnet,
    author = {Robyn Speer and Joshua Chin and Catherine Havasi},
    title = {ConceptNet 5.5: An Open Multilingual Graph of General Knowledge},
    booktitle = {AAAI Conference on Artificial Intelligence},
    year = {2017},
    pages = {4444--4451},
    keywords = {ConceptNet; knowledge graph; word embeddings},
    url = {http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14972}
}