---
license: cc-by-4.0
---
# Clean ConceptNet Data for All Languages
## Data Details
For our project on [Retrofitting GloVe embeddings for Low Resource Languages](https://github.com/pyRis/retrofitting-embeddings-lrls/tree/main?tab=readme-ov-file), we extracted all data from the [ConceptNet](https://github.com/commonsense/conceptnet5/wiki/Downloads) database for 304 languages. The extraction process involved several steps to clean and analyze the data from the official ConceptNet dump available [here](https://s3.amazonaws.com/conceptnet/downloads/2019/edges/conceptnet-assertions-5.7.0.csv.gz).
The final extracted dataset, available in a separate [HuggingFace repo](https://huggingface.co/datasets/DGurgurov/conceptnet_all), was used to train the graph embeddings: we computed PPMI over the co-occurrence statistics of the words and subsequently applied SVD to the resulting PPMI matrix.
We generate graph embeddings for 72 languages present in both CC100 and ConceptNet.
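The PPMI-then-SVD idea described above can be illustrated on a toy graph. This is only a minimal sketch, not the pipeline used to produce the released files: the function names `ppmi_matrix` and `svd_embeddings`, the four-word adjacency matrix, and the choice to scale the left singular vectors by the singular values are all assumptions made for illustration.

```python
import numpy as np

def ppmi_matrix(counts):
    """Positive pointwise mutual information of a co-occurrence matrix (hypothetical helper)."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    row = counts.sum(axis=1, keepdims=True)
    col = counts.sum(axis=0, keepdims=True)
    expected = row @ col / total  # counts expected under independence
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(counts / expected)
        # Keep only finite, positive PMI values; everything else becomes 0.
        return np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)

def svd_embeddings(ppmi, dim):
    """Truncated SVD of the PPMI matrix; rows are the word vectors (hypothetical helper)."""
    u, s, _ = np.linalg.svd(ppmi, full_matrices=False)
    # Assumption: weight the left singular vectors by the singular values.
    return u[:, :dim] * s[:dim]

# Toy graph: "cat" and "dog" are each linked to "animal" and "pet".
words = ["cat", "dog", "animal", "pet"]
adjacency = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
])
ppmi = ppmi_matrix(adjacency)
vectors = svd_embeddings(ppmi, dim=2)
```

In this toy case "cat" and "dog" share exactly the same neighbours, so they end up with the same vector. With real data the co-occurrence matrix would be large and sparse, so a sparse truncated SVD would likely be a better fit than the dense `np.linalg.svd` used here.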
### Dataset Structure
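Each file is a plain-text (`.txt`) file with one entry per line: a word or phrase followed by its embedding values, all separated by single spaces. Because a phrase may itself contain spaces, the embedding is taken to be the last `embedding_size` values on the line.
Use the following function to read the embeddings: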
```python
import numpy as np


def read_embeddings_from_text(file_path, embedding_size=300):
    """Read embeddings from a txt file into a {phrase: vector} dictionary."""
    embeddings = {}
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            parts = line.strip().split(' ')
            # The last `embedding_size` tokens are the vector;
            # everything before them is the (possibly multi-word) phrase.
            embedding_start_index = len(parts) - embedding_size
            phrase = ' '.join(parts[:embedding_start_index])
            embedding = np.array([float(val) for val in parts[embedding_start_index:]])
            embeddings[phrase] = embedding
    return embeddings
```
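Once loaded, the dictionary maps each word or phrase to a NumPy vector, so similarity queries become straightforward. The sketch below ranks entries by cosine similarity to a query word. The helper `nearest_neighbors` and the three-dimensional toy vectors are hypothetical and stand in for a dictionary returned by `read_embeddings_from_text`.

```python
import numpy as np

def nearest_neighbors(embeddings, query, top_k=5):
    """Rank phrases by cosine similarity to `query` (hypothetical helper)."""
    phrases = list(embeddings)
    matrix = np.stack([embeddings[p] for p in phrases])
    norms = np.linalg.norm(matrix, axis=1)
    norms[norms == 0] = 1.0  # leave all-zero vectors as they are
    unit = matrix / norms[:, None]
    scores = unit @ unit[phrases.index(query)]
    order = np.argsort(-scores)  # highest similarity first
    ranked = [(phrases[i], float(scores[i])) for i in order if phrases[i] != query]
    return ranked[:top_k]

# Toy stand-in for a loaded embedding dictionary.
toy_embeddings = {
    "cat": np.array([1.0, 0.0, 0.0]),
    "kitten": np.array([0.9, 0.1, 0.0]),
    "car": np.array([0.0, 0.0, 1.0]),
}
neighbors = nearest_neighbors(toy_embeddings, "cat", top_k=2)
```

For the larger vocabularies in the table below, building the normalized matrix once and reusing it across queries would avoid repeating the stacking step on every call.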
### Language Details
| Language Code | Language Name | Vocabulary Size |
| --- | --- | --- |
| af | Afrikaans | 12973 |
| sc | Sardinian | 573 |
| yo | Yoruba | 2283 |
| gn | Guarani | 131 |
| qu | Quechua | 5156 |
| li | Limburgish | 485 |
| ln | Lingala | 4109 |
| wo | Wolof | 1196 |
| zu | Zulu | 2758 |
| rm | Romansh | 3919 |
| ht | Haitian Creole | 2699 |
| su | Sundanese | 2514 |
| br | Breton | 11665 |
| gd | Scottish Gaelic | 14418 |
| xh | Xhosa | 2504 |
| mg | Malagasy | 26575 |
| jv | Javanese | 4919 |
| fy | Frisian | 7608 |
| sa | Sanskrit | 5789 |
| my | Burmese | 4875 |
| ug | Uyghur | 998 |
| yi | Yiddish | 8054 |
| or | Oriya | 109 |
| ha | Hausa | 802 |
| la | Latin | 848943 |
| sd | Sindhi | 143 |
| so | Somali | 593 |
| ku | Kurdish | 9737 |
| pa | Punjabi | 4488 |
| ps | Pashto | 1087 |
| ga | Irish | 29459 |
| am | Amharic | 1909 |
| km | Khmer | 3466 |
| uz | Uzbek | 5224 |
| ky | Kyrgyz | 3574 |
| cy | Welsh | 13243 |
| gu | Gujarati | 4427 |
| eo | Esperanto | 91074 |
| sw | Swahili | 9131 |
| mr | Marathi | 5545 |
| kn | Kannada | 3415 |
| ne | Nepali | 4224 |
| mn | Mongolian | 6740 |
| si | Sinhala | 2062 |
| te | Telugu | 18707 |
| be | Belarusian | 14871 |
| mk | Macedonian | 28935 |
| gl | Galician | 52824 |
| hy | Armenian | 23434 |
| is | Icelandic | 40287 |
| ml | Malayalam | 6750 |
| bn | Bengali | 7306 |
| ur | Urdu | 8476 |
| kk | Kazakh | 13700 |
| ka | Georgian | 25014 |
| az | Azerbaijani | 13277 |
| sq | Albanian | 16262 |
| ta | Tamil | 9064 |
| et | Estonian | 20088 |
| lv | Latvian | 30059 |
| ms | Malay | 88416 |
| sl | Slovenian | 89210 |
| lt | Lithuanian | 21184 |
| he | Hebrew | 27283 |
| sk | Slovak | 21657 |
| el | Greek | 39667 |
| th | Thai | 94281 |
| bg | Bulgarian | 171740 |
| da | Danish | 46600 |
| uk | Ukrainian | 27682 |
| ro | Romanian | 36206 |
### Licensing Information
This work includes data from ConceptNet 5, which was compiled by the
Commonsense Computing Initiative. ConceptNet 5 is freely available under
the Creative Commons Attribution-ShareAlike license (CC BY SA 3.0) from
http://conceptnet.io.
### Citation Information
```
@misc{gurgurov2024gremlinrepositorygreenbaseline,
title={GrEmLIn: A Repository of Green Baseline Embeddings for 87 Low-Resource Languages Injected with Multilingual Graph Knowledge},
author={Daniil Gurgurov and Rishu Kumar and Simon Ostermann},
year={2024},
eprint={2409.18193},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2409.18193},
}
@inproceedings{speer2017conceptnet,
author = {Robyn Speer and Joshua Chin and Catherine Havasi},
title = {ConceptNet 5.5: An Open Multilingual Graph of General Knowledge},
booktitle = {AAAI Conference on Artificial Intelligence},
year = {2017},
pages = {4444--4451},
keywords = {ConceptNet; knowledge graph; word embeddings},
url = {http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14972}
}
```