---
license: cc-by-4.0
---

# Clean ConceptNet Data for All Languages

## Data Details

For our project on [Retrofitting GloVe Embeddings for Low-Resource Languages](https://github.com/pyRis/retrofitting-embeddings-lrls/tree/main?tab=readme-ov-file), we extracted all data from the [ConceptNet](https://github.com/commonsense/conceptnet5/wiki/Downloads) database for 304 languages. The extraction involved several steps to clean and analyze the data from the [official ConceptNet 5.7.0 dump](https://s3.amazonaws.com/conceptnet/downloads/2019/edges/conceptnet-assertions-5.7.0.csv.gz).

The final extracted dataset, available in a separate [Hugging Face repo](https://huggingface.co/datasets/DGurgurov/conceptnet_all), was used to train the graph embeddings: we computed PPMI over the co-occurrence statistics between words and then applied SVD to the resulting PPMI matrix.
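The PPMI-then-SVD idea can be illustrated with a minimal sketch on a toy graph. It is not the exact training pipeline used for this dataset: the helper names `ppmi_matrix` and `svd_embeddings`, the toy edge list, and the choice to scale the singular vectors by the square root of the singular values are all assumptions made for illustration.

```python
import numpy as np


def ppmi_matrix(counts):
    """Positive PMI for a square co-occurrence count matrix."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    row = counts.sum(axis=1, keepdims=True)
    col = counts.sum(axis=0, keepdims=True)
    expected = row @ col / total  # expected counts under independence
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(counts / expected)
    pmi[~np.isfinite(pmi)] = 0.0  # zero counts give -inf or nan; treat as 0
    return np.maximum(pmi, 0.0)   # keep only positive associations


def svd_embeddings(ppmi, dim):
    """Truncated SVD of the PPMI matrix; one row per word."""
    u, s, _ = np.linalg.svd(ppmi, full_matrices=False)
    # Scaling by sqrt(s) is one common weighting, assumed here.
    return u[:, :dim] * np.sqrt(s[:dim])


# Hypothetical toy graph: each edge counts as one symmetric co-occurrence.
words = ["dog", "cat", "animal", "car"]
edges = [("dog", "animal"), ("cat", "animal"), ("dog", "cat")]
index = {w: i for i, w in enumerate(words)}

counts = np.zeros((len(words), len(words)))
for a, b in edges:
    counts[index[a], index[b]] += 1
    counts[index[b], index[a]] += 1

ppmi = ppmi_matrix(counts)
vectors = svd_embeddings(ppmi, dim=2)
print(vectors.shape)
```

In this toy setup, a word with no edges (here `car`) ends up with an all-zero PPMI row, so its embedding carries no information.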

We generate graph embeddings for 72 languages present in both CC100 and ConceptNet.

### Dataset Structure

Each file is a plain-text (`.txt`) file with one entry per line: a word or phrase followed by its embedding values, all separated by single spaces.

Use the following function to read in the embeddings:

```python
import numpy as np


def read_embeddings_from_text(file_path, embedding_size=300):
    """Read embeddings from a txt file into a {phrase: vector} dict."""
    embeddings = {}
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            parts = line.strip().split(' ')
            # The last `embedding_size` tokens are the vector; everything
            # before them is the (possibly multi-word) phrase.
            embedding_start_index = len(parts) - embedding_size
            phrase = ' '.join(parts[:embedding_start_index])
            embedding = np.array([float(val) for val in parts[embedding_start_index:]])
            embeddings[phrase] = embedding
    return embeddings
```
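Once loaded, the dictionary can be queried directly. The sketch below shows one possible way to look up nearest neighbours by cosine similarity. The helpers `cosine_similarity` and `nearest_neighbors` are hypothetical, and the small 3-dimensional `toy` dictionary merely stands in for the output of `read_embeddings_from_text`.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0


def nearest_neighbors(embeddings, query, k=3):
    """Return the k phrases most similar to `query`, best first."""
    query_vec = embeddings[query]
    scores = [
        (phrase, cosine_similarity(query_vec, vec))
        for phrase, vec in embeddings.items()
        if phrase != query
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:k]


# Toy stand-in for a loaded embedding dict (real vectors have 300 dimensions).
toy = {
    "dog": np.array([1.0, 0.0, 0.0]),
    "hot dog": np.array([0.8, 0.6, 0.0]),
    "car": np.array([0.0, 0.0, 1.0]),
}

neighbors = nearest_neighbors(toy, "dog", k=2)
for phrase, score in neighbors:
    print(f"{phrase}\t{score:.3f}")
```

Keys may contain spaces (as in `hot dog` above), which is why the reader function takes the vector from the end of each line and not from the second token.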

### Language Details

| Language Code | Language Name     | Vocabulary Size |
| ---           | ---               | ---            |
| af            | Afrikaans         | 12973          |
| sc            | Sardinian         | 573            |
| yo            | Yoruba            | 2283           |
| gn            | Guarani           | 131            |
| qu            | Quechua           | 5156           |
| li            | Limburgish        | 485            |
| ln            | Lingala           | 4109           |
| wo            | Wolof             | 1196           |
| zu            | Zulu              | 2758           |
| rm            | Romansh           | 3919           |
| ht            | Haitian Creole    | 2699           |
| su            | Sundanese         | 2514           |
| br            | Breton            | 11665          |
| gd            | Scottish Gaelic   | 14418          |
| xh            | Xhosa             | 2504           |
| mg            | Malagasy          | 26575          |
| jv            | Javanese          | 4919           |
| fy            | Frisian           | 7608           |
| sa            | Sanskrit          | 5789           |
| my            | Burmese           | 4875           |
| ug            | Uyghur            | 998            |
| yi            | Yiddish           | 8054           |
| or            | Oriya             | 109            |
| ha            | Hausa             | 802            |
| la            | Latin             | 848943         |
| sd            | Sindhi            | 143            |
| so            | Somali            | 593            |
| ku            | Kurdish           | 9737           |
| pa            | Punjabi           | 4488           |
| ps            | Pashto            | 1087           |
| ga            | Irish             | 29459          |
| am            | Amharic           | 1909           |
| km            | Khmer             | 3466           |
| uz            | Uzbek             | 5224           |
| ky            | Kyrgyz            | 3574           |
| cy            | Welsh             | 13243          |
| gu            | Gujarati          | 4427           |
| eo            | Esperanto         | 91074          |
| sw            | Swahili           | 9131           |
| mr            | Marathi           | 5545           |
| kn            | Kannada           | 3415           |
| ne            | Nepali            | 4224           |
| mn            | Mongolian         | 6740           |
| si            | Sinhala           | 2062           |
| te            | Telugu            | 18707          |
| be            | Belarusian        | 14871          |
| mk            | Macedonian        | 28935          |
| gl            | Galician          | 52824          |
| hy            | Armenian          | 23434          |
| is            | Icelandic         | 40287          |
| ml            | Malayalam         | 6750           |
| bn            | Bengali           | 7306           |
| ur            | Urdu              | 8476           |
| kk            | Kazakh            | 13700          |
| ka            | Georgian          | 25014          |
| az            | Azerbaijani       | 13277          |
| sq            | Albanian          | 16262          |
| ta            | Tamil             | 9064           |
| et            | Estonian          | 20088          |
| lv            | Latvian           | 30059          |
| ms            | Malay             | 88416          |
| sl            | Slovenian         | 89210          |
| lt            | Lithuanian        | 21184          |
| he            | Hebrew            | 27283          |
| sk            | Slovak            | 21657          |
| el            | Greek             | 39667          |
| th            | Thai              | 94281          |
| bg            | Bulgarian         | 171740         |
| da            | Danish            | 46600          |
| uk            | Ukrainian         | 27682          |
| ro            | Romanian          | 36206          |


### Licensing Information

This work includes data from ConceptNet 5, which was compiled by the
Commonsense Computing Initiative. ConceptNet 5 is freely available under
the Creative Commons Attribution-ShareAlike license (CC BY SA 3.0) from
http://conceptnet.io.

### Citation Information

```
@misc{gurgurov2024gremlinrepositorygreenbaseline,
      title={GrEmLIn: A Repository of Green Baseline Embeddings for 87 Low-Resource Languages Injected with Multilingual Graph Knowledge}, 
      author={Daniil Gurgurov and Rishu Kumar and Simon Ostermann},
      year={2024},
      eprint={2409.18193},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2409.18193}, 
}

@inproceedings{speer2017conceptnet,
    author = {Robyn Speer and Joshua Chin and Catherine Havasi},
    title = {ConceptNet 5.5: An Open Multilingual Graph of General Knowledge},
    booktitle = {AAAI Conference on Artificial Intelligence},
    year = {2017},
    pages = {4444--4451},
    keywords = {ConceptNet; knowledge graph; word embeddings},
    url = {http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14972}
}
```