Word2Bezbar: Word2Vec Models for French Rap Lyrics

Overview

Word2Bezbar models are Word2Vec models trained on French rap lyrics sourced from Genius. Tokenization was done with NLTK's French word_tokenize function, after a preprocessing step that removes French oral contractions. The dataset was 323 MB, corresponding to 77M tokens.
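The exact contraction-removal rules are not documented here. As a rough illustration only, a preprocessing step of this kind might look like the sketch below. The `ORAL_CONTRACTIONS` table and both helper functions are hypothetical, and a plain regex split stands in for NLTK's `word_tokenize(text, language="french")` so that the snippet runs without downloading tokenizer data.

```python
import re
from typing import List

# Hypothetical mapping of oral contractions to their full forms.
# The rules actually used to build the training corpus are not documented here.
ORAL_CONTRACTIONS = {
    "j'suis": "je suis",
    "t'es": "tu es",
    "t'as": "tu as",
    "y'a": "il y a",
}

# One alternation over all known contractions, matched on word boundaries.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in ORAL_CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)


def expand_oral_contractions(text: str) -> str:
    """Replace each known oral contraction with its expanded form."""
    return _PATTERN.sub(lambda m: ORAL_CONTRACTIONS[m.group(0).lower()], text)


def simple_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer: lowercased words, keeping apostrophes and hyphens."""
    return re.findall(r"[\w'-]+", text.lower())


tokens = simple_tokenize(expand_oral_contractions("J'suis dans le bendo, t'es où"))
print(tokens)
```

Expanding contractions before tokenizing keeps variants such as "j'suis" and "je suis" from being learned as unrelated vocabulary entries.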

The model captures semantic relationships between words in the context of French rap, providing a useful tool for studies of French slang and for music lyrics analysis.

Model Details

This is the small version of the model.

Parameter	Value
Dimensionality	100
Window Size	5
Epochs	10
Algorithm	CBOW

Versions

This model was trained with the following software versions:

Requirement	Version
Python	3.8.5
Gensim library	4.3.2
NLTK library	3.8.1

Installation

Install Required Python Libraries:
```
pip install gensim
```

Clone the Repository:

```
git clone https://github.com/rapminerz/Word2Bezbar-small.git
```

Navigate to the Model Directory:
```
cd Word2Bezbar-small
```

Loading the Model

To load the Word2Bezbar Word2Vec model, use the following Python code:

```
import gensim

# Load the Word2Vec model
model = gensim.models.Word2Vec.load("word2vec.model")
```

Using the Model

Once the model is loaded, you can use it as shown:

To get the words most similar to a given word:

```
model.wv.most_similar("bendo")
[('binks', 0.8920747637748718),
 ('bando', 0.8460732698440552),
 ('hood', 0.8299438953399658),
 ('tieks', 0.8264378309249878),
 ('hall', 0.817583441734314),
 ('secteur', 0.8145656585693359),
 ('barrio', 0.809047281742096),
 ('block', 0.793493390083313),
 ('bâtiment', 0.7826434969902039),
 ('bloc', 0.7753982543945312)]

model.wv.most_similar("kichta")
[('liasse', 0.878665566444397),
 ('sse-lia', 0.8552991151809692),
 ('kishta', 0.8535938262939453),
 ('kich', 0.7646669149398804),
 ('skalape', 0.7576569318771362),
 ('moula', 0.7466527223587036),
 ('valise', 0.7429592609405518),
 ('sacoche', 0.7324921488761902),
 ('mallette', 0.7247079014778137),
 ('re-pai', 0.7060815095901489)]
```

To find the word that doesn't match in a list of words

```
model.wv.doesnt_match(["racli","gow","gadji","fimbi","boug"])
'boug'

model.wv.doesnt_match(["Zidane","Mbappé","Ronaldo","Messi","Jordan"])
'Jordan'
```
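To give an intuition for how an odd-one-out query of this kind can work, the sketch below picks the word whose normalized vector is least aligned with the normalized mean of the group. It illustrates the general idea on made-up 2-D vectors and is not Gensim's own code; `odd_one_out` and the `toy` values are hypothetical.

```python
import numpy as np


def odd_one_out(vectors: dict) -> str:
    """Return the key whose unit vector is least aligned with the group mean."""
    words = list(vectors)
    raw = [np.asarray(vectors[w], dtype=float) for w in words]
    unit = np.array([v / np.linalg.norm(v) for v in raw])  # unit-length rows
    mean = unit.mean(axis=0)
    mean /= np.linalg.norm(mean)                            # unit-length centroid
    scores = unit @ mean                                    # cosine to the centroid
    return words[int(np.argmin(scores))]


# Made-up vectors: three point roughly the same way, one does not.
toy = {
    "Zidane": [1.0, 0.1],
    "Messi": [0.9, 0.2],
    "Ronaldo": [1.0, 0.0],
    "Jordan": [0.1, 1.0],
}
print(odd_one_out(toy))
```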

To find the similarity between two words

```
model.wv.similarity("kichta", "moula")
0.7466528

model.wv.similarity("bonheur", "moula")
0.16985293
```
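The score returned by `similarity` is the cosine similarity of the two word vectors: their dot product divided by the product of their norms. A minimal sketch of that computation, using a hypothetical `cosine_similarity` helper and made-up vectors:

```python
import numpy as np


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Made-up 2-D vectors: identical direction, 45 degrees apart, orthogonal.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
print(cosine_similarity([1.0, 0.0], [1.0, 1.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

With the loaded model, `cosine_similarity(model.wv["kichta"], model.wv["moula"])` should agree with `model.wv.similarity("kichta", "moula")` up to floating-point precision.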

Or even get the vector representation of a word

```
model.wv['ekip']
array([ 1.4757039e-01,  ... 1.1260221e+00],
      dtype=float32)
```

Purpose and Disclaimer

This model is designed for academic and research purposes only. It is not intended for commercial use. The creators of this model do not endorse or promote any specific views or opinions that may be represented in the dataset.

Please mention @RapMinerz if you use our models.

Contact

For any questions or issues, please contact the repository owner, RapMinerz, at rapminerz.contact@gmail.com.