Could you let us know how you fine-tuned your model?

by yilmazay - opened

Hi,
Thanks for sharing your fine-tuned model and its performance scores.
As you said in the README, your model may be biased toward French.
Therefore, to get good results in another language, it needs to be fine-tuned on a dataset in that language.
I would appreciate it if you could share your fine-tuning method, or even your script.
Thanks in advance,
Yılmaz A.

I was wondering the exact same thing. I'd like to adapt it to Canadian French, so that would be a great fine-tuning job starting from your model!
Thanks
LP

La Javaness org

Hi,

Thank you for your interest. The model was trained mainly with Hugging Face's framework, so it should be quite straightforward to fine-tune it using the datasets and transformers libraries. Unfortunately, I cannot share my training script directly, and in any case the Hugging Face packages have had several updates over the last months, so my script may already be obsolete.

However, I can suggest going through the HF documentation on audio classification; it covers pretty much everything you need to build and train your own model on your data.
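Roughly, the preprocessing side looks like the snippet below. This is only a minimal sketch in the spirit of that documentation, not my actual script: the checkpoint name, data path, and the `audiofolder` loader are placeholders you would replace with your own setup.

```python
from datasets import Audio, load_dataset
from transformers import AutoFeatureExtractor

# Placeholder checkpoint: swap in a wav2vec2 model pretrained on your target language.
checkpoint = "facebook/wav2vec2-base"
feature_extractor = AutoFeatureExtractor.from_pretrained(checkpoint)

# Any datasets loader that yields an "audio" column works; "audiofolder" is one option.
dataset = load_dataset("audiofolder", data_dir="path/to/your/clips")
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))  # wav2vec2 expects 16 kHz

def preprocess(batch):
    # Turn raw waveforms into model inputs, truncating clips to at most 5 seconds.
    arrays = [example["array"] for example in batch["audio"]]
    return feature_extractor(
        arrays,
        sampling_rate=16_000,
        max_length=16_000 * 5,
        truncation=True,
    )

encoded_dataset = dataset.map(preprocess, remove_columns=["audio"], batched=True)
```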

Overall, I used a quite simple training process with standard hyper-parameters for this type of task. The important things to keep in mind are listed below (a rough sketch putting them together follows the list):

  • Audio inputs are MP3 files sampled at 16 kHz (standard for voice processing), with lengths varying roughly from 1 to 5 seconds.
  • The model is trained as a multi-label classifier, so I used a binary cross-entropy loss (specifically BCEWithLogitsLoss from torch).
  • If you are fine-tuning on a language other than French, I suggest not using my model directly; preferably fine-tune a wav2vec2 model trained on the target language.
  • Main training hyper-parameters:
    - learning_rate = 1e-4
    - batch_size = 16
    - train_epoch = 40
    - weight_decay = 0.02 (to limit overfitting)
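
Putting those points together, a training sketch could look like the following. Again, this is not my exact script: the checkpoint name and label count are placeholders, and it reuses the `encoded_dataset` and `feature_extractor` from the snippet above. The custom Trainer is just one way to plug BCEWithLogitsLoss in, since the stock audio-classification head uses cross-entropy.

```python
import torch
from transformers import AutoModelForAudioClassification, Trainer, TrainingArguments

num_labels = 4  # placeholder: set to the size of your label set

class MultiLabelTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # Labels must be float multi-hot vectors of shape (batch_size, num_labels).
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        loss = torch.nn.BCEWithLogitsLoss()(outputs.logits, labels)
        return (loss, outputs) if return_outputs else loss

model = AutoModelForAudioClassification.from_pretrained(
    "facebook/wav2vec2-base",  # placeholder: use a checkpoint for your target language
    num_labels=num_labels,
)

training_args = TrainingArguments(
    output_dir="wav2vec2-finetuned",
    learning_rate=1e-4,
    per_device_train_batch_size=16,
    num_train_epochs=40,
    weight_decay=0.02,
)

trainer = MultiLabelTrainer(
    model=model,
    args=training_args,
    train_dataset=encoded_dataset["train"],  # from the preprocessing sketch above
    tokenizer=feature_extractor,             # lets the Trainer pad batches of audio features
)
trainer.train()
```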

I hope this answers your question. If you have any further specific questions, let me know and I will try to answer.

Jules
