Accuracy issues with retrained tiny model for Marathi numbers
We have a use case like the one shown in https://youtu.be/L3L4mEszzTs
Basically we want to build a simple 300-class classifier for the numbers 1 through 300. We want the complete Android app, including the model, to stay under a couple of hundred MB, hence I picked the tiny model.
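As a rough sanity check on that size budget (assuming whisper-tiny's roughly 39M parameters, a figure from the Whisper paper; the arithmetic below is my own sketch, not from the app):

```python
params = 39_000_000        # approx. parameter count of whisper-tiny (assumption)
bytes_fp32 = params * 4    # each parameter stored as a 32-bit float

# ≈ 156 MB as a full-precision checkpoint, so the tiny model fits the
# couple-of-hundred-MB budget even before any quantization
print(bytes_fp32 / 1e6)
```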
Approach
- We collected audio samples, 50 for each number. For this experiment we tried only the numbers 1 to 20 (not the complete 300). I prepared a dataset and pushed it to the hub with: https://github.com/sameermahajan/whisper/blob/main/CreateDataset.py As you can see, I have given the samples labels corresponding to the numbers.
- I then retrained the pretrained tiny model with: https://github.com/sameermahajan/whisper/blob/main/Retrain.py Note that I couldn't use the dataset directly and had to work off of my in-memory copy due to https://huggingface.co/spaces/openai/whisper/discussions/78 I used only 100 epochs on a CPU for this experiment.
- I then pushed this model to the hub at: https://huggingface.co/SameerMahajan/whisper-tiny-retrained
- I tried this model for prediction on the same numbers with: https://github.com/sameermahajan/whisper/blob/main/MyMarathiModel.py
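The paths in the sample output below suggest each file's class label is encoded in its parent directory (e.g. `./samples/6/6_33.wav` belongs to class 6). A minimal sketch of deriving labels that way (the helper name and the layout assumption are mine, not code from CreateDataset.py):

```python
from pathlib import Path

def label_for(path_str: str) -> int:
    """Derive the integer class label from a path like ./samples/6/6_33.wav,
    assuming the parent directory name is the spoken number."""
    return int(Path(path_str).parent.name)

label_for("./samples/6/6_33.wav")  # → 6
```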
I understand that the predicted text will be off, but it should correspond to the labels 1, 2, ..., 20. However, the values I see are way off, e.g.:
./samples/6/6_33.wav {'text': "' Sa'am."}
./samples/6/6_34.wav {'text': "' Peace."}
./samples/6/6_35.wav {'text': "' Sa'am."}
./samples/6/6_36.wav {'text': "' Peace."}
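For what it's worth, transcripts like "' Sa'am." will never even fuzzy-match the label strings "1" through "20", which a quick sketch makes concrete (purely illustrative; the label strings and the matching scheme are my assumptions, not code from the repo):

```python
import difflib

LABELS = [str(n) for n in range(1, 21)]  # the 20 class labels used in the experiment

def nearest_label(transcript: str):
    """Strip surrounding quotes/punctuation, then fuzzy-match against the labels."""
    cleaned = transcript.strip(" '.").lower()
    matches = difflib.get_close_matches(cleaned, LABELS, n=1)
    return matches[0] if matches else None

nearest_label("' Sa'am.")  # no match: the decoder emits free text, not label strings
nearest_label("6")         # → "6"
```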
- I also tried this model on my live audio using https://github.com/sameermahajan/whisper/blob/main/LiveDemo.py and it is way off.
Any ideas on what I am missing? Is there a basic problem with the approach, and how can it be addressed?
thanks,
Sameer
Various code snippets for this are in my GitHub repo at https://github.com/sameermahajan/whisper if you want to review, try out, or experiment with them.
Answered in detail here: https://huggingface.co/spaces/openai/whisper/discussions/66#641da99e7e197635034c1822