Accuracy issues with retrained tiny model for Marathi numbers
We have a use case like the one shown in https://youtu.be/L3L4mEszzTs
Basically we want to build a simple 300-class classifier for the numbers 1 through 300. We want the complete Android app, including the model, to stay under a couple of hundred MB, hence I picked the tiny model.
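As a rough sanity check on that size budget (assuming whisper-tiny's roughly 39M parameters, a figure from the Whisper paper; the arithmetic below is my own sketch, not from the app):

```python
params = 39_000_000        # approx. parameter count of whisper-tiny (assumption)
bytes_fp32 = params * 4    # each parameter stored as a 32-bit float

# ≈ 156 MB as a full-precision checkpoint, so the tiny model fits the
# couple-of-hundred-MB budget even before any quantization
print(bytes_fp32 / 1e6)
```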
Approach
- We collected audio samples, 50 for each number. For this experiment we tried only the numbers 1 to 20 (not the complete 300). I prepared a dataset and pushed it to the hub with: https://github.com/sameermahajan/whisper/blob/main/CreateDataset.py As you can see, I have given the samples labels corresponding to the numbers.
- I then retrained the pretrained tiny model with: https://github.com/sameermahajan/whisper/blob/main/Retrain.py Note that I couldn't use the dataset directly and had to work off of my in-memory copy due to https://huggingface.co/spaces/openai/whisper/discussions/78 I used only 100 epochs on a CPU for this experiment.
- I then pushed this model to the hub at: https://huggingface.co/SameerMahajan/whisper-tiny-retrained
- I tried this model for prediction on the same numbers with: https://github.com/sameermahajan/whisper/blob/main/MyMarathiModel.py
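The paths in the sample output below suggest each file's class label is encoded in its parent directory (e.g. `./samples/6/6_33.wav` belongs to class 6). A minimal sketch of deriving labels that way (the helper name and the layout assumption are mine, not code from CreateDataset.py):

```python
from pathlib import Path

def label_for(path_str: str) -> int:
    """Derive the integer class label from a path like ./samples/6/6_33.wav,
    assuming the parent directory name is the spoken number."""
    return int(Path(path_str).parent.name)

label_for("./samples/6/6_33.wav")  # → 6
```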
I understand that the predicted text will be off, but it should correspond to the labels 1, 2, ..., 20. However, the values I see are way off, e.g.:
./samples/6/6_33.wav {'text': "' Sa'am."}
./samples/6/6_34.wav {'text': "' Peace."}
./samples/6/6_35.wav {'text': "' Sa'am."}
./samples/6/6_36.wav {'text': "' Peace."}
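For what it's worth, transcripts like "' Sa'am." will never even fuzzy-match the label strings "1" through "20", which a quick sketch makes concrete (purely illustrative; the label strings and the matching scheme are my assumptions, not code from the repo):

```python
import difflib

LABELS = [str(n) for n in range(1, 21)]  # the 20 class labels used in the experiment

def nearest_label(transcript: str):
    """Strip surrounding quotes/punctuation, then fuzzy-match against the labels."""
    cleaned = transcript.strip(" '.").lower()
    matches = difflib.get_close_matches(cleaned, LABELS, n=1)
    return matches[0] if matches else None

nearest_label("' Sa'am.")  # no match: the decoder emits free text, not label strings
nearest_label("6")         # → "6"
```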
- I also tried this model on my live audio using https://github.com/sameermahajan/whisper/blob/main/LiveDemo.py and it is way off.
Any ideas on what I am missing? Is there a basic problem with the approach, and how can it be addressed?
thanks,
Sameer
Various code snippets for this are in my GitHub repo at https://github.com/sameermahajan/whisper if you want to review, try out, or experiment with them.
Answered in detail here: https://huggingface.co/spaces/openai/whisper/discussions/66#641da99e7e197635034c1822