Wav2Vec2_XLS-R-300m_Nepali_ASR

This model is a fine-tuned version of facebook/wav2vec2-xls-r-300m on:

[Large Nepali ASR training data set from OpenSLR (SLR-54)](https://www.openslr.org/54/)
[Common Voice Corpus 17.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0)

Model description

This model is the 300-million-parameter Wav2Vec2 XLS-R checkpoint fine-tuned for Nepali Automatic Speech Recognition (ASR). The results below are reported on the OpenSLR test split.

WER on OpenSLR: 16.82%
CER on OpenSLR: 2.72%
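
For readers unfamiliar with the two metrics, here is a minimal sketch of how WER and CER can be computed from reference and predicted transcripts. It is not the evaluation script used for this model; the function names are illustrative, and the CER here counts Unicode code points (including spaces), which is an assumption that other toolkits may not share.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(references, hypotheses):
    """Word error rate: word-level edits divided by reference word count."""
    errors = sum(edit_distance(r.split(), h.split())
                 for r, h in zip(references, hypotheses))
    total = sum(len(r.split()) for r in references)
    return errors / total


def cer(references, hypotheses):
    """Character error rate over code points (assumption: spaces count)."""
    errors = sum(edit_distance(list(r), list(h))
                 for r, h in zip(references, hypotheses))
    total = sum(len(r) for r in references)
    return errors / total


# Toy example: one wrong word out of three.
refs = ["म घर जान्छु"]
hyps = ["म घर जान्छ"]
print(f"WER: {wer(refs, hyps):.2%}  CER: {cer(refs, hyps):.2%}")
```

Both functions pool errors over the whole test set before dividing, so long utterances weigh more than short ones.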

Intended uses & limitations

Research on Nepali ASR
Transcription of Nepali audio
Further fine-tuning
Limitations:
The model was trained on the OpenSLR Nepali ASR dataset, which on inspection was found to be quite noisy and inconsistent.
Because of resource limitations, utterances longer than 5 seconds were filtered out of the dataset for both training and evaluation.
Numerals were filtered out as well.
The vocabulary does not contain every character of the Nepali alphabet.
The model may perform poorly on audio segments longer than 5 seconds. Longer audio needs some mechanism to split it into chunks of about 5 seconds, which may increase processing time.
It may struggle with background noise and overlapping speech.
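
One way to work around the 5-second limitation is to split longer recordings into fixed-length chunks and transcribe each chunk separately. The sketch below only computes the chunk boundaries. The function name, the 16 kHz default sample rate, and the optional overlap are assumptions for illustration, not part of this model's pipeline.

```python
def chunk_bounds(num_samples, sample_rate=16_000, chunk_s=5.0, overlap_s=0.0):
    """Return (start, end) sample indices that cover the whole signal.

    sample_rate, chunk_s and overlap_s are illustrative defaults.
    """
    chunk = int(chunk_s * sample_rate)
    step = chunk - int(overlap_s * sample_rate)
    if step <= 0:
        raise ValueError("overlap_s must be smaller than chunk_s")

    bounds = []
    start = 0
    while start < num_samples:
        end = min(start + chunk, num_samples)
        bounds.append((start, end))
        if end == num_samples:  # last chunk reached the end of the audio
            break
        start += step
    return bounds


# A 12-second recording at 16 kHz becomes two full chunks and one short one.
for start, end in chunk_bounds(12 * 16_000):
    print(start, end)
```

Cutting at fixed positions can split a word in half, so a small overlap or silence-based segmentation may give cleaner transcripts. The Transformers automatic-speech-recognition pipeline also accepts a `chunk_length_s` argument that handles chunking internally, which may be simpler in practice.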

Training and evaluation data

Common Voice v17.0

This model was fine-tuned on OpenSLR-54 (Nepali ASR training dataset) and Common Voice Corpus v17.0.
The model was first trained on the Common Voice v17.0 ne-NP subset, which contains about 2 hours of speech, of which 1 hour has been manually validated.
Because the dataset is very small, we first combined the validated and other splits, giving a total of 1337 utterances.
We preprocessed the data by removing all punctuation and symbols.
We then used 80% of the utterances for training and 10% for evaluation.
For testing, we used the test split of 217 utterances. (These utterances may also have been present in the training split.)
The model was trained for 30 epochs. The WER plateaued, fluctuating between about 37% and 39%.
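
The punctuation and symbol removal described above can be sketched with Unicode character categories. This is an illustration, not the preprocessing code used for this model; the function name is hypothetical, and replacing punctuation with a space rather than deleting it outright is an assumption.

```python
import unicodedata


def strip_punctuation(text):
    """Drop punctuation (P*) and symbol (S*) characters, then tidy whitespace.

    Devanagari vowel signs and the virama are combining marks (M*), so they
    are kept; the danda is punctuation, so it is dropped.
    """
    kept = []
    for ch in text:
        category = unicodedata.category(ch)
        kept.append(" " if category[0] in "PS" else ch)
    # Collapse the runs of spaces left behind by removed characters.
    return " ".join("".join(kept).split())


print(strip_punctuation("नमस्ते, संसार!"))
```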

OpenSLR Nepali ASR training data

The model was then trained further on the larger OpenSLR Nepali ASR training dataset, which has 157,000 utterances.
First, numerals were removed because the spoken utterances were inconsistent with their transcriptions.
Segments longer than 5 seconds were also removed because of resource limitations.
Less frequently used characters were removed to reduce the vocabulary size.
This left 136083 utterances in the whole dataset. The dataset has been uploaded here.
Of these, 80% were used for training, 10% for evaluation, and 10% for testing.
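
The filtering steps above might look roughly like the sketch below. The dictionary keys `text` and `duration`, the rare-character threshold, and the choice to drop whole utterances (rather than edit their transcripts) are all assumptions made for illustration; the original filtering code is not shown here.

```python
import re
from collections import Counter

# ASCII digits and Devanagari digits (U+0966 to U+096F).
DIGITS = re.compile(r"[0-9\u0966-\u096F]")


def filter_utterances(items, max_seconds=5.0, min_char_count=2):
    """Keep short, numeral-free utterances that use only common characters.

    items: list of dicts with hypothetical keys 'text' and 'duration' (seconds).
    min_char_count: hypothetical threshold below which a character is "rare".
    """
    kept = [u for u in items
            if u["duration"] <= max_seconds and not DIGITS.search(u["text"])]
    counts = Counter(ch for u in kept for ch in u["text"] if not ch.isspace())
    rare = {ch for ch, n in counts.items() if n < min_char_count}
    return [u for u in kept if not rare & set(u["text"])]


def split_sizes(n, train_pct=80, eval_pct=10):
    """Sizes of train/eval/test partitions; the test set takes the remainder."""
    n_train = n * train_pct // 100
    n_eval = n * eval_pct // 100
    return n_train, n_eval, n - n_train - n_eval


print(split_sizes(136083))
```

With integer rounding as above, the 136083 utterances would split into 108866 for training, 13608 for evaluation, and 13609 for testing; the exact counts used for this model may differ slightly depending on how the split was rounded.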

Training procedure

Training on CommonVoice 17.0

The following hyperparameters were used during training:

learning_rate: 3e-04
train_batch_size: 16
eval_batch_size: 8
seed: 42
gradient_accumulation_steps: 2
total_train_batch_size: 32
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 400
num_epochs: 30
mixed_precision_training: Native AMP
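
As a rough guide for reproducing this stage, the values above map onto Hugging Face `TrainingArguments` keyword names as sketched below. The mapping is an assumption (for instance, that "Native AMP" corresponds to `fp16=True` and that training ran on a single device); it is not the original training script.

```python
# Keyword names follow transformers.TrainingArguments; values are from the list above.
common_voice_args = dict(
    learning_rate=3e-4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=8,
    seed=42,
    gradient_accumulation_steps=2,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    warmup_steps=400,
    num_train_epochs=30,
    fp16=True,  # assumption: "Native AMP" means fp16 mixed precision
)


def effective_batch_size(args, num_devices=1):
    """Samples per optimizer step; num_devices=1 is an assumption."""
    return (args["per_device_train_batch_size"]
            * args["gradient_accumulation_steps"]
            * num_devices)


print(effective_batch_size(common_voice_args))  # 32, matching total_train_batch_size
```

The dictionary could then be unpacked as `TrainingArguments(output_dir="./out", **common_voice_args)`, where `./out` is a placeholder path.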

Initial Training on OpenSLR-54 for 16 epochs

The following hyperparameters were used:

learning_rate: 3e-04
train_batch_size: 16
eval_batch_size: 8
seed: 42
gradient_accumulation_steps: 2
warmup_steps: 500
total_train_batch_size: 32
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 500
num_epochs: 16
mixed_precision_training: Native AMP

Further Training on OpenSLR-54 for 3 more epochs

The following hyperparameters were used:

learning_rate: 2e-5
train_batch_size: 16
eval_batch_size: 8
seed: 42
gradient_accumulation_steps: 2
total_train_batch_size: 32
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 700
num_epochs: 3
mixed_precision_training: Native AMP
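
To make the "linear" scheduler with warmup concrete, the sketch below computes the learning rate at a given optimizer step: it rises linearly from 0 to the peak during warmup, then decays linearly to 0. The total step count used in the example is a made-up number for illustration, since the actual number of optimizer steps is not stated here.

```python
def linear_warmup_decay(step, base_lr, warmup_steps, total_steps):
    """Learning rate at `step` for linear warmup followed by linear decay."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Values from the final stage above; TOTAL_STEPS is hypothetical.
BASE_LR = 2e-5
WARMUP_STEPS = 700
TOTAL_STEPS = 10_000

for step in (0, 350, 700, 5_350, 10_000):
    lr = linear_warmup_decay(step, BASE_LR, WARMUP_STEPS, TOTAL_STEPS)
    print(f"step {step:>6}: lr = {lr:.2e}")
```

Note that with only 3 epochs, a 700-step warmup takes up a noticeable share of the run, so the model spends relatively little time at the peak learning rate of 2e-5.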

Framework versions

Transformers 4.44.2
Pytorch 2.4.1+cu121
Datasets 3.0.0
Tokenizers 0.19.1
