Wav2Vec2_XLS-R-300m_Nepali_ASR

This model is a fine-tuned version of facebook/wav2vec2-xls-r-300m on OpenSLR-54 (Nepali ASR training dataset) and Common Voice Corpus v17.0.

Model description

This is the 300-million-parameter version of Wav2Vec2 XLS-R, fine-tuned for Nepali Automatic Speech Recognition (ASR). The results below are reported on the OpenSLR test split.

  • WER on OpenSLR: 16.82%
  • CER on OpenSLR: 2.72%
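
For readers unfamiliar with the two metrics, the sketch below shows how WER and CER are commonly computed: the edit distance between reference and hypothesis, divided by the reference length in words or characters. This is an illustrative implementation, not the evaluation script used for the numbers above; the function names are hypothetical, and it assumes CER is counted over Unicode code points (including spaces), which may differ from the tooling the authors used.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: code-point edit distance over reference length."""
    return edit_distance(reference, hypothesis) / len(reference)
```

Both functions assume a non-empty reference. Because a single wrong character makes the whole word count as an error, WER is typically much higher than CER, which is consistent with the gap between the two figures reported above.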

Intended uses & limitations

  • Research on Nepali ASR
  • Transcription of Nepali audio
  • Further fine-tuning

Limitations:

  • The model was trained on the OpenSLR Nepali ASR dataset, which on inspection proved to be quite noisy and inconsistent.
  • Due to resource limitations, utterances longer than 5 seconds were filtered out of the dataset for both training and evaluation.
  • Numerals were filtered out as well.
  • The vocabulary does not contain every Nepali character.
  • The model may perform poorly on audio segments longer than 5 seconds. Longer audio needs to be split into chunks of at most 5 seconds first, which may increase processing time.
  • The model may struggle with background noise and overlapping speech.
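
One simple way to work around the 5-second limit is to cut long audio into fixed-length chunks before transcription. The sketch below only computes the chunk boundaries; it assumes 16 kHz mono audio (the sampling rate XLS-R checkpoints expect), and the function name and defaults are hypothetical rather than part of this model's code.

```python
SAMPLE_RATE = 16_000  # XLS-R checkpoints expect 16 kHz mono audio
MAX_SECONDS = 5.0     # mirrors the 5-second cap used during training


def chunk_bounds(num_samples, sample_rate=SAMPLE_RATE, max_seconds=MAX_SECONDS):
    """Return (start, end) sample indices splitting the audio into chunks of at most max_seconds."""
    chunk = int(sample_rate * max_seconds)
    return [(start, min(start + chunk, num_samples))
            for start in range(0, num_samples, chunk)]


# A 12-second clip is split into 5 s + 5 s + 2 s.
bounds = chunk_bounds(12 * SAMPLE_RATE)
```

Hard cuts like these can split a word in two. The Transformers automatic-speech-recognition pipeline offers `chunk_length_s` and `stride_length_s` arguments that chunk with overlap instead; whether that gives better results with this model has not been verified here.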

Training and evaluation data

Common Voice v17.0

  • This model was fine-tuned on OpenSLR-54 (Nepali ASR training dataset) and Common Voice Corpus v17.0.
  • The model was first trained on the Common Voice v17.0 ne-NP subset, which contains about 2 hours of voice data, of which 1 hour has been manually validated.
  • Because the dataset is very small, we first combined the validated and other splits, giving a total of 1337 utterances.
  • We preprocessed the data by removing all punctuation and symbols.
  • We then used 80% of the utterances for training and 10% for evaluation.
  • For testing, we used the test split of 217 utterances. (These utterances may also be present in the training split.)
  • The model was trained for 30 epochs, with the WER eventually fluctuating between about 37% and 39%.
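
The preprocessing and split described above can be sketched as follows. This is a minimal illustration, not the script we ran: it assumes "punctuation and symbols" means the Unicode P* and S* categories (which covers the Devanagari danda "।"), and the function names, the shuffle, and the seed are hypothetical.

```python
import random
import unicodedata


def strip_punctuation(text):
    """Drop characters in the Unicode punctuation (P*) and symbol (S*) categories."""
    kept = (ch for ch in text if unicodedata.category(ch)[0] not in ("P", "S"))
    return " ".join("".join(kept).split())  # also collapses repeated whitespace


def split_utterances(items, train_frac=0.8, eval_frac=0.1, seed=42):
    """Shuffle, then return (train, eval, remainder) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_eval = int(len(shuffled) * eval_frac)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_eval],
            shuffled[n_train + n_eval:])


# With 1337 utterances this yields 1069 for training and 133 for evaluation.
train, dev, rest = split_utterances(range(1337))
```

Devanagari vowel signs and the virama belong to the mark categories (M*), so they survive this filter and the words stay intact.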

OpenSLR Nepali ASR training data

  • The model was then trained further on the larger OpenSLR Nepali ASR training dataset, which has 157,000 utterances.
  • First, numerals were removed because the utterances were inconsistent with their transcriptions.
  • Segments longer than 5 seconds were removed because of resource limitations.
  • Less frequently used characters were removed to reduce the vocabulary size.
  • This left 136,083 utterances in the whole dataset. The dataset has been uploaded here.
  • Of these, 80% were used for training, 10% for evaluation, and 10% for testing.
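
The three filters above can be sketched as a single pass over (transcript, duration) pairs. This is an illustration under stated assumptions, not the original filtering code: it assumes that an utterance is dropped entirely when it contains a numeral or a rare character, and the `min_char_count` threshold and function names are hypothetical.

```python
from collections import Counter

MAX_SECONDS = 5.0  # length cap described above


def has_numeral(text):
    """True if the transcript contains any decimal digit (ASCII or Devanagari)."""
    return any(ch.isdigit() for ch in text)


def filter_utterances(rows, min_char_count=2, max_seconds=MAX_SECONDS):
    """rows: list of (transcript, duration_seconds). Returns the rows that pass all filters."""
    # 1. Drop long segments and transcripts that contain numerals.
    kept = [(t, d) for t, d in rows if d <= max_seconds and not has_numeral(t)]
    # 2. Count how often each non-space character occurs in what remains.
    counts = Counter(ch for t, _ in kept for ch in t if not ch.isspace())
    rare = {ch for ch, n in counts.items() if n < min_char_count}
    # 3. Drop utterances that use any rare character, shrinking the vocabulary.
    return [(t, d) for t, d in kept if not rare.intersection(t)]
```

An alternative reading of the steps is to delete only the offending characters and keep the rest of the transcript, but that would leave the audio and text mismatched, so this sketch drops the whole utterance.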

Training procedure

Training on Common Voice v17.0

The following hyperparameters were used during training:

  • learning_rate: 3e-04
  • train_batch_size: 16
  • eval_batch_size: 8
  • seed: 42
  • gradient_accumulation_steps: 2
  • total_train_batch_size: 32
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 400
  • num_epochs: 30
  • mixed_precision_training: Native AMP
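
Two of the values above follow from the others, and the sketch below makes the arithmetic explicit. The total batch size is the per-device batch size times the gradient accumulation steps (16 × 2 = 32, which implies a single device), and a linear scheduler with warmup ramps the learning rate from 0 to its peak over the warmup steps, then decays it linearly to 0. The function names are hypothetical, and `total_steps` is an assumption because the number of optimizer steps is not listed here.

```python
def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Number of utterances that contribute to one optimizer step."""
    return per_device_batch * grad_accum_steps * num_devices


def linear_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to base_lr, then linear decay to 0 at total_steps."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))


# Hypothetical run of 1000 optimizer steps with the settings listed above.
peak = linear_lr(400, base_lr=3e-4, warmup_steps=400, total_steps=1000)
```

With these settings the learning rate reaches its peak of 3e-04 at step 400 and is halved again by step 700 of the hypothetical 1000.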

Initial Training on OpenSLR-54 for 16 epochs

The following hyperparameters were used:

  • learning_rate: 3e-04
  • train_batch_size: 16
  • eval_batch_size: 8
  • seed: 42
  • gradient_accumulation_steps: 2
  • total_train_batch_size: 32
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 500
  • num_epochs: 16
  • mixed_precision_training: Native AMP

Further training on OpenSLR-54 for 3 more epochs

The following hyperparameters were used:

  • learning_rate: 2e-05
  • train_batch_size: 16
  • eval_batch_size: 8
  • seed: 42
  • gradient_accumulation_steps: 2
  • total_train_batch_size: 32
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 700
  • num_epochs: 3
  • mixed_precision_training: Native AMP

Framework versions

  • Transformers 4.44.2
  • PyTorch 2.4.1+cu121
  • Datasets 3.0.0
  • Tokenizers 0.19.1