---
library_name: transformers
base_model:
- facebook/wav2vec2-xls-r-300m
tags:
- ASR
- Nepali ASR
- OpenSLR Nepali
- Nepali ASR Wav2Vec2
- XLS-R
datasets:
- iamTangsang/OpenSLR54-Nepali-ASR
- mozilla-foundation/common_voice_17_0
metrics:
- wer
- cer
pipeline_tag: automatic-speech-recognition
license: mit
language:
- ne
---


# Wav2Vec2_XLS-R-300m_Nepali_ASR

This model is a fine-tuned version of [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) on:
- [Large Nepali ASR training data set from OpenSLR (SLR-54)](https://www.openslr.org/54/)
- [Common Voice Corpus 17.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0)

## Model description

This model fine-tunes the 300M-parameter Wav2Vec2 XLS-R checkpoint for Nepali automatic speech recognition. The reported results are on the OpenSLR test split:
- WER on OpenSLR: 16.82%
- CER on OpenSLR: 2.72%
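
For quick testing, here is a minimal inference sketch. The repo id and audio file below are placeholders (the exact checkpoint path is an assumption):

```python
import torch
import librosa
from transformers import AutoProcessor, AutoModelForCTC

# Hypothetical repo id for this checkpoint; adjust to the actual model path.
MODEL_ID = "iamTangsang/Wav2Vec2_XLS-R-300m_Nepali_ASR"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForCTC.from_pretrained(MODEL_ID)

# Wav2Vec2 XLS-R expects 16 kHz mono input.
speech, _ = librosa.load("sample_nepali.wav", sr=16_000, mono=True)
inputs = processor(speech, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Greedy CTC decoding.
pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids)[0])
```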


## Intended uses & limitations
- Research on Nepali ASR
- Transcription of Nepali audio
- Further fine-tuning

### Limitations
- The model is trained on the OpenSLR Nepali ASR dataset, which on inspection turned out to be quite noisy and inconsistent.
- Due to resource limitations, utterances longer than 5 seconds were filtered out of the dataset during training and evaluation.
- Numerals were filtered out as well.
- The vocabulary does not cover all Nepali characters.
- The model may perform poorly on audio segments longer than 5 seconds; alternatively, audio can be segmented into 5-second chunks (see the sketch after this list), which adds processing time.
- It may struggle with background noise and overlapping speech.
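
A minimal sketch of that chunked-inference workaround using the `transformers` pipeline's built-in chunking (the repo id is an assumption; substitute the actual checkpoint path):

```python
from transformers import pipeline

# Hypothetical repo id; replace with the actual checkpoint path.
asr = pipeline(
    "automatic-speech-recognition",
    model="iamTangsang/Wav2Vec2_XLS-R-300m_Nepali_ASR",
)

# chunk_length_s splits long audio into 5 s windows; stride_length_s keeps
# 1 s of overlap on each side so words cut at a chunk boundary are recovered.
result = asr("long_nepali_audio.wav", chunk_length_s=5, stride_length_s=(1, 1))
print(result["text"])
```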

## Training and evaluation data

### Common Voice v17.0
- This model was fine-tuned on OpenSLR-54 (the Nepali ASR training dataset) and Common Voice Corpus v17.0.
- Initially, the model was trained on [Common Voice v17.0 ne-NP](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0/viewer/ne-NP), which consists of about 2 hours of voice data, of which roughly 1 hour has been manually validated.
- Because the dataset is very small, we first combined the `validated` and `other` splits, giving a total of 1,337 utterances.
- We preprocessed the data by removing all punctuation and symbols (see the cleanup sketch after this list).
- We then used 80% of the utterances for training and 10% for evaluation.
- The `test` split, consisting of 217 utterances, was used for testing. (Some of its utterances may also have been present in the `train` split.)
- The model was trained for 30 epochs, after which the WER fluctuated between 37% and 39%.
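
A minimal sketch of that text cleanup; the exact character set kept is not documented, so keeping only the Devanagari block plus whitespace is an assumption:

```python
import re

def clean_transcript(text: str) -> str:
    """Strip punctuation and symbols; assumed rule: keep Devanagari and spaces."""
    text = re.sub(r"[^\u0900-\u097F\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()

print(clean_transcript("नमस्ते, संसार!"))  # -> नमस्ते संसार
```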

### OpenSLR Nepali ASR training data
- The model was then further trained on the larger OpenSLR Nepali ASR training dataset, which contains about 157,000 utterances.
- First, utterances containing numerals were removed, since their spoken forms were inconsistent with the transcriptions.
- Segments longer than 5 seconds were removed because of resource limitations.
- Rarely used characters were removed to reduce the vocabulary size.
- This left 136,083 utterances in the final dataset, which has been uploaded [here](https://huggingface.co/datasets/iamTangsang/OpenSLR54-Nepali-ASR).
- 80% was used for training, 10% for evaluation, and 10% for testing (a filtering and splitting sketch follows this list).
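
A hedged sketch of this filtering and splitting with the `datasets` library; the column names (`audio`, `transcription`) are assumptions about the dataset's schema:

```python
from datasets import Audio, load_dataset

ds = load_dataset("iamTangsang/OpenSLR54-Nepali-ASR", split="train")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

def keep(example):
    audio = example["audio"]
    duration = len(audio["array"]) / audio["sampling_rate"]
    # str.isdigit() is True for Devanagari digits (०-९) as well as 0-9.
    has_numerals = any(ch.isdigit() for ch in example["transcription"])
    return duration <= 5.0 and not has_numerals

ds = ds.filter(keep)

# 80/10/10 split: hold out 20%, then halve the holdout into eval and test.
splits = ds.train_test_split(test_size=0.2, seed=42)
heldout = splits["test"].train_test_split(test_size=0.5, seed=42)
train_ds, eval_ds, test_ds = splits["train"], heldout["train"], heldout["test"]
```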

## Training procedure

### Training on Common Voice v17.0

The following hyperparameters were used during training:
- learning_rate: 3e-04
- train_batch_size: 16
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 400
- num_epochs: 30
- mixed_precision_training: Native AMP
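
These settings map directly onto `transformers.TrainingArguments`; a sketch, with `output_dir` and any unlisted values as placeholders:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="wav2vec2-xlsr-nepali-cv17",  # placeholder path
    learning_rate=3e-4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,  # effective train batch size: 32
    lr_scheduler_type="linear",
    warmup_steps=400,
    num_train_epochs=30,
    seed=42,
    fp16=True,  # Native AMP mixed precision
)
```

The OpenSLR runs below reuse this recipe, changing only the learning rate, warmup steps, and epoch count.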

### Initial Training on OpenSLR-54 for 16 epochs

The following hyperparameters were used:
- learning_rate: 3e-04
- train_batch_size: 16
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- num_epochs: 16
- mixed_precision_training: Native AMP

### Further Training on OpenSLR-54 for 3 more epochs

We used the following:
- learning_rate: 2e-5
- train_batch_size: 16
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 700
- num_epochs: 3
- mixed_precision_training: Native AMP
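
For reference, the reported WER and CER can be computed with the `evaluate` library; a minimal sketch (the strings below are dummies standing in for real test-split decodings and references):

```python
import evaluate

wer = evaluate.load("wer")
cer = evaluate.load("cer")

# Dummy examples standing in for decoded model outputs and ground truth.
predictions = ["नमस्ते संसार"]
references = ["नमस्ते संसार"]

print("WER:", wer.compute(predictions=predictions, references=references))
print("CER:", cer.compute(predictions=predictions, references=references))
```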


### Framework versions

- Transformers 4.44.2
- Pytorch 2.4.1+cu121
- Datasets 3.0.0
- Tokenizers 0.19.1