---
library_name: transformers
base_model:
- facebook/wav2vec2-xls-r-300m
tags:
- ASR
- Nepali ASR
- OpenSLR Nepali
- Nepali ASR Wav2Vec2
- XLS-R
model-index:
- name: Wav2Vec2_XLS-R-300m_Nepali_ASR
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      type: iamTangsang/OpenSLR54-Nepali-ASR
      name: OpenSLR54-Nepali-ASR
      split: test
    metrics:
    - type: wer
      value: 16.82
      name: Test WER
    - type: cer
      value: 2.72
      name: Test CER
datasets:
- iamTangsang/OpenSLR54-Nepali-ASR
- mozilla-foundation/common_voice_17_0
metrics:
- wer
- cer
pipeline_tag: automatic-speech-recognition
license: mit
language:
- ne
---
# Wav2Vec2_XLS-R-300m_Nepali_ASR
This model is a fine-tuned version of [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) on:
- [Large Nepali ASR training data set from OpenSLR (SLR-54)](https://www.openslr.org/54/)
- [Common Voice Corpus 17.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0)
## Model description
This model is the 300-million-parameter Wav2Vec2 XLS-R checkpoint fine-tuned for Nepali automatic speech recognition. The reported results are on the OpenSLR test split.
- WER on OpenSLR: 16.82%
- CER on OpenSLR: 2.72%
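As a quick illustration of how the checkpoint can be used, here is a minimal inference sketch with the `transformers` API. The repository id and audio file name are assumptions; the model expects 16 kHz mono audio.
```python
import librosa
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Repo id assumed from this model card; adjust if the checkpoint lives elsewhere.
MODEL_ID = "iamTangsang/Wav2Vec2_XLS-R-300m_Nepali_ASR"

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# Load a (hypothetical) Nepali utterance resampled to 16 kHz mono.
speech, _ = librosa.load("nepali_utterance.wav", sr=16_000, mono=True)

inputs = processor(speech, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Greedy CTC decoding.
pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids)[0])
```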
## Intended uses & limitations
- Research on Nepali ASR
- Transcriptions on Nepali audio
- Further Fine-tuning
### Limitations
- The model is trained on the OpenSLR Nepali ASR dataset which upon inspection was found to be quite noisy and inconsistent.
- Due to resource limitations, utterances longer than 5 seconds were filtered out of the dataset during training and evaluation.
- Numerals were filtered out as well.
- The vocabulary doesn't contain every Nepali character.
- The model might perform poorly on audio segments longer than 5 seconds, so longer audio needs to be split into roughly 5-second chunks (see the chunking sketch after this list), which can increase processing time.
- May struggle with background noise and overlapping speech.
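One simple way to handle the 5-second limitation is the built-in chunking of the `transformers` ASR pipeline, which splits long audio into overlapping windows and stitches the CTC output back together. This is only a sketch under an assumed repo id and file name, not the authors' own segmentation strategy:
```python
from transformers import pipeline

# Repo id assumed from this model card.
asr = pipeline(
    "automatic-speech-recognition",
    model="iamTangsang/Wav2Vec2_XLS-R-300m_Nepali_ASR",
    chunk_length_s=5,        # match the ~5 s training segments
    stride_length_s=(1, 1),  # 1 s overlap on each side to smooth chunk boundaries
)

# Hypothetical long recording; the pipeline resamples it to 16 kHz internally.
print(asr("long_nepali_recording.wav")["text"])
```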
## Training and evaluation data
### Common Voice v17.0
- This model has been fine-tuned on OpenSLR-54 (Nepali ASR training dataset) and CommonVoice Corpus v17.0
- Initially, the model was trained on [CommonVoice v17.0 ne-NP](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0/viewer/ne-NP), which consists of about 2 hours of voice data, of which 1 hour has been manually validated.
- Because the dataset is very small, we first combined the `validated` and `other` splits, giving a total of 1337 utterances.
- We preprocessed the data by removing all punctuation and symbols (a short preprocessing sketch follows this list).
- We then used 80% of the total utterances for training and 10% for evaluation.
- The `test` split, consisting of 217 utterances, was used for testing. (Some of these utterances may also be present in the training data.)
- The model was trained for 30 epochs, after which the WER fluctuated between 37% and 39%.
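The punctuation and symbol cleanup could look roughly like the following. The exact character set the authors kept is not specified, so treating everything outside the Devanagari block and whitespace as removable, and stripping numerals, are assumptions:
```python
import re

# Drop anything outside the Devanagari block (U+0900–U+097F) and whitespace,
# plus the Devanagari danda/double danda punctuation marks.
_CLEAN_RE = re.compile(r"[^\u0900-\u097F\s]|[\u0964\u0965]")
# Western and Devanagari digits; whether digits are stripped at this stage is an assumption.
_DIGIT_RE = re.compile(r"[\u0966-\u096F0-9]")

def normalize_transcript(text: str) -> str:
    text = _DIGIT_RE.sub("", text)             # remove numerals
    text = _CLEAN_RE.sub("", text)             # remove punctuation and symbols
    return re.sub(r"\s+", " ", text).strip()   # collapse whitespace

print(normalize_transcript("नमस्ते, संसार! १२३"))  # -> "नमस्ते संसार"
```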
### OpenSLR Nepali ASR training data
- The model was then further trained on the larger OpenSLR Nepali ASR training dataset, which has 157,000 utterances.
- First, numerals were removed, as those utterances were inconsistent with their transcriptions.
- Segments longer than 5 seconds were also removed because of resource limitations.
- Less frequently used characters were removed to reduce the vocabulary size.
- This left 136,083 utterances in total; the filtered dataset has been uploaded [here](https://huggingface.co/datasets/iamTangsang/OpenSLR54-Nepali-ASR).
- 80% was used for training, 10% for evaluation, and 10% for testing (a filtering and splitting sketch follows this list).
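Sketched below is one way the duration filter and the 80/10/10 split could be expressed with the `datasets` library. The uploaded dataset is already filtered, so this is purely illustrative, and the split and column names are assumptions rather than the dataset's documented schema:
```python
from datasets import Audio, load_dataset

# Split and column names are assumptions about the uploaded dataset.
ds = load_dataset("iamTangsang/OpenSLR54-Nepali-ASR", split="train")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

def short_enough(example, max_seconds=5.0):
    audio = example["audio"]
    return len(audio["array"]) / audio["sampling_rate"] <= max_seconds

ds = ds.filter(short_enough)  # keep utterances of at most ~5 seconds

# 80% train, 10% evaluation, 10% test (seed chosen here only for reproducibility).
split = ds.train_test_split(test_size=0.2, seed=42)
heldout = split["test"].train_test_split(test_size=0.5, seed=42)
train_ds, eval_ds, test_ds = split["train"], heldout["train"], heldout["test"]
```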
## Training procedure
### Training on CommonVoice 17.0
The following hyperparameters were used during training:
- learning_rate: 3e-04
- train_batch_size: 16
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 400
- num_epochs: 30
- mixed_precision_training: Native AMP
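For reference, these settings map roughly onto `transformers.TrainingArguments` as sketched below; the output directory name is hypothetical and other arguments (logging, evaluation cadence, data collator, etc.) are omitted:
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="wav2vec2-xls-r-300m-ne-commonvoice17",  # hypothetical
    learning_rate=3e-4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,   # 16 * 2 = effective batch size of 32
    lr_scheduler_type="linear",
    warmup_steps=400,
    num_train_epochs=30,
    seed=42,
    fp16=True,                       # Native AMP mixed precision
)
```
The OpenSLR runs below reuse the same structure with only the learning rate, warmup steps, and epoch count changed.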
### Initial Training on OpenSLR-54 for 16 epochs
The following hyperparameters were used:
- learning_rate: 3e-04
- train_batch_size: 16
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- num_epochs: 16
- mixed_precision_training: Native AMP
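The WER/CER values tracked during these runs would typically come from a `compute_metrics` callback passed to the `Trainer`, along the lines of the standard Wav2Vec2 CTC recipe. This is a hedged sketch, with the processor repo id assumed:
```python
import numpy as np
import evaluate
from transformers import Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("iamTangsang/Wav2Vec2_XLS-R-300m_Nepali_ASR")
wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

def compute_metrics(pred):
    # Greedy CTC decoding of the model's logits.
    pred_ids = np.argmax(pred.predictions, axis=-1)

    # Labels padded with -100 are restored to the pad token so they decode cleanly.
    label_ids = pred.label_ids.copy()
    label_ids[label_ids == -100] = processor.tokenizer.pad_token_id

    pred_str = processor.batch_decode(pred_ids)
    label_str = processor.batch_decode(label_ids, group_tokens=False)

    return {
        "wer": wer_metric.compute(predictions=pred_str, references=label_str),
        "cer": cer_metric.compute(predictions=pred_str, references=label_str),
    }
```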
### Further Training on OpenSLR-54 for 3 more epochs
We used the following:
- learning_rate: 2e-5
- train_batch_size: 16
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 700
- num_epochs: 3
- mixed_precision_training: Native AMP
### Framework versions
- Transformers 4.44.2
- Pytorch 2.4.1+cu121
- Datasets 3.0.0
- Tokenizers 0.19.1