---
library_name: transformers
base_model:
- facebook/wav2vec2-xls-r-300m
tags:
- ASR
- Nepali ASR
- OpenSLR Nepali
- Nepali ASR Wav2Vec2
- XLS-R
model-index:
- name: Wav2Vec2_XLS-R-300m_Nepali_ASR
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      type: iamTangsang/OpenSLR54-Nepali-ASR
      name: OpenSLR54-Nepali-ASR
      split: test
    metrics:
    - type: wer
      value: 16.82
      name: Test WER
    - type: cer
      value: 2.72
      name: Test CER
datasets:
- iamTangsang/OpenSLR54-Nepali-ASR
- mozilla-foundation/common_voice_17_0
metrics:
- wer
- cer
pipeline_tag: automatic-speech-recognition
license: mit
language:
- ne
---
# Wav2Vec2_XLS-R-300m_Nepali_ASR
This model is a fine-tuned version of [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) on:
- [Large Nepali ASR training data set from OpenSLR (SLR-54)](https://www.openslr.org/54/)
- [Common Voice Corpus 17.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0)
## Model description
This model is the 300-million-parameter Wav2Vec2 XLS-R checkpoint fine-tuned for Nepali automatic speech recognition. The reported results are on the OpenSLR test split.
- WER on OpenSLR: 16.82%
- CER on OpenSLR: 2.72%
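As a quick illustration of how the checkpoint can be used, here is a minimal inference sketch with the `transformers` API. The repository id and audio file name are assumptions; the model expects 16 kHz mono audio.
```python
import librosa
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Repo id assumed from this model card; adjust if the checkpoint lives elsewhere.
MODEL_ID = "iamTangsang/Wav2Vec2_XLS-R-300m_Nepali_ASR"

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# Load a (hypothetical) Nepali utterance resampled to 16 kHz mono.
speech, _ = librosa.load("nepali_utterance.wav", sr=16_000, mono=True)

inputs = processor(speech, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Greedy CTC decoding.
pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids)[0])
```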
## Intended uses & limitations
- Research on Nepali ASR
- Transcriptions on Nepali audio
- Further Fine-tuning
### Limitations
- The model is trained on the OpenSLR Nepali ASR dataset which upon inspection was found to be quite noisy and inconsistent.
- Due to resource limitations, utterances longer than 5 seconds were filtered out of the dataset during training and evaluation.
- Numerals were filtered out as well.
- The vocabulary doesn't contain every Nepali character.
- The model might perform poorly on audio segments longer than 5 seconds, so longer audio needs to be split into roughly 5-second chunks (see the chunking sketch after this list), which can increase processing time.
- May struggle with background noise and overlapping speech.
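One simple way to handle the 5-second limitation is the built-in chunking of the `transformers` ASR pipeline, which splits long audio into overlapping windows and stitches the CTC output back together. This is only a sketch under an assumed repo id and file name, not the authors' own segmentation strategy:
```python
from transformers import pipeline

# Repo id assumed from this model card.
asr = pipeline(
    "automatic-speech-recognition",
    model="iamTangsang/Wav2Vec2_XLS-R-300m_Nepali_ASR",
    chunk_length_s=5,        # match the ~5 s training segments
    stride_length_s=(1, 1),  # 1 s overlap on each side to smooth chunk boundaries
)

# Hypothetical long recording; the pipeline resamples it to 16 kHz internally.
print(asr("long_nepali_recording.wav")["text"])
```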
## Training and evaluation data
### Common Voice v17.0
- This model has been fine-tuned on OpenSLR-54 (Nepali ASR training dataset) and CommonVoice Corpus v17.0
- Initially, the model was trained on [CommonVoice v17.0 ne-NP](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0/viewer/ne-NP), which consists of about 2 hours of voice data, of which 1 hour has been manually validated.
- Because the dataset is very small, we first combined the `validated` and `other` splits, giving a total of 1337 utterances.
- We preprocessed the data by removing all punctuation and symbols (a short preprocessing sketch follows this list).
- We then used 80% of the total utterances for training and 10% for evaluation.
- The `test` split, consisting of 217 utterances, was used for testing. (Some of these utterances may also be present in the training data.)
- The model was trained for 30 epochs, after which the WER fluctuated between 37% and 39%.
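The punctuation and symbol cleanup could look roughly like the following. The exact character set the authors kept is not specified, so treating everything outside the Devanagari block and whitespace as removable, and stripping numerals, are assumptions:
```python
import re

# Drop anything outside the Devanagari block (U+0900–U+097F) and whitespace,
# plus the Devanagari danda/double danda punctuation marks.
_CLEAN_RE = re.compile(r"[^\u0900-\u097F\s]|[\u0964\u0965]")
# Western and Devanagari digits; whether digits are stripped at this stage is an assumption.
_DIGIT_RE = re.compile(r"[\u0966-\u096F0-9]")

def normalize_transcript(text: str) -> str:
    text = _DIGIT_RE.sub("", text)             # remove numerals
    text = _CLEAN_RE.sub("", text)             # remove punctuation and symbols
    return re.sub(r"\s+", " ", text).strip()   # collapse whitespace

print(normalize_transcript("नमस्ते, संसार! १२३"))  # -> "नमस्ते संसार"
```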
### OpenSLR Nepali ASR training data
- The model was then further trained on the larger OpenSLR Nepali ASR training dataset, which has 157,000 utterances.
- First, numerals were removed, as those utterances were inconsistent with their transcriptions.
- Segments longer than 5 seconds were also removed because of resource limitations.
- Less frequently used characters were removed to reduce the vocabulary size.
- This left 136,083 utterances in total; the filtered dataset has been uploaded [here](https://huggingface.co/datasets/iamTangsang/OpenSLR54-Nepali-ASR).
- 80% was used for training, 10% for evaluation, and 10% for testing (a filtering and splitting sketch follows this list).
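Sketched below is one way the duration filter and the 80/10/10 split could be expressed with the `datasets` library. The uploaded dataset is already filtered, so this is purely illustrative, and the split and column names are assumptions rather than the dataset's documented schema:
```python
from datasets import Audio, load_dataset

# Split and column names are assumptions about the uploaded dataset.
ds = load_dataset("iamTangsang/OpenSLR54-Nepali-ASR", split="train")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

def short_enough(example, max_seconds=5.0):
    audio = example["audio"]
    return len(audio["array"]) / audio["sampling_rate"] <= max_seconds

ds = ds.filter(short_enough)  # keep utterances of at most ~5 seconds

# 80% train, 10% evaluation, 10% test (seed chosen here only for reproducibility).
split = ds.train_test_split(test_size=0.2, seed=42)
heldout = split["test"].train_test_split(test_size=0.5, seed=42)
train_ds, eval_ds, test_ds = split["train"], heldout["train"], heldout["test"]
```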
## Training procedure
### Training on CommonVoice 17.0
The following hyperparameters were used during training:
- learning_rate: 3e-04
- train_batch_size: 16
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 400
- num_epochs: 30
- mixed_precision_training: Native AMP
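For reference, these settings map roughly onto `transformers.TrainingArguments` as sketched below; the output directory name is hypothetical and other arguments (logging, evaluation cadence, data collator, etc.) are omitted:
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="wav2vec2-xls-r-300m-ne-commonvoice17",  # hypothetical
    learning_rate=3e-4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,   # 16 * 2 = effective batch size of 32
    lr_scheduler_type="linear",
    warmup_steps=400,
    num_train_epochs=30,
    seed=42,
    fp16=True,                       # Native AMP mixed precision
)
```
The OpenSLR runs below reuse the same structure with only the learning rate, warmup steps, and epoch count changed.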
### Initial Training on OpenSLR-54 for 16 epochs
The following hyperparameters were used:
- learning_rate: 3e-04
- train_batch_size: 16
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- num_epochs: 16
- mixed_precision_training: Native AMP
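The WER/CER values tracked during these runs would typically come from a `compute_metrics` callback passed to the `Trainer`, along the lines of the standard Wav2Vec2 CTC recipe. This is a hedged sketch, with the processor repo id assumed:
```python
import numpy as np
import evaluate
from transformers import Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("iamTangsang/Wav2Vec2_XLS-R-300m_Nepali_ASR")
wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

def compute_metrics(pred):
    # Greedy CTC decoding of the model's logits.
    pred_ids = np.argmax(pred.predictions, axis=-1)

    # Labels padded with -100 are restored to the pad token so they decode cleanly.
    label_ids = pred.label_ids.copy()
    label_ids[label_ids == -100] = processor.tokenizer.pad_token_id

    pred_str = processor.batch_decode(pred_ids)
    label_str = processor.batch_decode(label_ids, group_tokens=False)

    return {
        "wer": wer_metric.compute(predictions=pred_str, references=label_str),
        "cer": cer_metric.compute(predictions=pred_str, references=label_str),
    }
```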
### Further Training on OpenSLR-54 for 3 more epochs
We used the following:
- learning_rate: 2e-5
- train_batch_size: 16
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 700
- num_epochs: 3
- mixed_precision_training: Native AMP
### Framework versions
- Transformers 4.44.2
- Pytorch 2.4.1+cu121
- Datasets 3.0.0
- Tokenizers 0.19.1