---
license: mit
tags:
- audio
- automatic-speech-recognition
widget:
- example_title: sample 1
  src: https://huggingface.co/bangla-speech-processing/BanglaASR/resolve/main/mp3/common_voice_bn_31515636.mp3
- example_title: sample 2
  src: https://huggingface.co/bangla-speech-processing/BanglaASR/resolve/main/mp3/common_voice_bn_31549899.mp3
- example_title: sample 3
  src: https://huggingface.co/bangla-speech-processing/BanglaASR/resolve/main/mp3/common_voice_bn_31617644.mp3
pipeline_tag: automatic-speech-recognition
---

A Bangla ASR model obtained by fine-tuning Whisper small (244 M parameters) on the Bangla subset of the Mozilla Common Voice dataset. Training used around 400 hours of audio, split into 40k training and 7k validation samples. After 12,000 training steps, the model reaches a word error rate (WER) of 4.58%.

```py
import librosa
import torch
import torchaudio
import numpy as np

from transformers import WhisperProcessor
from transformers import WhisperFeatureExtractor
from transformers import WhisperForConditionalGeneration

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Local path to an MP3 file; the sample used here can be downloaded from
# https://huggingface.co/bangla-speech-processing/BanglaASR/resolve/main/mp3/common_voice_bn_31515636.mp3
mp3_path = "common_voice_bn_31515636.mp3"
model_path = "bangla-speech-processing/BanglaASR"

feature_extractor = WhisperFeatureExtractor.from_pretrained(model_path)
processor = WhisperProcessor.from_pretrained(model_path)
model = WhisperForConditionalGeneration.from_pretrained(model_path).to(device)

# Load the audio and resample it to the 16 kHz rate Whisper expects
speech_array, sampling_rate = torchaudio.load(mp3_path, format="mp3")
speech_array = speech_array[0].numpy()
speech_array = librosa.resample(np.asarray(speech_array), orig_sr=sampling_rate, target_sr=16000)

# Convert the waveform into log-Mel input features
input_features = feature_extractor(speech_array, sampling_rate=16000, return_tensors="pt").input_features

# Generate token ids and decode them to text
predicted_ids = model.generate(input_features.to(device))[0]
transcription = processor.decode(predicted_ids, skip_special_tokens=True)

print(transcription)
```

# Dataset

The model uses the Bangla Mozilla Common Voice dataset: around 400 hours of audio in total, split into 40k training and 7k validation MP3 samples. For more information about the dataset, please [click here](https://commonvoice.mozilla.org/bn/datasets).

# Training Model Information

| Size   | Layers | Width | Heads | Parameters | Bangla-only | Training Status |
| ------ | ------ | ----- | ----- | ---------- | ----------- | --------------- |
| tiny   | 4      | 384   | 6     | 39 M       | X           | X               |
| base   | 6      | 512   | 8     | 74 M       | X           | X               |
| small  | 12     | 768   | 12    | 244 M      | ✓           | ✓               |
| medium | 24     | 1024  | 16    | 769 M      | X           | X               |
| large  | 32     | 1280  | 20    | 1550 M     | X           | X               |

# Evaluation

Word error rate: 4.58% (a minimal evaluation sketch follows the citation below). For more details, please check the [github](https://github.com/saiful9379/BanglaASR/tree/main).

```
@misc{BanglaASR,
  title={Transformer Based Whisper Bangla ASR Model},
  author={Md Saiful Islam},
  howpublished={},
  year={2023}
}
```
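For reference, the sketch below shows one way the WER above could be reproduced using the Hugging Face `evaluate` library and the ASR pipeline. This is not the authors' actual evaluation script; the file names and reference transcripts are hypothetical placeholders standing in for the Common Voice Bangla validation split.

```py
# Minimal WER-evaluation sketch. NOT the script behind the reported 4.58%
# figure; file paths and reference transcripts are placeholders.
import evaluate
from transformers import pipeline

wer_metric = evaluate.load("wer")  # word error rate metric

asr = pipeline(
    "automatic-speech-recognition",
    model="bangla-speech-processing/BanglaASR",
)

# Hypothetical (local mp3 path, reference transcript) pairs drawn from the
# Common Voice Bangla validation split.
samples = [
    ("common_voice_bn_31515636.mp3", "<reference transcript 1>"),
    ("common_voice_bn_31549899.mp3", "<reference transcript 2>"),
]

predictions = [asr(path)["text"] for path, _ in samples]
references = [text for _, text in samples]

wer = wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {100 * wer:.2f}%")
```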