---
license: mit
tags:
- audio
- automatic-speech-recognition
widget:
- example_title: sample 1
  src: https://huggingface.co/bangla-speech-processing/BanglaASR/resolve/main/mp3/common_voice_bn_31515636.mp3
- example_title: sample 2
  src: https://huggingface.co/bangla-speech-processing/BanglaASR/resolve/main/mp3/common_voice_bn_31549899.mp3
- example_title: sample 3
  src: https://huggingface.co/bangla-speech-processing/BanglaASR/resolve/main/mp3/common_voice_bn_31617644.mp3
pipeline_tag: automatic-speech-recognition
---

BanglaASR is a Bangla automatic speech recognition (ASR) model, fine-tuned from the Whisper small variant (244 M parameters) on the Bangla Mozilla Common Voice dataset. Training used about 40k training samples and 7k validation samples, drawn from around 400 hours of data. After 12,000 training steps the model reached a word error rate (WER) of 4.58%.

```py
import librosa
import numpy as np
import torch
import torchaudio
from transformers import (
    WhisperFeatureExtractor,
    WhisperForConditionalGeneration,
    WhisperProcessor,
)

# Run on GPU when available, otherwise fall back to CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Loading straight from a URL depends on the torchaudio version and backend;
# if it fails, download the file and pass the local path instead.
mp3_path = "https://huggingface.co/bangla-speech-processing/BanglaASR/resolve/main/mp3/common_voice_bn_31515636.mp3"
model_path = "bangla-speech-processing/BanglaASR"

feature_extractor = WhisperFeatureExtractor.from_pretrained(model_path)
processor = WhisperProcessor.from_pretrained(model_path)
model = WhisperForConditionalGeneration.from_pretrained(model_path).to(device)

# Load the audio, keep the first channel, and resample to the 16 kHz Whisper expects.
speech_array, sampling_rate = torchaudio.load(mp3_path, format="mp3")
speech_array = speech_array[0].numpy()
speech_array = librosa.resample(np.asarray(speech_array), orig_sr=sampling_rate, target_sr=16000)
input_features = feature_extractor(speech_array, sampling_rate=16000, return_tensors="pt").input_features

# Generate token ids for the single clip and decode them to text.
predicted_ids = model.generate(inputs=input_features.to(device))[0]
transcription = processor.decode(predicted_ids, skip_special_tokens=True)

print(transcription)
```
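
The example above transcribes one short clip. Whisper's feature extractor works on 30-second windows, so a longer recording has to be split before transcription. The following is a minimal sketch of one way to compute the split points, assuming 16 kHz mono audio; `chunk_bounds` is a hypothetical helper written for illustration and is not part of this repository or of `transformers`.

```python
def chunk_bounds(num_samples, sample_rate=16000, chunk_seconds=30.0):
    """Return (start, end) sample indices that cover a clip in consecutive windows.

    Hypothetical helper: windows do not overlap, and the last one may be shorter.
    """
    chunk_size = int(sample_rate * chunk_seconds)
    if num_samples < 0 or chunk_size < 1:
        raise ValueError("num_samples must be >= 0 and the window at least one sample long")
    return [
        (start, min(start + chunk_size, num_samples))
        for start in range(0, num_samples, chunk_size)
    ]


# A 70-second clip at 16 kHz has 1,120,000 samples and needs three windows.
print(chunk_bounds(70 * 16000))
# [(0, 480000), (480000, 960000), (960000, 1120000)]
```

Each slice `speech_array[start:end]` can then go through the same feature extraction, `generate`, and `decode` steps as the single clip, and the partial transcriptions can be joined in order. Splitting at fixed offsets may cut a word in half at a boundary, so treat this as a starting point rather than a tuned long-form pipeline.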

# Dataset

The model was trained on the Bangla portion of the Mozilla Common Voice dataset: around 400 hours of mp3 audio, split into about 40k training samples and 7k validation samples.

For more information about the dataset, please [click here](https://commonvoice.mozilla.org/bn/datasets).

# Training Model Information

| Size | Layers | Width | Heads | Parameters | Bangla-only | Training Status |
| ------------- | ------------- | -------- | -------- | ------------- | ------------- | -------- |
| tiny | 4 | 384 | 6 | 39 M | X | X |
| base | 6 | 512 | 8 | 74 M | X | X |
| small | 12 | 768 | 12 | 244 M | ✓ | ✓ |
| medium | 24 | 1024 | 16 | 769 M | X | X |
| large | 32 | 1280 | 20 | 1550 M | X | X |

# Evaluation

Word error rate (WER): 4.58%

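For readers unfamiliar with the metric, WER is the number of word-level substitutions, deletions, and insertions needed to turn the model output into the reference transcript, divided by the number of words in the reference. The sketch below implements that standard definition in plain Python. This card does not state which evaluation script or text normalization produced the 4.58% figure, so the function `word_error_rate` and its whitespace tokenization are assumptions for illustration only.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words.

    Assumption: words are separated by whitespace and compared exactly,
    with no punctuation or Unicode normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One wrong word out of four reference words.
print(word_error_rate("a b c d", "a x c d"))  # 0.25
```

A corpus-level score is usually computed by summing the edit counts and the reference lengths over all utterances before dividing, not by averaging per-utterance rates.
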

For more details, please see the [GitHub repository](https://github.com/saiful9379/BanglaASR/tree/main).

```
@misc{BanglaASR,
  title={Transformer Based Whisper Bangla ASR Model},
  author={Md Saiful Islam},
  howpublished={},
  year={2023}
}
```