---
license: mit
tags:
- audio
- automatic-speech-recognition
widget:
- example_title: sample 1
  src: https://huggingface.co/bangla-speech-processing/BanglaASR/resolve/main/mp3/common_voice_bn_31515636.mp3
- example_title: sample 2
  src: https://huggingface.co/bangla-speech-processing/BanglaASR/resolve/main/mp3/common_voice_bn_31549899.mp3
- example_title: sample 3
  src: https://huggingface.co/bangla-speech-processing/BanglaASR/resolve/main/mp3/common_voice_bn_31617644.mp3
pipeline_tag: automatic-speech-recognition
---

BanglaASR is a Bangla automatic speech recognition model, obtained by fine-tuning Whisper on the Bangla portion of the Mozilla Common Voice dataset.
Training used about 40k training samples and 7k validation samples, drawn from around 400 hours of audio. The model was trained for 12,000 steps and reaches a word
error rate (WER) of 4.58%. It is based on the Whisper small variant (244 M parameters).


```py
import librosa
import numpy as np
import torch
import torchaudio
from transformers import (
    WhisperFeatureExtractor,
    WhisperForConditionalGeneration,
    WhisperProcessor,
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Sample clip from the model repository. If your torchaudio backend cannot
# read from a URL, download the file first and pass the local path instead.
mp3_path = "https://huggingface.co/bangla-speech-processing/BanglaASR/resolve/main/mp3/common_voice_bn_31515636.mp3"
model_path = "bangla-speech-processing/BanglaASR"

# Load the feature extractor, the processor (used for decoding), and the model.
feature_extractor = WhisperFeatureExtractor.from_pretrained(model_path)
processor = WhisperProcessor.from_pretrained(model_path)
model = WhisperForConditionalGeneration.from_pretrained(model_path).to(device)

# Load the audio, keep the first channel, and resample to the 16 kHz Whisper expects.
speech_array, sampling_rate = torchaudio.load(mp3_path, format="mp3")
speech_array = speech_array[0].numpy()
speech_array = librosa.resample(np.asarray(speech_array), orig_sr=sampling_rate, target_sr=16000)

# Convert the waveform to log-Mel input features.
input_features = feature_extractor(speech_array, sampling_rate=16000, return_tensors="pt").input_features

# Generate token ids for the single input and decode them to text.
predicted_ids = model.generate(inputs=input_features.to(device))[0]
transcription = processor.decode(predicted_ids, skip_special_tokens=True)

print(transcription)
```
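
Whisper works on 30-second windows, and the feature extractor pads or truncates its input to that length, so the snippet above only covers the first 30 seconds of a longer recording. One simple workaround, sketched below, is to split the 16 kHz waveform into fixed-size chunks and transcribe each chunk separately. The helper `split_into_chunks` is a hypothetical name and not part of this repository or of `transformers`. Fixed boundaries can also cut a word in half, so treat this as a starting point rather than a tuned solution.

```python
import numpy as np


def split_into_chunks(audio, sampling_rate=16000, chunk_seconds=30.0):
    """Split a 1-D waveform into consecutive chunks of at most chunk_seconds.

    Hypothetical helper for illustration: the last chunk may be shorter than
    the others, and no overlap is used between chunks.
    """
    chunk_size = int(sampling_rate * chunk_seconds)
    return [audio[start:start + chunk_size] for start in range(0, len(audio), chunk_size)]


# A 70-second silent waveform at 16 kHz stands in for a real recording.
long_audio = np.zeros(16000 * 70, dtype=np.float32)
chunks = split_into_chunks(long_audio)
print([len(chunk) / 16000 for chunk in chunks])
```

Each chunk can then go through the same `feature_extractor`, `model.generate`, and `processor.decode` calls as the single clip above, and the partial transcriptions can be joined with spaces.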


# Dataset
The model was trained on the Bangla Mozilla Common Voice dataset: around 400 hours of audio, split into mp3 samples for training (40k) and validation (7k).
For more information about the dataset, see the [Common Voice Bangla page](https://commonvoice.mozilla.org/bn/datasets).

# Training Model Information


| Size | Layers | Width | Heads | Parameters | Bangla-only | Training Status |
| ------------- | ------------- | -------- | -------- | ------------- | ------------- | -------- |
| tiny | 4 | 384 | 6 | 39 M | X | X |
| base | 6 | 512 | 8 | 74 M | X | X |
| small | 12 | 768 | 12 | 244 M | ✓ | ✓ |
| medium | 24 | 1024 | 16 | 769 M | X | X |
| large | 32 | 1280 | 20 | 1550 M | X | X |

# Evaluation

Word error rate (WER): 4.58%
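
This card does not describe how the figure was computed. For reference, WER is conventionally defined as the word-level edit distance (substitutions, deletions, and insertions) between the reference transcript and the model output, divided by the number of reference words. The sketch below implements that standard definition in plain Python. The function name `word_error_rate` and the example sentences are illustrative assumptions, not taken from the evaluation set.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row
    return previous_row[-1] / len(ref_words)


# One substituted word out of four reference words.
wer = word_error_rate("আমি বাংলায় গান গাই", "আমি বাংলা গান গাই")
print(f"WER: {wer:.2%}")
```

A corpus-level WER is usually computed by summing the edit distances over all utterances and dividing by the total number of reference words, rather than by averaging per-utterance scores.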

For more details, see the [GitHub repository](https://github.com/saiful9379/BanglaASR/tree/main).

```
@misc{BanglaASR,
  title={Transformer Based Whisper Bangla ASR Model},
  author={Md Saiful Islam},
  howpublished={},
  year={2023}
}
```