saiful9379 committed fb730d7 (1 parent: b3c405c)

update readme

Files changed (1): README.md (+78 -1)
README.md CHANGED
@@ -11,4 +11,81 @@ widget:
  - example_title: sample 3
    src: https://huggingface.co/bangla-speech-processing/BanglaASR/resolve/main/mp3/common_voice_bn_31617644.mp3
pipeline_tag: automatic-speech-recognition
---

Bangla ASR [Whisper BanglaASR] is a Whisper model fine-tuned on the Bangla Mozilla Common Voice dataset.
Training used roughly 400 hours of audio, split into about 40k training and 7k validation samples. The model was trained for 12,000 steps and reached a word error rate (WER) of 4.58%.

```py
import librosa
import torch
import torchaudio
import numpy as np

from transformers import WhisperTokenizer
from transformers import WhisperProcessor
from transformers import WhisperFeatureExtractor
from transformers import WhisperForConditionalGeneration

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Sample mp3 from this repository; download it locally if your torchaudio backend cannot read URLs.
mp3_path = "https://huggingface.co/bangla-speech-processing/BanglaASR/resolve/main/mp3/common_voice_bn_31515636.mp3"

model_path = "bangla-speech-processing/BanglaASR"

# Load the fine-tuned Whisper checkpoint together with its feature extractor, tokenizer and processor.
feature_extractor = WhisperFeatureExtractor.from_pretrained(model_path)
tokenizer = WhisperTokenizer.from_pretrained(model_path)
processor = WhisperProcessor.from_pretrained(model_path)
model = WhisperForConditionalGeneration.from_pretrained(model_path).to(device)

# Read the audio, keep the first channel, and resample to the 16 kHz rate Whisper expects.
speech_array, sampling_rate = torchaudio.load(mp3_path, format="mp3")
speech_array = speech_array[0].numpy()
speech_array = librosa.resample(np.asarray(speech_array), orig_sr=sampling_rate, target_sr=16000)

# Convert the waveform to log-Mel input features and generate token ids.
input_features = feature_extractor(speech_array, sampling_rate=16000, return_tensors="pt").input_features
predicted_ids = model.generate(input_features.to(device))[0]

# Decode the predicted token ids into Bangla text.
transcription = processor.decode(predicted_ids, skip_special_tokens=True)

print(transcription)
```
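
If you just want a transcription without the manual preprocessing above, the high-level `pipeline` API is a shorter alternative. This is a minimal sketch, not part of the original card: it assumes a local file `sample.mp3` and that ffmpeg is available for audio decoding.

```py
from transformers import pipeline

# Hypothetical local file; any Bangla mp3/wav will do.
asr = pipeline(
    "automatic-speech-recognition",
    model="bangla-speech-processing/BanglaASR",
    device=0,  # set device=-1 to run on CPU
)

print(asr("sample.mp3")["text"])
```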

# Dataset

The model uses the Mozilla Common Voice Bangla dataset: roughly 400 hours of audio in total, with about 40k training and 7k validation mp3 samples.
For more information about the dataset, please [click here](https://commonvoice.mozilla.org/bn/datasets).
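
The card does not state which Common Voice release was used. As an illustration only, the sketch below loads the Bangla ("bn") subset of Common Voice 13.0 from the Hugging Face Hub and resamples it to the 16 kHz rate Whisper expects; the dataset id and version are assumptions, and the dataset is gated, so you must accept its terms on the Hub first.

```py
from datasets import load_dataset, Audio

# Assumed dataset id/version; the exact Common Voice release used for training is not stated here.
common_voice = load_dataset("mozilla-foundation/common_voice_13_0", "bn", split="validation")

# Resample the audio column to 16 kHz for Whisper.
common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16_000))

sample = common_voice[0]
print(sample["sentence"])               # reference transcription
print(sample["audio"]["array"].shape)   # resampled waveform
```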

# Training Model Information

| Size   | Layers | Width | Heads | Parameters | Bangla-only | Training Status |
| ------ | ------ | ----- | ----- | ---------- | ----------- | --------------- |
| tiny   | 4      | 384   | 6     | 39 M       | X           | X               |
| base   | 6      | 512   | 8     | 74 M       | X           | X               |
| small  | 12     | 768   | 12    | 244 M      | ✓           | ✓               |
| medium | 24     | 1024  | 16    | 769 M      | X           | X               |
| large  | 32     | 1280  | 20    | 1550 M     | X           | X               |
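
Per the table, the released checkpoint was fine-tuned from the multilingual whisper-small model (244 M parameters). The snippet below is a configuration sketch of how such a fine-tune is typically set up with `transformers`; apart from the 12,000-step budget, the hyperparameters shown are illustrative assumptions, not the values actually used (see the linked GitHub repository for the real training code).

```py
from transformers import (
    WhisperProcessor,
    WhisperForConditionalGeneration,
    Seq2SeqTrainingArguments,
)

# Start from the multilingual small checkpoint and force Bangla transcription.
processor = WhisperProcessor.from_pretrained(
    "openai/whisper-small", language="bengali", task="transcribe"
)
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(
    language="bengali", task="transcribe"
)

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-small-bn",     # hypothetical output directory
    per_device_train_batch_size=16,      # assumption, not from the card
    learning_rate=1e-5,                  # assumption, not from the card
    warmup_steps=500,                    # assumption, not from the card
    max_steps=12000,                     # from the card: trained for 12,000 steps
    fp16=True,                           # assumption, not from the card
    predict_with_generate=True,
)
```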

# Evaluation

Word Error Rate (WER): 4.58%

For more details, please check the [GitHub repository](https://github.com/saiful9379/BanglaASR/tree/main).
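
For reference, a WER figure like this is typically computed with the `evaluate` library (backed by `jiwer`); the sketch below shows the calculation on placeholder strings rather than the actual validation set.

```py
import evaluate

# WER = (substitutions + insertions + deletions) / number of reference words.
wer_metric = evaluate.load("wer")

# Placeholder Bangla sentences purely for illustration; the reported 4.58% comes from
# the Common Voice validation transcripts versus the model's generated transcriptions.
references = ["আমি বাংলায় কথা বলি"]
predictions = ["আমি বাংলায় কথা বলি"]

wer = wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {100 * wer:.2f}%")
```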

```
@misc{BanglaASR,
  title={Transformer Based Whisper Bangla ASR Model},
  author={Md Saiful Islam},
  howpublished={},
  year={2023}
}
```