poonehmousavi commited on
Commit
3527ba3
1 Parent(s): e82a61a

Upload 3 files

Browse files
Files changed (3) hide show
  1. README.md +132 -0
  2. config.json +3 -0
  3. hyperparams.yaml +77 -0
README.md ADDED
@@ -0,0 +1,132 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ language:
3
+ - fa
4
+ thumbnail: null
5
+ pipeline_tag: automatic-speech-recognition
6
+ tags:
7
+ - whisper
8
+ - pytorch
9
+ - speechbrain
10
+ - Transformer
11
+ - hf-asr-leaderboard
12
+ license: apache-2.0
13
+ datasets:
14
+ - commonvoice
15
+ metrics:
16
+ - wer
17
+ - cer
18
+ model-index:
19
+ - name: asr-whisper-large-v2-commonvoice-ar
20
+ results:
21
+ - task:
22
+ name: Automatic Speech Recognition
23
+ type: automatic-speech-recognition
24
+ dataset:
25
+ name: CommonVoice 10.0 (Persian)
26
+ type: mozilla-foundation/common_voice_10_0
27
+ config: fa
28
+ split: test
29
+ args:
30
+ language: fa
31
+ metrics:
32
+ - name: Test WER
33
+ type: wer
34
+ value: '31.75'
35
+ ---
36
+
37
+ <iframe src="https://ghbtns.com/github-btn.html?user=speechbrain&repo=speechbrain&type=star&count=true&size=large&v=2" frameborder="0" scrolling="0" width="170" height="30" title="GitHub"></iframe>
38
+ <br/><br/>
39
+
40
+ # whisper large-v2 fine-tuned on CommonVoice Persian
41
+
42
+ This repository provides all the necessary tools to perform automatic speech
43
+ recognition from an end-to-end whisper model fine-tuned on CommonVoice (Persian Language) within
44
+ SpeechBrain. For a better experience, we encourage you to learn more about
45
+ [SpeechBrain](https://speechbrain.github.io).
46
+
47
+ The performance of the model is the following:
48
+
49
+ | Release | Test CER | Test WER | GPUs |
50
+ |:-------------:|:--------------:|:--------------:| :--------:|
51
+ | 01-02-23 | 9.38 | 31.75 | 1xV100 16GB |
52
+
53
+ ## Pipeline description
54
+
55
+ This ASR system is composed of whisper encoder-decoder blocks:
56
+ - The pretrained whisper-large-v2 encoder is frozen.
57
+ - The pretrained Whisper tokenizer is used.
58
+ - A pretrained Whisper-large-v2 decoder ([openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2)) is finetuned on CommonVoice Fa.
59
+ The obtained final acoustic representation is given to the greedy decoder.
60
+
61
+ The system is trained with recordings sampled at 16kHz (single channel).
62
+ The code will automatically normalize your audio (i.e., resampling + mono channel selection) when calling *transcribe_file* if needed.
63
+
64
+ ## Install SpeechBrain
65
+
66
+ First of all, please install tranformers and SpeechBrain with the following command:
67
+
68
+ ```
69
+ pip install speechbrain transformers
70
+ ```
71
+
72
+ Please notice that we encourage you to read our tutorials and learn more about
73
+ [SpeechBrain](https://speechbrain.github.io).
74
+
75
+ ### Transcribing your own audio files (in Persian)
76
+
77
+ ```python
78
+
79
+ from speechbrain.pretrained import WhisperASR
80
+
81
+ asr_model = WhisperASR.from_hparams(source="speechbrain/asr-whisper-large-v2-commonvoice-fa", savedir="retrained_models/asr-whisper-large-v2-commonvoice-fa")
82
+ asr_model.transcribe_file("speechbrain/asr-whisper-large-v2-commonvoice-ar/example-fa.mp3")
83
+
84
+
85
+ ```
86
+ ### Inference on GPU
87
+ To perform inference on the GPU, add `run_opts={"device":"cuda"}` when calling the `from_hparams` method.
88
+
89
+ ### Training
90
+ The model was trained with SpeechBrain.
91
+ To train it from scratch follow these steps:
92
+ 1. Clone SpeechBrain:
93
+ ```bash
94
+ git clone https://github.com/speechbrain/speechbrain/
95
+ ```
96
+ 2. Install it:
97
+ ```bash
98
+ cd speechbrain
99
+ pip install -r requirements.txt
100
+ pip install -e .
101
+ ```
102
+
103
+ 3. Run Training:
104
+ ```bash
105
+ cd recipes/CommonVoice/ASR/transformer/
106
+ python train_with_whisper.py hparams/train_fa_hf_whisper.yaml --data_folder=your_data_folder
107
+ ```
108
+
109
+ You can find our training results (models, logs, etc) [here](https://drive.google.com/drive/folders/1nzMMYmB5SxMKsFUk-rM9_ijcqzia8pX7).
110
+
111
+ ### Limitations
112
+ The SpeechBrain team does not provide any warranty on the performance achieved by this model when used on other datasets.
113
+
114
+ #### Referencing SpeechBrain
115
+
116
+ ```
117
+ @misc{SB2021,
118
+ author = {Ravanelli, Mirco and Parcollet, Titouan and Rouhe, Aku and Plantinga, Peter and Rastorgueva, Elena and Lugosch, Loren and Dawalatabad, Nauman and Ju-Chieh, Chou and Heba, Abdel and Grondin, Francois and Aris, William and Liao, Chien-Feng and Cornell, Samuele and Yeh, Sung-Lin and Na, Hwidong and Gao, Yan and Fu, Szu-Wei and Subakan, Cem and De Mori, Renato and Bengio, Yoshua },
119
+ title = {SpeechBrain},
120
+ year = {2021},
121
+ publisher = {GitHub},
122
+ journal = {GitHub repository},
123
+ howpublished = {\\\\url{https://github.com/speechbrain/speechbrain}},
124
+ }
125
+ ```
126
+
127
+ #### About SpeechBrain
128
+ SpeechBrain is an open-source and all-in-one speech toolkit. It is designed to be simple, extremely flexible, and user-friendly. Competitive or state-of-the-art performance is obtained in various domains.
129
+
130
+ Website: https://speechbrain.github.io/
131
+
132
+ GitHub: https://github.com/speechbrain/speechbrain
config.json ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ {
2
+ "speechbrain_interface": "WhisperASR"
3
+ }
hyperparams.yaml ADDED
@@ -0,0 +1,77 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # ################################
2
+ # Model: Whisper (Encoder-Decoder) + NLL
3
+ # Augmentation: TimeDomainSpecAugment
4
+ # Authors: Pooneh Mousavi 2022
5
+ # ################################
6
+
7
+
8
+ # URL for the biggest Fairseq english whisper model.
9
+ whisper_hub: openai/whisper-large-v2
10
+
11
+ # Normalize inputs with
12
+ # the same normalization done in the paper. Refer to Appendix C for further information.
13
+ normalized_transcripts: True
14
+
15
+
16
+ language: persian
17
+
18
+ auto_mix_prec: False
19
+ sample_rate: 16000
20
+
21
+ # These values are only used for the searchers.
22
+ # They needs to be hardcoded and should not be changed with Whisper.
23
+ # They are used as part of the searching process.
24
+ # The bos token of the searcher will be timestamp_index
25
+ # and will be concatenated with the bos, language and task tokens.
26
+ timestamp_index: 50363
27
+ eos_index: 50257
28
+ bos_index: 50258
29
+
30
+ # Decoding parameters
31
+ min_decode_ratio: 0.0
32
+ max_decode_ratio: 0.1
33
+ test_beam_size: 8
34
+
35
+ # Model parameters
36
+ freeze_whisper: True
37
+ freeze_encoder: True
38
+
39
+
40
+
41
+ whisper: !new:speechbrain.lobes.models.huggingface_whisper.HuggingFaceWhisper
42
+ source: !ref <whisper_hub>
43
+ freeze: !ref <freeze_whisper>
44
+ freeze_encoder: !ref <freeze_encoder>
45
+ save_path: whisper_checkpoints
46
+ encoder_only: False
47
+
48
+
49
+
50
+ decoder: !new:speechbrain.decoders.seq2seq.S2SWhisperGreedySearch
51
+ model: !ref <whisper>
52
+ bos_index: !ref <timestamp_index>
53
+ eos_index: !ref <eos_index>
54
+ min_decode_ratio: !ref <min_decode_ratio>
55
+ max_decode_ratio: !ref <max_decode_ratio>
56
+
57
+ # test_beam_searcher: !new:speechbrain.decoders.seq2seq.S2SWhisperBeamSearch
58
+ # module: [!ref <whisper>]
59
+ # bos_index: !ref <timestamp_index>
60
+ # eos_index: !ref <eos_index>
61
+ # min_decode_ratio: !ref <min_decode_ratio>
62
+ # max_decode_ratio: !ref <max_decode_ratio>
63
+ # beam_size: !ref <test_beam_size>
64
+
65
+
66
+
67
+
68
+
69
+ modules:
70
+ whisper: !ref <whisper>
71
+ decoder: !ref <decoder>
72
+
73
+
74
+ pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
75
+ loadables:
76
+ whisper: !ref <whisper>
77
+