---
license: mit
datasets:
- mozilla-foundation/common_voice_16_1
language:
- es
library_name: transformers
pipeline_tag: automatic-speech-recognition
tags:
- spanish
- speech
- recognition
- whisper
- distil-whisper
---

# distil-whisper-large-v3-es
This is the repository for a distilled version of the [Whisper large-v3 model](https://huggingface.co/openai/whisper-large-v3), trained on the [Mozilla Common Voice dataset v16.1](https://huggingface.co/datasets/mozilla-foundation/common_voice_16_1).

## Usage

Distil-Whisper is supported in Hugging Face 🤗 Transformers from version 4.35 onwards. To run the model, first
install the latest version of the Transformers library. For this example, we'll also install 🤗 Datasets to load a toy
audio dataset from the Hugging Face Hub:

```bash
pip install --upgrade pip
pip install --upgrade transformers accelerate datasets[audio]
```

### Short-Form Transcription

The model can be used with the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline)
class to transcribe short-form audio files (< 30 seconds) as follows:

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "marianbasti/distil-whisper-large-v3-es"

# load the distilled model and its processor
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
)

# load a toy audio sample from the Hugging Face Hub
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
```

To transcribe a local audio file, simply pass the path to your audio file when you call the pipeline:
```diff
- result = pipe(sample)
+ result = pipe("audio.mp3")
```

### Long-Form Transcription

Distil-Whisper uses a chunked algorithm to transcribe long-form audio files (> 30 seconds). In practice, this chunked long-form algorithm
is 9x faster than the sequential algorithm proposed by OpenAI in the Whisper paper (see Table 7 of the [Distil-Whisper paper](https://arxiv.org/abs/2311.00430)).

To enable chunking, pass the `chunk_length_s` parameter to the `pipeline`. For Distil-Whisper, a chunk length of 15 seconds
is optimal. To activate batching, pass the argument `batch_size`:

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "marianbasti/distil-whisper-large-v3-es"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=15,  # split long audio into 15-second chunks
    batch_size=16,      # transcribe chunks in batches for higher throughput
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
```

<!---
**Tip:** The pipeline can also be used to transcribe an audio file from a remote URL, for example:

```python
result = pipe("https://huggingface.co/datasets/sanchit-gandhi/librispeech_long/resolve/main/audio.wav")
```
--->

### Speculative Decoding

Distil-Whisper can be used as an assistant model to Whisper for [speculative decoding](https://huggingface.co/blog/whisper-speculative-decoding).
Speculative decoding mathematically guarantees the exact same outputs as Whisper while being 2 times faster,
making it a perfect drop-in replacement for existing Whisper pipelines.

In the following code snippet, we load the Distil-Whisper assistant model standalone, alongside the main Whisper model for the pipeline. We then
pass it as the `assistant_model` for generation:

```python
from transformers import pipeline, AutoModelForCausalLM, AutoModelForSpeechSeq2Seq, AutoProcessor
import torch
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# the distilled model acts as the assistant (draft) model
assistant_model_id = "marianbasti/distil-whisper-large-v3-es"

assistant_model = AutoModelForCausalLM.from_pretrained(
    assistant_model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
assistant_model.to(device)

# the full Whisper large-v3 model (the teacher) verifies the assistant's predictions
model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    generate_kwargs={"assistant_model": assistant_model},
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
```
## Training

The model was trained for 40,000 optimisation steps (roughly four epochs) with the following training parameters:
```
--teacher_model_name_or_path "openai/whisper-large-v3"
--train_dataset_name "mozilla-foundation/common_voice_16_1"
--train_dataset_config_name "es"
--train_split_name "train"
--text_column_name "sentence"
--eval_dataset_name "mozilla-foundation/common_voice_16_1"
--eval_dataset_config_name "es"
--eval_split_name "validation"
--eval_text_column_name "sentence"
--eval_steps 5000
--save_steps 5000
--warmup_steps 500
--learning_rate 1e-4
--lr_scheduler_type "linear"
--logging_steps 25
--save_total_limit 1
--max_steps 40000
```
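
For context, the sketch below shows how these parameters would typically be assembled into a single call to the `run_distillation.py` script from the [Distil-Whisper training code](https://github.com/huggingface/distil-whisper/tree/main/training). The `accelerate launch` wrapper, the student checkpoint path, and the `--model_name_or_path`/`--output_dir` arguments are assumptions added for illustration (a real run would also need batch-size and precision arguments); it is not a record of the exact command used:

```bash
# Hypothetical invocation assembled from the parameters listed above.
# The student checkpoint path and output directory are placeholders.
accelerate launch run_distillation.py \
  --model_name_or_path "./distil-large-v3-es-init" \
  --teacher_model_name_or_path "openai/whisper-large-v3" \
  --train_dataset_name "mozilla-foundation/common_voice_16_1" \
  --train_dataset_config_name "es" \
  --train_split_name "train" \
  --text_column_name "sentence" \
  --eval_dataset_name "mozilla-foundation/common_voice_16_1" \
  --eval_dataset_config_name "es" \
  --eval_split_name "validation" \
  --eval_text_column_name "sentence" \
  --eval_steps 5000 \
  --save_steps 5000 \
  --warmup_steps 500 \
  --learning_rate 1e-4 \
  --lr_scheduler_type "linear" \
  --logging_steps 25 \
  --save_total_limit 1 \
  --max_steps 40000 \
  --output_dir "./distil-whisper-large-v3-es"
```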

## Results

The distilled model achieves a normalized WER of 5.874%. Further training would likely yield better results.

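To make the metric concrete, here is a minimal sketch of how a normalized WER is typically computed with the 🤗 Evaluate library (`pip install evaluate jiwer`) and Whisper's `BasicTextNormalizer` from Transformers; the example strings are illustrative only, and this is not the exact evaluation script behind the 5.874% figure:

```python
# Minimal sketch: "normalized" WER means both references and predictions are
# normalized (lower-cased, punctuation stripped) before scoring.
# The strings below are toy examples, not evaluation data.
import evaluate
from transformers.models.whisper.english_normalizer import BasicTextNormalizer

wer_metric = evaluate.load("wer")
normalizer = BasicTextNormalizer()  # lower-cases text and removes punctuation/symbols

references = ["Hola, ¿cómo estás?"]  # ground-truth transcript
predictions = ["hola cómo estás"]    # model output

norm_references = [normalizer(ref) for ref in references]
norm_predictions = [normalizer(pred) for pred in predictions]

wer = 100 * wer_metric.compute(references=norm_references, predictions=norm_predictions)
print(f"Normalized WER: {wer:.3f}%")  # 0.000% for this toy pair
```
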
## License

Distil-Whisper inherits the [MIT license](https://github.com/huggingface/distil-whisper/blob/main/LICENSE) from OpenAI's Whisper model.

## Citation

If you use this model, please consider citing the [Distil-Whisper paper](https://arxiv.org/abs/2311.00430):
```
@misc{gandhi2023distilwhisper,
      title={Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling},
      author={Sanchit Gandhi and Patrick von Platen and Alexander M. Rush},
      year={2023},
      eprint={2311.00430},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```