sanchit-gandhi committed
Commit 9cf7d4e
1 Parent(s): 3f45efe

Update README.md

Files changed (1): README.md (+44 -40)
README.md CHANGED
@@ -24,17 +24,16 @@ OpenAI's [Whisper large-v3](https://huggingface.co/openai/whisper-large-v3), the
  to date.
  
  Compared to previous Distil-Whisper models, the distillation procedure for distil-large-v3 has been adapted to give
- **superior long-form transcription accuracy** with OpenAI's **sequential long-form algorithm**. The result is a distilled model
- that performs to within 1% WER of large-v3 on long-form audio, independent of the transcription algorithm, and
- outperforms distil-large-v2 by 4.8% and 0.7% using the sequential and chunked long-form algorithms respectively.
- Furthermore, the model is also faster than previous Distil-Whisper models: **5.7x faster than large-v3**, and 1.1x faster
- than distil-large-v2.
+ **superior long-form transcription accuracy** with OpenAI's **sequential long-form algorithm**. The result is a distilled
+ model that performs to within 1% WER of large-v3 on long-form audio using both the sequential and chunked algorithms, and
+ outperforms distil-large-v2 by 4.8% using the sequential algorithm. The model is also faster than previous Distil-Whisper
+ models: **6.3x faster than large-v3**, and 1.1x faster than distil-large-v2.
  
  | Model | Params / M | Rel. Latency | Short-Form | Sequential Long-Form | Chunked Long-Form |
  |------------------------------------------------------------------------------|------------|--------------|------------|----------------------|-------------------|
  | [large-v3](https://huggingface.co/openai/whisper-large-v3) | 1550 | 1.0 | 8.4 | 10.0 | 11.0 |
- | **[distil-large-v3](https://huggingface.co/distil-whisper/distil-large-v3)** | **756** | **5.7** | **9.8** | **10.8** | **10.9** |
- | [distil-large-v2](https://huggingface.co/distil-whisper/distil-large-v2) | 756 | 5.2 | 9.6 | 15.6 | 11.6 |
+ | **[distil-large-v3](https://huggingface.co/distil-whisper/distil-large-v3)** | **756** | **6.3** | **9.7** | **10.8** | **10.9** |
+ | [distil-large-v2](https://huggingface.co/distil-whisper/distil-large-v2) | 756 | 5.8 | 9.6 | 15.6 | 11.6 |
  
  Since the sequential algorithm is the "de-facto" transcription algorithm across the most popular Whisper libraries
  (Whisper cpp, Faster-Whisper, OpenAI Whisper), this distilled model is designed to be compatible across all library types.
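
The headline figures in the rewritten paragraph follow directly from the two long-form WER columns of the table; as a quick illustrative check (the values below are copied from the table, nothing is re-measured):

```python
# Long-form WER values copied from the table above (lower is better).
wer = {
    "large-v3": {"sequential": 10.0, "chunked": 11.0},
    "distil-large-v3": {"sequential": 10.8, "chunked": 10.9},
    "distil-large-v2": {"sequential": 15.6, "chunked": 11.6},
}

# Gap to large-v3 stays within 1% absolute WER for both long-form algorithms.
for algo in ("sequential", "chunked"):
    gap = round(wer["distil-large-v3"][algo] - wer["large-v3"][algo], 1)
    print(algo, gap)  # sequential 0.8, chunked -0.1

# Improvement over distil-large-v2 with the sequential algorithm: 15.6 - 10.8 = 4.8% absolute WER.
print(round(wer["distil-large-v2"]["sequential"] - wer["distil-large-v3"]["sequential"], 1))
```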
@@ -49,15 +48,15 @@ with instructions for getting started [here](#library-integrations).
  * [Sequential Long-Form](#sequential-long-form)
  * [Chunked Long-Form](#chunked-long-form)
  * [Speculative Decoding](#speculative-decoding)
- 2. [Additional Speed and Memory Improvements](#additional-speed--memory-improvements)
- 3. [Library Integrations](#library-integrations)
- * [OpenAI Whisper](#openai-whisper)
+ * [Additional Speed and Memory Improvements](#additional-speed--memory-improvements)
+ 2. [Library Integrations](#library-integrations)
  * [Whisper cpp](#whispercpp)
  * [Faster Whisper](#faster-whisper)
+ * [OpenAI Whisper](#openai-whisper)
  * [Transformers.js](#transformersjs)
  * [Candle](#candle)
- 4. [Model Details](#model-details)
- 5. [License](#license)
+ 3. [Model Details](#model-details)
+ 4. [License](#license)
  
  ## Transformers Usage
  
@@ -251,27 +250,27 @@ dataset = dataset.cast_column("audio", Audio(processor.feature_extractor.samplin
  sample = dataset[0]["audio"]
  
  inputs = processor(
-     sample["array"],
-     sampling_rate=sample["sampling_rate"],
-     return_tensors="pt",
-     truncation=False,
-     padding="longest",
-     return_attention_mask=True,
+     sample["array"],
+     sampling_rate=sample["sampling_rate"],
+     return_tensors="pt",
+     truncation=False,
+     padding="longest",
+     return_attention_mask=True,
  )
  inputs = inputs.to(device, dtype=torch_dtype)
  
  gen_kwargs = {
-     "max_new_tokens": 448,
-     "num_beams": 1, # set > 1 for beam-search
-     "condition_on_prev_tokens": False, # condition for previous context
-     "compression_ratio_threshold": 1.35, # zlib compression ratio threshold (in token space)
-     "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0), # temperature fallback
-     "logprob_threshold": -1.0,
-     "no_speech_threshold": 0.6,
-     "return_timestamps": True,
+     "max_new_tokens": 448,
+     "num_beams": 1, # set > 1 for beam-search
+     "condition_on_prev_tokens": False, # condition for previous context
+     "compression_ratio_threshold": 1.35, # zlib compression ratio threshold (in token space)
+     "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0), # temperature fallback
+     "logprob_threshold": -1.0,
+     "no_speech_threshold": 0.6,
+     "return_timestamps": True,
  }
  
- pred_ids = model.generate(**inputs, **gen_kwargs)
+ pred_ids = model.generate(**inputs, **gen_kwargs)
  pred_text = processor.batch_decode(pred_ids, skip_special_tokens=True, decode_with_timestamps=False)
  
  print(pred_text)
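
The hunk above starts mid-way through the README's sequential long-form example, so `model`, `processor`, `device`, `torch_dtype` and `dataset` are all defined earlier in that snippet. A minimal sketch of the assumed set-up, using the standard Transformers API (the dataset choice here is illustrative):

```python
import torch
from datasets import Audio, load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v3"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)

# Any long-form (> 30 s) speech dataset works; this one is used in the Distil-Whisper examples.
dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
dataset = dataset.cast_column("audio", Audio(processor.feature_extractor.sampling_rate))
```

With that in place, the `gen_kwargs` in the hunk mirror OpenAI's sequential long-form settings (temperature fallback plus compression-ratio, log-prob and no-speech thresholds).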
@@ -379,13 +378,13 @@ print(result["text"])
  
  For more details on speculative decoding, refer to the blog post [Speculative Decoding for 2x Faster Whisper Inference](https://huggingface.co/blog/whisper-speculative-decoding).
  
- ## Additional Speed & Memory Improvements
+ ### Additional Speed & Memory Improvements
  
  You can apply additional speed and memory improvements to Distil-Whisper to further reduce the inference time and VRAM
  requirements. These optimisations primarily target the attention kernel, swapping it from an eager implementation to a
  more efficient flash attention version.
  
- ### Flash Attention 2
+ #### Flash Attention 2
  
  We recommend using [Flash-Attention 2](https://huggingface.co/docs/transformers/main/en/perf_infer_gpu_one#flashattention-2)
  if your GPU allows for it. To do so, you first need to install [Flash Attention](https://github.com/Dao-AILab/flash-attention):
@@ -401,14 +400,16 @@ Then pass `attn_implementation="flash_attention_2"` to `from_pretrained`:
  + model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True, attn_implementation="flash_attention_2")
  ```
  
- ### Torch Scaled Dot-Product Attention (SDPA)
+ #### Torch Scaled Dot-Product Attention (SDPA)
  
  If your GPU does not support Flash Attention, we recommend making use of PyTorch [scaled dot-product attention (SDPA)](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html).
  This attention implementation is activated **by default** for PyTorch versions 2.1.1 or greater. To check
- whether you have a compatible PyTorch version, run the following command:
+ whether you have a compatible PyTorch version, run the following Python code snippet:
  
- ```bash
- python -c "from transformers.utils import is_torch_sdpa_available; print(is_torch_sdpa_available())"
+ ```python
+ from transformers.utils import is_torch_sdpa_available
+ 
+ print(is_torch_sdpa_available())
  ```
  
  If the above returns `True`, you have a valid version of PyTorch installed and SDPA is activated by default. If it
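
As a rough cross-check of the requirement quoted above (SDPA needs PyTorch 2.1.1 or greater), the installed `torch` version can also be inspected directly; this is an illustrative alternative to `is_torch_sdpa_available()`, not a snippet from the README:

```python
import torch
from packaging.version import parse

# SDPA attention is used by default in Transformers when torch >= 2.1.1 (see the prose above).
print(parse(torch.__version__) >= parse("2.1.1"))
```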
@@ -422,11 +423,11 @@ Once a valid PyTorch version is installed, SDPA is activated by default. It can
  + model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True, attn_implementation="sdpa")
  ```
  
- ### Torch compile
+ #### Torch compile
  
  Coming soon...
  
- ### 4-bit and 8-bit Inference
+ #### 4-bit and 8-bit Inference
  
  Coming soon...
  
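The `attn_implementation="sdpa"` line at the top of this hunk is a fragment of the README's diff-style snippet; a self-contained version of the same load call, assuming the model id and dtype used elsewhere in the README, would look roughly like:

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq

model_id = "distil-whisper/distil-large-v3"

# Explicitly request the SDPA attention kernel (the default on PyTorch >= 2.1.1 anyway).
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    use_safetensors=True,
    attn_implementation="sdpa",
)
```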
@@ -445,16 +446,18 @@ Steps for getting started:
  git clone https://github.com/ggerganov/whisper.cpp.git
  cd whisper.cpp
  ```
- 2. Download the ggml weights for distil-large-v3 from the Hugging Face Hub:
+ 2. Download the GGML weights for distil-large-v3 from the Hugging Face Hub using the following Python snippet:
  
- ```bash
- python -c "from huggingface_hub import hf_hub_download; hf_hub_download(repo_id='distil-whisper/distil-large-v3', filename='ggml-distil-large-v3.bin', local_dir='./models')"
+ ```python
+ from huggingface_hub import hf_hub_download
+ 
+ hf_hub_download(repo_id='distil-whisper/distil-large-v3-ggml', filename='ggml-distil-large-v3.bin', local_dir='./models')
  ```
  
  Note that if you do not have the `huggingface_hub` package installed, you can also download the weights with `wget`:
  
  ```bash
- wget https://huggingface.co/distil-whisper/distil-large-v3/resolve/main/ggml-distil-large-v3.bin -P ./models
+ wget https://huggingface.co/distil-whisper/distil-large-v3-ggml/resolve/main/ggml-distil-large-v3.bin -P ./models
  ```
  
  3. Run inference using the provided sample audio:
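
One small addition to step 2: `hf_hub_download` returns the local path of the file it fetched, which can be captured and reused when pointing Whisper cpp at the weights in step 3 (illustrative, not part of the README):

```python
from huggingface_hub import hf_hub_download

# Returns the path of the downloaded file under local_dir, e.g. "./models/ggml-distil-large-v3.bin".
model_path = hf_hub_download(
    repo_id="distil-whisper/distil-large-v3-ggml",
    filename="ggml-distil-large-v3.bin",
    local_dir="./models",
)
print(model_path)
```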
@@ -834,8 +837,9 @@ If you use this model, please consider citing the [Distil-Whisper paper](https:/
  * OpenAI for the Whisper [model](https://huggingface.co/openai/whisper-large-v3), in particular Jong Wook Kim for the [original codebase](https://github.com/openai/whisper) and training discussions
  * Hugging Face 🤗 [Transformers](https://github.com/huggingface/transformers) for the model integration
  * [Georgi Gerganov](https://huggingface.co/ggerganov) for the Whisper cpp integration
+ * [Systran team](https://github.com/SYSTRAN) for the Faster-Whisper integration
  * [Joshua Lochner](https://huggingface.co/xenova) for the Transformers.js integration
  * [Laurent Mazare](https://huggingface.co/lmz) for the Candle integration
  * [Vaibhav Srivastav](https://huggingface.co/reach-vb) for Distil-Whisper distribution
  * Google's [TPU Research Cloud (TRC)](https://sites.research.google/trc/about/) programme for Cloud TPU v4 compute resource
- * [Raghav Sonavane](https://huggingface.co/rsonavane/distil-whisper-large-v2-8-ls) for an early iteration of Distil-Whisper on the LibriSpeech dataset
+ * [Raghav Sonavane](https://huggingface.co/rsonavane/distil-whisper-large-v2-8-ls) for an early iteration of Distil-Whisper on the LibriSpeech dataset
 