Automatic Speech Recognition
Transformers
Safetensors
French
whisper
asr
Eval Results
Inference Endpoints
trip-fontaine committed on
Commit
88b6b00
1 Parent(s): d0d4c67

readme update

Files changed (1)
  1. README.md +66 -21
README.md CHANGED
      type: wer
      value: 7.984
---

# Distil-Whisper: distil-large-v3-fr

Distil-Whisper for English Automatic Speech Recognition (ASR) was proposed in the paper [Robust Knowledge Distillation via Large-Scale Pseudo Labelling](https://arxiv.org/abs/2311.00430).
 
| Model | Params (M) | Rel. Latency | Short-Form WER | Long-Form WER |
| :----: | :----: | :----: | :----: | :----: |
| whisper-small | 242 | 2.3 | 16.36 | 12.47 |
| whisper-medium | 764 | 1.3 | 11.53 | 10.77 |
| whisper-large-v3 | 1540 | 1.0 | 7.84 | 9.07 |
| **distil-large-v3-fr** | **756** | **5.9** | **9.34** | **11.13** |

*Latencies benchmarked to generate 128 tokens on an A100 40GB with a batch size of 1. More details about inference performance are given in the [inference speed](#inference-speed) section.
*WERs are averaged over the test sets. More details are given in the [short-form](#short-form) and [long-form](#long-form) results sections.

## Table of Contents

1. [Transformers Usage](#transformers-usage)
   * [Short-Form Transcription](#short-form-transcription)
   * [Sequential Long-Form](#sequential-long-form)
   * [Chunked Long-Form](#chunked-long-form)
   * [Speculative Decoding](#speculative-decoding)
   * [Additional Speed and Memory Improvements](#additional-speed--memory-improvements)
2. [Library Integrations](#library-integrations)
   * [Whisper.cpp](#whispercpp)
   * [Transformers.js](#transformersjs)
3. [Model Details](#model-details)
   * [Architecture](#architecture)
   * [Training](#training)
4. [Results](#results)
   * [Short-Form](#short-form)
   * [Long-Form](#long-form)
   * [Inference Speed](#inference-speed)
5. [License](#license)


## Transformers Usage
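As a quick orientation for this section, the snippet below is a minimal short-form transcription sketch using the `pipeline` API. The checkpoint id `eustlb/distil-large-v3-fr` is an assumption inferred from the GGML repository name used later in this card; adjust it if the hosted repository differs.

```python
# Minimal short-form transcription sketch (checkpoint id is assumed, see above).
import torch
from transformers import pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

transcriber = pipeline(
    "automatic-speech-recognition",
    model="eustlb/distil-large-v3-fr",  # assumed repo id
    torch_dtype=torch_dtype,
    device=device,
)

# Any local audio file shorter than ~30 s is handled as a single short-form pass.
print(transcriber("audio.wav")["text"])
```

For audio longer than 30 seconds, refer to the sequential and chunked long-form entries listed in the table of contents.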
 
## Library Integrations

### Whisper.cpp

Steps for getting started:

1. Clone the Whisper.cpp repository:
```bash
git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
```

2. Install the Hugging Face Hub Python package:

```bash
pip install --upgrade huggingface_hub
```
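The converted weights can then be fetched from the Hub in Python with `hf_hub_download`; a minimal sketch, using the same repository id and filename as the `wget` command below:

```python
# Download the converted GGML weights into whisper.cpp's models/ directory.
# local_dir is an illustrative choice; the repo id and filename match the
# wget command shown below.
from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id="eustlb/distil-large-v3-fr-ggml",
    filename="ggml-distil-large-v3-fr.bin",
    local_dir="./models",
)
```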
 
Note that if you do not have a Python environment set up, you can also download the weights directly with `wget`:

```bash
wget https://huggingface.co/eustlb/distil-large-v3-fr-ggml/resolve/main/ggml-distil-large-v3-fr.bin -P ./models
```

3. Run inference:

```bash
wget https://huggingface.co/spaces/eustlb/whisper-vs-distil-whisper-fr/resolve/main/assets/example_1.wav
make -j && ./main -m models/ggml-distil-large-v3-fr.bin -f example_1.wav
```

### Transformers.js
 
Refer to the Transformers.js [docs](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.AutomaticSpeechRecognitionPipeline)
for further information.

## Training

### Architecture

Distil-Whisper inherits the encoder-decoder architecture from Whisper. The encoder maps a sequence of speech vector inputs to a sequence of hidden-state vectors. The decoder auto-regressively predicts text tokens, conditional on all previous tokens and the encoder hidden-states. Consequently, the encoder is only run forward once, whereas the decoder is run as many times as the number of tokens generated. In practice, this means the decoder accounts for over 90% of total inference time. Thus, to optimise for latency, the focus is on minimising the inference time of the decoder.

To distill the Whisper model, we reduce the number of decoder layers while keeping the encoder fixed. The encoder (shown in green) is entirely copied from the teacher to the student and frozen during training. The student's decoder structure is copied from whisper-large-v3, with the only difference being a reduction from 32 to 2 decoder layers. These layers are initialized from distil-large-v3 to leverage language transfer from English to French (more details [here](https://github.com/huggingface/distil-whisper/tree/main/training#22-language-transfer)).
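The project's initialisation and training code is linked above; purely as an illustration of this layer-copying idea, a rough sketch with the `transformers` Whisper classes (checkpoint ids are the public OpenAI and Distil-Whisper repositories, and this is not the script actually used):

```python
# Rough sketch of the student initialisation described above: copy the
# whisper-large-v3 encoder (frozen) and take the two decoder layers from
# distil-large-v3. Loading both checkpoints needs a large amount of RAM.
from transformers import WhisperConfig, WhisperForConditionalGeneration

teacher = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")
donor = WhisperForConditionalGeneration.from_pretrained("distil-whisper/distil-large-v3")

config = WhisperConfig.from_pretrained("openai/whisper-large-v3", decoder_layers=2)
student = WhisperForConditionalGeneration(config)

# Encoder: copied from the teacher and frozen during training.
student.model.encoder.load_state_dict(teacher.model.encoder.state_dict())
for param in student.model.encoder.parameters():
    param.requires_grad = False

# Decoder: two layers initialised from distil-large-v3 (English-to-French transfer).
student.model.decoder.load_state_dict(donor.model.decoder.state_dict())
```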

### Data

distil-large-v3-fr is trained on 4,515 hours of audio data from three open-source, permissively licensed speech datasets on the
Hugging Face Hub:

* [Common Voice 17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0)
* [MultiLingual LibriSpeech](https://huggingface.co/datasets/facebook/multilingual_librispeech)
* [YODAS fr000 split](https://huggingface.co/datasets/espnet/yodas)
 
The audio data is then pseudo-labelled using the Whisper large-v3 model: we use it to generate predictions for the audio in our training set and use these as the target labels during training. Using pseudo-labels ensures that the
transcriptions are consistently formatted across datasets and provides sequence-level distillation signal during training.

### WER Filter

The Whisper pseudo-label predictions are subject to mis-transcriptions and hallucinations. To ensure we only train on
accurate pseudo-labels, we employ a simple WER heuristic during training. First, we normalise the Whisper pseudo-labels
and the ground truth labels provided by each dataset. We then compute the WER between these labels. If the WER exceeds
a specified threshold, we discard the training example. Otherwise, we keep it for training.

For this training we chose a WER threshold of 20%, resulting in an **effective training set of 2110 hours** (750 for [Common Voice 17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0), 1040 for [MultiLingual LibriSpeech](https://huggingface.co/datasets/facebook/multilingual_librispeech) and 320 for [YODAS fr000 split](https://huggingface.co/datasets/espnet/yodas)).
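A minimal sketch of this heuristic, assuming `jiwer` for the WER computation and the `BasicTextNormalizer` shipped with Transformers as the normaliser (the exact normalisation used for training is not specified in this card):

```python
# Illustrative WER filter: keep a training example only if the normalised
# pseudo-label is close enough to the normalised ground-truth transcription.
from jiwer import wer
from transformers.models.whisper.english_normalizer import BasicTextNormalizer

normalizer = BasicTextNormalizer()
WER_THRESHOLD = 0.20  # the 20% threshold used for this training run

def keep_example(ground_truth: str, pseudo_label: str) -> bool:
    ref = normalizer(ground_truth)
    hyp = normalizer(pseudo_label)
    if not ref:  # avoid dividing by zero on empty references
        return False
    return wer(ref, hyp) <= WER_THRESHOLD

print(keep_example("bonjour tout le monde", "bonjour tout le monde"))  # True
print(keep_example("bonjour tout le monde", "merci de vous abonner"))  # False (hallucination)
```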

Section 9.2 of the [Distil-Whisper paper](https://arxiv.org/abs/2311.00430) demonstrates the effectiveness of this filter
for improving downstream performance of the distilled model. We also partially attribute Distil-Whisper's robustness to
hallucinations to this filter.

### Training procedure

The model was trained for 18,000 optimisation steps (or 14 epochs) with a batch size of 256. We saved the best model, based on the global WER score on the validation splits, reached after 14,000 optimisation steps (or 11 epochs). See the [Distil-Whisper paper](https://arxiv.org/abs/2311.00430) for more details (training objective, etc.).
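As a back-of-the-envelope check of how these figures relate, assuming an epoch is one pass over the filtered training set described above:

```python
# Consistency check of the reported steps / epochs / batch size
# (assumption: one epoch = one pass over the filtered training set).
steps, batch_size, epochs = 18_000, 256, 14
examples_seen = steps * batch_size            # 4,608,000 training examples processed
examples_per_epoch = examples_seen / epochs   # ~329,000 examples per pass over the data
print(f"{examples_seen:,} examples seen, ~{examples_per_epoch:,.0f} per epoch")
```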

## Results
 
The distilled model performs to within 1% WER of Whisper large-v3 on out-of-distribution short-form audio, and within
2.5% WER on out-of-distribution sequential long-form decoding.

### Evaluation methodology

The model has been tested for both in-distribution (Common Voice 17 and Multilingual Librispeech) and out-of-distribution (Fleurs, Voxpopuli, custom [long-form test set](https://huggingface.co/datasets/speech-recognition-community-v2/dev_data)) short-form and long-form transcription performance.

**Short-form evaluations** are conducted on the four datasets above, first applying a filter to exclude samples longer than 30 seconds.

**Long-form evaluation** is conducted on a custom out-of-distribution [long-form test set](https://huggingface.co/datasets/eustlb/french-long-form-test).
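A sketch of this short-form protocol, with assumed dataset and checkpoint ids and without the exact text normalisation behind the reported numbers:

```python
# Short-form evaluation sketch: keep samples of at most 30 s, transcribe them,
# and compute a corpus-level WER. Dataset/checkpoint ids are illustrative.
from datasets import Audio, load_dataset
from jiwer import wer
from transformers import pipeline

transcriber = pipeline("automatic-speech-recognition", model="eustlb/distil-large-v3-fr")

ds = load_dataset("google/fleurs", "fr_fr", split="test")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
ds = ds.filter(lambda ex: len(ex["audio"]["array"]) / ex["audio"]["sampling_rate"] <= 30.0)

references, hypotheses = [], []
for example in ds.select(range(16)):  # small subset to keep the sketch cheap
    references.append(example["transcription"])
    hypotheses.append(transcriber(example["audio"])["text"])

print(f"WER: {100 * wer(references, hypotheses):.2f}%")
```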

### Short-Form

| Model Name | RTF | Common Voice 17 | Multilingual Librispeech | Voxpopuli | Fleurs |
| :----------------: | :-----: | :-------------: | :----------------------: | :-------: | :----: |
| distil-large-v3-fr | 310.127 | 12.681 | 5.865 | 10.851 | 7.984 |
| whisper-tiny | 280.576 | 56.757 | 37.512 | 32.505 | 46.173 |
| whisper-base | 261.235 | 42.447 | 25.2 | 26.434 | 27.851 |
| whisper-small | 249.676 | 22.469 | 14.097 | 14.61 | 14.283 |
 
### Long-Form

| Model Name | RTF | [long-form test set](https://huggingface.co/datasets/eustlb/french-long-form-test) |
| :----------------: | :-----: | :--------------------------------------------------------------------------------: |
| distil-large-v3-fr | 169.692 | 11.385 |
| whisper-tiny | 125.367 | 28.277 |
| whisper-base | 110.139 | 19.228 |
| whisper-small | 83.417 | 12.467 |
| whisper-medium | 56.677 | 10.772 |
| whisper-large-v3 | 41.805 | 9.073 |
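The RTF column is read here as an inverse real-time factor, i.e. seconds of audio transcribed per second of wall-clock compute, so higher is faster; this reading is an assumption rather than something stated in the card:

```python
# Inverse real-time factor (assumed convention for the RTF columns above).
def rtf(audio_seconds: float, processing_seconds: float) -> float:
    return audio_seconds / processing_seconds

# One hour of audio processed in ~21 s lands in the range reported for distil-large-v3-fr.
print(rtf(audio_seconds=3600.0, processing_seconds=21.2))  # ~169.8
```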

### Inference speed

Reported latencies were benchmarked on a 40GB NVIDIA A100, generating 128 tokens with SDPA, bfloat16, 3 warmup steps, 5 measures, and one beam.
The benchmarking script can be found [here](https://gist.github.com/eustlb/ef06f00858cbae4d8743f5024be869ec). The time measured is the time to do one forward pass of the encoder and 128 autoregressive forward passes of the decoder.
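A sketch of that measurement protocol (an illustration rather than the linked benchmarking script; the checkpoint id is again assumed):

```python
# Latency sketch: SDPA attention, bfloat16, 1 beam, exactly 128 generated tokens,
# 3 warmup runs and 5 measured runs on a CUDA device.
import time

import numpy as np
import torch
from transformers import AutoProcessor, WhisperForConditionalGeneration

model_id = "eustlb/distil-large-v3-fr"  # assumed repo id
processor = AutoProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, attn_implementation="sdpa"
).to("cuda")

# 30 s of silence is enough to exercise one encoder pass + 128 decoder passes.
inputs = processor(np.zeros(16_000 * 30, dtype=np.float32), sampling_rate=16_000, return_tensors="pt")
input_features = inputs.input_features.to("cuda", torch.bfloat16)

def timed_generate() -> float:
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(input_features, min_new_tokens=128, max_new_tokens=128, num_beams=1)
    torch.cuda.synchronize()
    return time.perf_counter() - start

for _ in range(3):  # warmup
    timed_generate()
latencies = [timed_generate() for _ in range(5)]
print(f"mean latency over 5 runs: {sum(latencies) / len(latencies):.3f} s")
```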

<p align="center">
  <img src="https://huggingface.co/eustlb/whisper-large-v3-fr/resolve/main/assets/relative_latencies.png" alt="latencies" width="80%">
</p>

 
* [Joshua Lochner](https://huggingface.co/xenova) for the Transformers.js integration
* [Laurent Mazare](https://huggingface.co/lmz) for the Candle integration
* [Vaibhav Srivastav](https://huggingface.co/reach-vb) for Distil-Whisper distribution
* [Raghav Sonavane](https://huggingface.co/rsonavane/distil-whisper-large-v2-8-ls) for an early iteration of Distil-Whisper on the LibriSpeech datasets