trip-fontaine committed
Commit 88b6b00 • Parent(s): d0d4c67 • readme update

README.md CHANGED
@@ -72,6 +72,7 @@ model-index:
        type: wer
        value: 7.984
---

# Distil-Whisper: distil-large-v3-fr

Distil-Whisper for English Automatic Speech Recognition (ASR) was proposed in the paper [Robust Knowledge Distillation via Large-Scale Pseudo Labelling](https://arxiv.org/abs/2311.00430).

@@ -87,11 +88,30 @@ The result is a distilled model that performs within **2% WER of whisper-large-v

| whisper-small | 242 | 2.3 | 16.36 | 12.47 |
| whisper-medium | 764 | 1.3 | 11.53 | 10.77 |
| whisper-large-v3 | 1540 | 1.0 | 7.84 | 9.07 |
| **distil-large-v3-fr** | **756** | **5.9** | **9.34** | **11.13** |

*Latencies benchmarked to generate 128 tokens on an A100 40GB with a batch size of 1. More details on inference performance are given in the [inference speed](#inference-speed) section.
*WERs are averaged over the test sets. More details are given in the [short-form](#short-form) and [long-form](#long-form) results sections.

## Table of Contents

1. [Transformers Usage](#transformers-usage)
   * [Short-Form Transcription](#short-form-transcription)
   * [Sequential Long-Form](#sequential-long-form)
   * [Chunked Long-Form](#chunked-long-form)
   * [Speculative Decoding](#speculative-decoding)
   * [Additional Speed and Memory Improvements](#additional-speed--memory-improvements)
2. [Library Integrations](#library-integrations)
   * [Whisper.cpp](#whispercpp)
   * [Transformers.js](#transformersjs)
3. [Model Details](#model-details)
   * [Architecture](#architecture)
   * [Training](#training)
4. [Results](#results)
   * [Short-Form](#short-form)
   * [Long-Form](#long-form)
   * [Inference Speed](#inference-speed)
5. [License](#license)

## Transformers Usage

@@ -481,11 +501,13 @@ than Whisper large-v3, while performing to within 0.8% WER over long-form audio.

Steps for getting started:

1. Clone the Whisper.cpp repository:
```bash
git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
```

2. Install the Hugging Face Hub Python package:

```bash
pip install --upgrade huggingface_hub
```

@@ -500,8 +522,15 @@ hf_hub_download(repo_id='eustlb/distil-large-v3-fr-ggml', filename='ggml-distil-
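
The hunk context above shows that the weights are fetched with `huggingface_hub`'s `hf_hub_download`. A minimal sketch of that call; the filename and target directory are taken from the `wget` command below and should be treated as assumptions rather than the card's exact snippet:

```python
# Sketch: download the GGML weights with huggingface_hub.
# repo_id comes from the hunk context above; filename and local_dir mirror
# the wget command below and are assumptions, not the card's verbatim snippet.
from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id="eustlb/distil-large-v3-fr-ggml",
    filename="ggml-distil-large-v3-fr.bin",
    local_dir="./models",
)
```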

Note that if you do not have a Python environment set up, you can also download the weights directly with `wget`:

```bash
wget https://huggingface.co/eustlb/distil-large-v3-fr-ggml/resolve/main/ggml-distil-large-v3-fr.bin -P ./models
```

3. Run inference:

```bash
wget https://huggingface.co/spaces/eustlb/whisper-vs-distil-whisper-fr/resolve/main/assets/example_1.wav
make -j && ./main -m models/ggml-distil-large-v3-fr.bin -f example_1.wav
```

### Transformers.js

@@ -528,7 +557,15 @@ const output = await transcriber(url);

Refer to the Transformers.js [docs](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.AutomaticSpeechRecognitionPipeline)
for further information.

## Training

### Architecture

Distil-Whisper inherits the encoder-decoder architecture from Whisper. The encoder maps a sequence of speech vector inputs to a sequence of hidden-state vectors. The decoder auto-regressively predicts text tokens, conditional on all previous tokens and the encoder hidden-states. Consequently, the encoder is only run forward once, whereas the decoder is run as many times as the number of tokens generated. In practice, this means the decoder accounts for over 90% of total inference time. Thus, to optimise for latency, the focus is on minimising the inference time of the decoder.

To distill the Whisper model, we reduce the number of decoder layers while keeping the encoder fixed. The encoder (shown in green) is entirely copied from the teacher to the student and frozen during training. The student's decoder structure is copied from whisper-large-v3, with the only difference being a reduction from 32 to 2 decoder layers. These layers are initialized from distil-large-v3 to leverage language transfer from English to French (more details [here](https://github.com/huggingface/distil-whisper/tree/main/training#22-language-transfer)).
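
To make the layer counts above concrete, the released checkpoint's configuration can be inspected directly. A minimal sketch, assuming the checkpoint is published under `eustlb/distil-large-v3-fr`:

```python
# Sketch: inspect the student architecture described above.
# The repo id "eustlb/distil-large-v3-fr" is an assumption.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("eustlb/distil-large-v3-fr")
# Encoder kept from whisper-large-v3; decoder reduced to 2 layers.
print(config.encoder_layers)  # expected: 32
print(config.decoder_layers)  # expected: 2
```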

### Data

distil-large-v3-fr is trained on 4,515 hours of audio data from three open-source, permissively licensed speech datasets on the
Hugging Face Hub:

@@ -545,23 +582,23 @@ The audio data is then pseudo-labelled using the Whisper large-v3 model: we use
the audio in our training set and use these as the target labels during training. Using pseudo-labels ensures that the
transcriptions are consistently formatted across datasets and provides a sequence-level distillation signal during training.

### WER Filter

The Whisper pseudo-label predictions are subject to mis-transcriptions and hallucinations. To ensure we only train on
accurate pseudo-labels, we employ a simple WER heuristic during training. First, we normalise the Whisper pseudo-labels
and the ground truth labels provided by each dataset. We then compute the WER between these labels. If the WER exceeds
a specified threshold, we discard the training example. Otherwise, we keep it for training.

For this training we chose a WER threshold of 20%, resulting in an **effective training set of 2110 hours** (750 for [Common Voice 17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0), 1040 for [Multilingual LibriSpeech](https://huggingface.co/datasets/facebook/multilingual_librispeech) and 320 for the [YODAS fr000 split](https://huggingface.co/datasets/espnet/yodas)).
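
As an illustration of this heuristic, the filtering rule amounts to something like the following minimal sketch. It uses `jiwer` for the WER computation and a deliberately simplified normaliser, which is an assumption and not the exact normalisation used for training:

```python
# Sketch of the WER-based filtering heuristic described above.
# The normalisation below is a simplified stand-in for the real text normaliser.
import re
from jiwer import wer  # pip install jiwer

def normalise(text: str) -> str:
    text = text.lower()
    return re.sub(r"[^\w\s']", "", text).strip()

def keep_example(ground_truth: str, pseudo_label: str, threshold: float = 0.20) -> bool:
    """Keep a training example only if the WER between the normalised ground truth
    and the normalised Whisper pseudo-label does not exceed the threshold."""
    return wer(normalise(ground_truth), normalise(pseudo_label)) <= threshold

print(keep_example("bonjour tout le monde", "bonjour tout le monde"))    # True  (WER = 0)
print(keep_example("bonjour tout le monde", "au revoir à tous demain"))  # False (WER > 20%)
```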

Section 9.2 of the [Distil-Whisper paper](https://arxiv.org/abs/2311.00430) demonstrates the effectiveness of this filter
for improving downstream performance of the distilled model. We also partially attribute Distil-Whisper's robustness to
hallucinations to this filter.

### Training procedure

The model was trained for 18,000 optimisation steps (or 14 epochs) with a batch size of 256. We saved the best model, selected on the global WER score over the validation splits, which was reached after 14,000 optimisation steps (or 11 epochs). See the [Distil-Whisper paper](https://arxiv.org/abs/2311.00430) for more details (training objective, etc.).

## Results

@@ -569,11 +606,19 @@ The distilled model performs to within 1% WER of Whisper large-v3 on out-of-dist
2.5% WER on out-of-distribution sequential long-form decoding.

### Evaluation methodology

The model has been tested for both in-distribution (Common Voice 17 and Multilingual LibriSpeech) and out-of-distribution (Fleurs, Voxpopuli, and a custom [long-form test set](https://huggingface.co/datasets/speech-recognition-community-v2/dev_data)) short-form and long-form transcription performance.

**Short-form evaluations** are conducted on the four given datasets by first applying a filter to exclude samples longer than 30 seconds.
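
A rough sketch of that 30-second cutoff with the `datasets` library; the dataset and split below are purely illustrative:

```python
# Sketch: drop samples longer than 30 seconds before short-form evaluation.
# The dataset/split is illustrative; any of the four evaluation sets works the same way.
from datasets import load_dataset

ds = load_dataset("google/fleurs", "fr_fr", split="test")

def is_short_enough(example) -> bool:
    audio = example["audio"]
    return len(audio["array"]) / audio["sampling_rate"] <= 30.0

ds = ds.filter(is_short_enough)
print(len(ds))
```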

**Long-form evaluation** is conducted on a custom out-of-distribution [long-form test set](https://huggingface.co/datasets/eustlb/french-long-form-test).

### Short-Form

| Model Name | RTF | Common Voice 17 | Multilingual LibriSpeech | Voxpopuli | Fleurs |
| :----------------: | :-----: | :-------------: | :----------------------: | :-------: | :----: |
| distil-large-v3-fr | 310.127 | 12.681 | 5.865 | 10.851 | 7.984 |
| whisper-tiny | 280.576 | 56.757 | 37.512 | 32.505 | 46.173 |
| whisper-base | 261.235 | 42.447 | 25.2 | 26.434 | 27.851 |
| whisper-small | 249.676 | 22.469 | 14.097 | 14.61 | 14.283 |

@@ -585,24 +630,24 @@ The distilled model performs to within 1% WER of Whisper large-v3 on out-of-dist

### Long-Form

| Model Name | RTF | [long-form test set](https://huggingface.co/datasets/eustlb/french-long-form-test) |
| :----------------: | :-----: | :---------------------------------------------------------------------------------: |
| distil-large-v3-fr | 169.692 | 11.385 |
| whisper-tiny | 125.367 | 28.277 |
| whisper-base | 110.139 | 19.228 |
| whisper-small | 83.417 | 12.467 |
| whisper-medium | 56.677 | 10.772 |
| whisper-large-v3 | 41.805 | 9.073 |

### Inference speed

Reported latencies were benchmarked on a 40GB NVIDIA A100, generating 128 tokens with SDPA attention, bfloat16, 3 warmup steps, 5 measures, and one beam.
The benchmarking script can be found [here](https://gist.github.com/eustlb/ef06f00858cbae4d8743f5024be869ec). The time measured is the time to do one forward pass of the encoder and 128 autoregressive forward passes of the decoder.
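
The linked gist is the authoritative script; as orientation only, a timing loop along those lines might look like the minimal sketch below. The model id, dummy input shape, and generation flags are assumptions:

```python
# Sketch of the latency measurement described above: one encoder forward pass plus
# 128 autoregressive decoder steps, SDPA attention, bfloat16, greedy decoding (one beam).
# Not the linked benchmarking script; model id and input shape are assumptions.
import time
import torch
from transformers import AutoModelForSpeechSeq2Seq

model_id = "eustlb/distil-large-v3-fr"  # assumed repo id
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, attn_implementation="sdpa"
).to("cuda")

# Dummy log-mel features: large-v3-style checkpoints expect 128 mel bins x 3000 frames (30 s).
inputs = torch.randn(1, model.config.num_mel_bins, 3000, dtype=torch.bfloat16, device="cuda")

def timed_generate() -> float:
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(inputs, max_new_tokens=128, min_new_tokens=128, num_beams=1)
    torch.cuda.synchronize()
    return time.perf_counter() - start

for _ in range(3):  # warmup steps
    timed_generate()
latencies = [timed_generate() for _ in range(5)]  # measures
print(f"mean encoder+decoder latency: {sum(latencies) / len(latencies):.3f} s")
```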

<p align="center">
  <img src="https://huggingface.co/eustlb/whisper-large-v3-fr/resolve/main/assets/relative_latencies.png" alt="latencies" width="80%">
</p>

@@ -636,4 +681,4 @@ If you use this model, please consider citing the [Distil-Whisper paper](https:/

* [Joshua Lochner](https://huggingface.co/xenova) for the Transformers.js integration
* [Laurent Mazare](https://huggingface.co/lmz) for the Candle integration
* [Vaibhav Srivastav](https://huggingface.co/reach-vb) for Distil-Whisper distribution
* [Raghav Sonavane](https://huggingface.co/rsonavane/distil-whisper-large-v2-8-ls) for an early iteration of Distil-Whisper on the LibriSpeech datasets
|