Tags: Automatic Speech Recognition · Transformers · Safetensors · French · whisper · asr · Eval Results · Inference Endpoints
trip-fontaine committed
Commit 3da752d
1 Parent(s): 5265a3b

readme update

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -79,7 +79,7 @@ Distil-Whisper for English Automatic Speech Recognition (ASR) was proposed in th
 
 This is the knowledge distilled version of OpenAI's [Whisper large-v3](https://huggingface.co/openai/whisper-large-v3) for French ASR.
 
-The result is a distilled model that performs within **2% WER of whisper-large-v3** on out-of-distribution evaluation sets for both short-form and long-form transcription. Moreover, it is **5.9x** faster than whisper-large-v3 and **1.3** times faster than the tiniest version of Whisper, while being incomparably more accurate.
+The result is a distilled model that performs within **2% WER of [Whisper large-v3](https://huggingface.co/openai/whisper-large-v3)** on out-of-distribution evaluation sets for both short-form and long-form transcription. Moreover, it is **5.9x** faster than [Whisper large-v3](https://huggingface.co/openai/whisper-large-v3) and **1.3** times faster than the tiniest version of Whisper, while being incomparably more accurate.
 
 | Model | Params (M) | Rel. Latency | Short-Form WER | Long-Form WER |
 | :--------------------- | :--------: | :----------: | :------------: | :-----------: |
@@ -563,7 +563,7 @@ for further information.
 
 Distil-Whisper inherits the encoder-decoder architecture from Whisper. The encoder maps a sequence of speech vector inputs to a sequence of hidden-state vectors. The decoder auto-regressively predicts text tokens, conditional on all previous tokens and the encoder hidden-states. Consequently, the encoder is only run forward once, whereas the decoder is run as many times as the number of tokens generated. In practice, this means the decoder accounts for over 90% of total inference time. Thus, to optimise for latency, the focus is on minimising the inference time of the decoder.
 
-To distill the Whisper model, we reduce the number of decoder layers while keeping the encoder fixed. The encoder (shown in green) is entirely copied from the teacher to the student and frozen during training. The student's decoder structure is copied from whisper-large-v3, with the only difference being a reduction from 32 to 2 decoder layers. These layers are initialized from distil-large-v3 to leverage language transfer from English to French (more details [here](https://github.com/huggingface/distil-whisper/tree/main/training#22-language-transfer)).
+To distill the Whisper model, we reduce the number of decoder layers while keeping the encoder fixed. The encoder (shown in green) is entirely copied from the teacher to the student and frozen during training. The student's decoder structure is copied from [Whisper large-v3](https://huggingface.co/openai/whisper-large-v3), with the only difference being a reduction from 32 to 2 decoder layers. These layers are initialized from distil-large-v3 to leverage language transfer from English to French (more details [here](https://github.com/huggingface/distil-whisper/tree/main/training#22-language-transfer)).
 
 ### Training
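
To make the distillation paragraph in the second hunk more concrete, the sketch below shows how such a student could be assembled with the `transformers` Whisper classes: the encoder is copied from `openai/whisper-large-v3` and frozen, while the 2-layer decoder is initialised from `distil-whisper/distil-large-v3`. This is an illustrative outline only, not the actual distil-whisper training script (that lives in the repository linked in the paragraph); the module paths (`model.encoder`, `model.decoder`) follow the current `transformers` Whisper implementation.

```python
# Sketch of the student initialisation described in the README diff above.
# Illustrative only -- the real training code is in the distil-whisper repo.
from transformers import WhisperConfig, WhisperForConditionalGeneration

# Teacher: 32 encoder layers, 32 decoder layers.
teacher = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")
# English distilled model whose 2-layer decoder seeds the French student.
distil_en = WhisperForConditionalGeneration.from_pretrained("distil-whisper/distil-large-v3")

# Student config: identical to the teacher except the decoder is cut to 2 layers.
student_config = WhisperConfig.from_pretrained("openai/whisper-large-v3", decoder_layers=2)
student = WhisperForConditionalGeneration(student_config)

# Encoder: copied verbatim from the teacher and frozen during training.
student.model.encoder.load_state_dict(teacher.model.encoder.state_dict())
for param in student.model.encoder.parameters():
    param.requires_grad = False

# Decoder: the 2 layers (and embeddings) are initialised from distil-large-v3
# to leverage English -> French language transfer.
student.model.decoder.load_state_dict(distil_en.model.decoder.state_dict())

print(student.config.encoder_layers, student.config.decoder_layers)  # 32 2
```

Because the frozen encoder runs only once per input while the 2-layer decoder handles every generated token, this decoder reduction is where the **5.9x** latency gain quoted in the first hunk comes from.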