trip-fontaine committed
Commit 4a16b87
1 Parent(s): 425ee2e
readme update
README.md CHANGED
@@ -557,7 +557,7 @@ const output = await transcriber(url);
 Refer to the Transformers.js [docs](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.AutomaticSpeechRecognitionPipeline)
 for further information.
 
-##
+## Model Details
 
 ### Architecture
 
@@ -565,7 +565,9 @@ Distil-Whisper inherits the encoder-decoder architecture from Whisper. The encod
 
 To distill the Whisper model, we reduce the number of decoder layers while keeping the encoder fixed. The encoder (shown in green) is entirely copied from the teacher to the student and frozen during training. The student's decoder structure is copied from whisper-large-v3, with the only difference being a reduction from 32 to 2 decoder layers. These layers are initialized from distil-large-v3 to leverage language transfer from English to French (more details [here](https://github.com/huggingface/distil-whisper/tree/main/training#22-language-transfer)).
 
-###
+### Training
+
+#### Data
 
 distil-large-v3-fr is trained on 4,515 hours of audio data from three open-source, permissively licensed speech datasets on the
 Hugging Face Hub:
@@ -582,7 +584,7 @@ The audio data is then pseudo-labelled using the Whisper large-v3 model: we use
 the audio in our training set and use these as the target labels during training. Using pseudo-labels ensures that the
 transcriptions are consistently formatted across datasets and provides a sequence-level distillation signal during training.
 
-
+#### WER Filter
 
 The Whisper pseudo-label predictions are subject to mis-transcriptions and hallucinations. To ensure we only train on
 accurate pseudo-labels, we employ a simple WER heuristic during training. First, we normalise the Whisper pseudo-labels
@@ -596,7 +598,7 @@ Section 9.2 of the [Distil-Whisper paper](https://arxiv.org/abs/2311.00430) demo
 for improving downstream performance of the distilled model. We also partially attribute Distil-Whisper's robustness to
 hallucinations to this filter.
 
-
+#### Training procedure
 
 The model was trained for 18,000 optimisation steps (or 14 epochs) with batch size 256. We saved the best model, based on the global WER score on the validation splits, reached after 14,000 optimisation steps (or 11 epochs). See the [Distil-Whisper paper](https://arxiv.org/abs/2311.00430) for more details (training objective, etc.).
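Illustrative note on the Architecture section touched above: the layer reduction it describes (32 to 2 decoder layers, encoder unchanged and frozen) can be checked directly from the published model configs. The sketch below is an assumption-laden aid, not part of the model card: it uses the teacher (openai/whisper-large-v3) and the English distil-large-v3 checkpoint that the text says the decoder is initialised from; the repo id of the French checkpoint itself is not shown in this diff, so substitute it to inspect distil-large-v3-fr directly.

```python
from transformers import AutoConfig

# Teacher and the English student used to initialise the French decoder (see the paragraph above).
teacher = AutoConfig.from_pretrained("openai/whisper-large-v3")
student = AutoConfig.from_pretrained("distil-whisper/distil-large-v3")  # substitute the distil-large-v3-fr repo id to inspect this model

print("encoder layers:", teacher.encoder_layers, "->", student.encoder_layers)  # 32 -> 32 (copied and frozen)
print("decoder layers:", teacher.decoder_layers, "->", student.decoder_layers)  # 32 -> 2
```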
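Illustrative note on the Data section: a minimal sketch of the pseudo-labelling step it describes, i.e. transcribing the training audio with Whisper large-v3 and keeping the predictions as target labels. The dataset id, column access, and generation options below are assumptions for illustration only; the diff does not show the three training datasets or the exact decoding setup used for distil-large-v3-fr.

```python
from datasets import load_dataset
from transformers import pipeline

# Teacher model used for pseudo-labelling, as stated in the Data section.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")

# Illustrative French corpus -- a stand-in, not necessarily one of the three training datasets.
ds = load_dataset("facebook/multilingual_librispeech", "french", split="train", streaming=True)

def pseudo_label(example):
    # Transcribe the utterance with the teacher; its prediction becomes the training target.
    audio = example["audio"]
    pred = asr(
        {"array": audio["array"], "sampling_rate": audio["sampling_rate"]},
        generate_kwargs={"language": "fr", "task": "transcribe"},
    )
    example["pseudo_label"] = pred["text"]
    return example

ds = ds.map(pseudo_label)
next(iter(ds))  # lazily pseudo-labels the first utterance
```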
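Illustrative note on the WER Filter section: a minimal sketch of the filtering heuristic, following the description in the [Distil-Whisper paper](https://arxiv.org/abs/2311.00430) of comparing normalised pseudo-labels against the normalised ground-truth transcriptions and discarding examples whose WER is too high. The crude normaliser, the jiwer dependency, and the 10% threshold are illustrative stand-ins, not the exact setup used in training.

```python
import re
import string

import jiwer  # pip install jiwer

def normalise(text: str) -> str:
    """Crude normalisation: lowercase, strip punctuation, collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def keep_example(ground_truth: str, pseudo_label: str, max_wer: float = 0.10) -> bool:
    """Keep a training example only if the normalised pseudo-label stays close to the
    normalised ground-truth transcription (WER at or below the threshold)."""
    ref, hyp = normalise(ground_truth), normalise(pseudo_label)
    if not ref:  # avoid division by zero on empty references
        return False
    return jiwer.wer(ref, hyp) <= max_wer

# A faithful pseudo-label passes; an obvious hallucination is filtered out.
print(keep_example("bonjour tout le monde", "Bonjour tout le monde !"))         # True
print(keep_example("bonjour tout le monde", "merci d'avoir regarde la video"))  # False
```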