Automatic Speech Recognition · Transformers · Safetensors · French · whisper · asr · Eval Results · Inference Endpoints
trip-fontaine committed on
Commit 4a16b87
1 Parent(s): 425ee2e

readme update

Files changed (1):
README.md +6 -4
README.md CHANGED
@@ -557,7 +557,7 @@ const output = await transcriber(url);
 Refer to the Transformers.js [docs](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.AutomaticSpeechRecognitionPipeline)
 for further information.
 
-## Training
+## Model Details
 
 ### Architecture
 
@@ -565,7 +565,9 @@ Distil-Whisper inherits the encoder-decoder architecture from Whisper. The encod
 
 To distill the Whisper model, we reduce the number of decoder layers while keeping the encoder fixed. The encoder (shown in green) is entirely copied from the teacher to the student and frozen during training. The student's decoder structure is copied from whisper-large-v3, with the only difference being a reduction from 32 to 2 decoder layers. These layers are initialized from distil-large-v3 to leverage language transfer from English to French (more details [here](https://github.com/huggingface/distil-whisper/tree/main/training#22-language-transfer)).
 
-### Data
+### Training
+
+#### Data
 
 distil-large-v3-fr is trained on 4,515 hours of audio data from three open-source, permissively licensed speech datasets on the
 Hugging Face Hub:
@@ -582,7 +584,7 @@ The audio data is then pseudo-labelled using the Whisper large-v3 model: we use
 the audio in our training set and use these as the target labels during training. Using pseudo-labels ensures that the
 transcriptions are consistently formatted across datasets and provides sequence-level distillation signal during training.
 
-### WER Filter
+#### WER Filter
 
 The Whisper pseudo-label predictions are subject to mis-transcriptions and hallucinations. To ensure we only train on
 accurate pseudo-labels, we employ a simple WER heuristic during training. First, we normalise the Whisper pseudo-labels
@@ -596,7 +598,7 @@ Section 9.2 of the [Distil-Whisper paper](https://arxiv.org/abs/2311.00430) demo
 for improving downstream performance of the distilled model. We also partially attribute Distil-Whisper's robustness to
 hallucinations to this filter.
 
-### Training procedure
+#### Training procedure
 
 The model was trained for 18,000 optimisation steps (or 14 epochs) with batch size 256. We saved the best model, based on the global WER score on the validation splits, reached after 14,000 optimisation steps (or 11 epochs). See the [Distil-Whisper paper](https://arxiv.org/abs/2311.00430) for more details (training objective, etc.).
 
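
The "Architecture" paragraph in the diff above describes the distillation setup in prose. As a rough illustration, here is a minimal sketch (not the official distil-whisper student-creation script) of what that setup looks like with the `transformers` API: build a 2-layer student from the whisper-large-v3 config, copy and freeze the teacher encoder, and initialise the decoder from distil-large-v3. The model identifiers are the public Hub checkpoints named in the README; everything else is an assumption.

```python
# Illustrative sketch only -- not the official distil-whisper student-creation script.
from transformers import WhisperConfig, WhisperForConditionalGeneration

# Teacher (source of the frozen encoder) and the English distilled model used
# to initialise the 2-layer French student decoder.
teacher = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")
init_model = WhisperForConditionalGeneration.from_pretrained("distil-whisper/distil-large-v3")

# Student config: identical to whisper-large-v3, except for 2 decoder layers.
config = WhisperConfig.from_pretrained("openai/whisper-large-v3")
config.decoder_layers = 2
student = WhisperForConditionalGeneration(config)

# Encoder is copied verbatim from the teacher and frozen during training.
student.model.encoder.load_state_dict(teacher.model.encoder.state_dict())
for param in student.model.encoder.parameters():
    param.requires_grad = False

# Decoder (token embeddings + 2 layers) is initialised from distil-large-v3,
# transferring its English distillation to the French student.
student.model.decoder.load_state_dict(init_model.model.decoder.state_dict())
```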
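The pseudo-labelling step from the "Data" section can be pictured with a short sketch like the one below: transcribe the training audio with whisper-large-v3 and keep the predictions as target labels. Only the use of whisper-large-v3 as the labeller comes from the text; the dataset identifier, column names, and generation settings are placeholders.

```python
# Hedged sketch of pseudo-labelling: transcribe training audio with
# whisper-large-v3 and store the predictions as the training targets.
from datasets import Audio, load_dataset
from transformers import pipeline

labeller = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    generate_kwargs={"language": "french", "task": "transcribe"},
)

# Placeholder dataset/column names -- substitute the actual training corpora.
dataset = load_dataset("my-org/my-french-speech-corpus", split="train")
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))

def add_pseudo_label(example):
    audio = example["audio"]
    prediction = labeller({"array": audio["array"], "sampling_rate": audio["sampling_rate"]})
    example["pseudo_label"] = prediction["text"]  # used as the target label during training
    return example

dataset = dataset.map(add_pseudo_label)
```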
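The WER filter is the most algorithmic part of the section, so here is a self-contained sketch of the heuristic as described: normalise the ground-truth and pseudo-labelled transcriptions, compute the WER between them, and discard the example if it exceeds a threshold. The helper name and the 10% threshold are illustrative, not taken from the actual training configuration.

```python
# Hedged sketch of the WER filtering heuristic; the 10% threshold is illustrative.
import evaluate
from transformers.models.whisper.english_normalizer import BasicTextNormalizer

wer_metric = evaluate.load("wer")
normalizer = BasicTextNormalizer()  # language-agnostic Whisper text normaliser

def keep_example(ground_truth: str, pseudo_label: str, threshold: float = 10.0) -> bool:
    """Return True if the normalised pseudo-label is close enough to the
    normalised ground-truth transcription to be kept for training."""
    reference = normalizer(ground_truth)
    prediction = normalizer(pseudo_label)
    if not reference or not prediction:
        return False  # drop empty transcriptions rather than train on them
    wer = 100 * wer_metric.compute(references=[reference], predictions=[prediction])
    return wer < threshold
```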
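Finally, the "Training procedure" paragraph boils down to a schedule and a WER-based checkpoint selection. The snippet below only mirrors those settings with standard `Seq2SeqTrainingArguments`; the per-device batch size, accumulation, and evaluation interval are assumptions, and the real training uses the distil-whisper distillation objective and scripts rather than a plain `Seq2SeqTrainer` fine-tune.

```python
# Illustrative schedule/checkpoint-selection settings only; not the actual
# distillation script, which uses a knowledge-distillation objective.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./distil-large-v3-fr",
    max_steps=18_000,                # 18k optimisation steps (~14 epochs)
    per_device_train_batch_size=32,  # assumed per-device size for a global batch of 256
    gradient_accumulation_steps=1,   # assuming 8 devices; adjust to reach 256 overall
    eval_strategy="steps",
    eval_steps=1_000,                # assumed evaluation interval
    save_steps=1_000,
    load_best_model_at_end=True,     # keep the checkpoint with the best...
    metric_for_best_model="wer",     # ...validation WER (best found at 14k steps)
    greater_is_better=False,
    predict_with_generate=True,
)
```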