trip-fontaine committed
Commit 4a16b87
1 Parent(s): 425ee2e
readme update
README.md CHANGED
@@ -557,7 +557,7 @@ const output = await transcriber(url);
 Refer to the Transformers.js [docs](https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.AutomaticSpeechRecognitionPipeline)
 for further information.
 
-##
+## Model Details
 
 ### Architecture
 
@@ -565,7 +565,9 @@ Distil-Whisper inherits the encoder-decoder architecture from Whisper. The encod
 
 To distill the Whisper model, we reduce the number of decoder layers while keeping the encoder fixed. The encoder (shown in green) is entirely copied from the teacher to the student and frozen during training. The student's decoder structure is copied from whisper-large-v3, with the only difference being a reduction from 32 to 2 decoder layers. These layers are initialized from distil-large-v3 to leverage language transfer from English to French (more details [here](https://github.com/huggingface/distil-whisper/tree/main/training#22-language-transfer)).
 
-###
+### Training
+
+#### Data
 
 distil-large-v3-fr is trained on 4,515 hours of audio data from three open-source, permissively licensed speech datasets on the
 Hugging Face Hub:
@@ -582,7 +584,7 @@ The audio data is then pseudo-labelled using the Whisper large-v3 model: we use
 the audio in our training set and use these as the target labels during training. Using pseudo-labels ensures that the
 transcriptions are consistently formatted across datasets and provides a sequence-level distillation signal during training.
 
-
+#### WER Filter
 
 The Whisper pseudo-label predictions are subject to mis-transcriptions and hallucinations. To ensure we only train on
 accurate pseudo-labels, we employ a simple WER heuristic during training. First, we normalise the Whisper pseudo-labels
@@ -596,7 +598,7 @@ Section 9.2 of the [Distil-Whisper paper](https://arxiv.org/abs/2311.00430) demo
 for improving downstream performance of the distilled model. We also partially attribute Distil-Whisper's robustness to
 hallucinations to this filter.
 
-
+#### Training procedure
 
 The model was trained for 18,000 optimisation steps (or 14 epochs) with batch size 256. We saved the best model, based on the global WER score on the validation splits, reached after 14,000 optimisation steps (or 11 epochs). See the [Distil-Whisper paper](https://arxiv.org/abs/2311.00430) for more details (training objective, etc.).
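Illustrative note on the Architecture section touched above: the layer reduction it describes (32 to 2 decoder layers, encoder unchanged and frozen) can be checked directly from the published model configs. The sketch below is an assumption-laden aid, not part of the model card: it uses the teacher (openai/whisper-large-v3) and the English distil-large-v3 checkpoint that the text says the decoder is initialised from; the repo id of the French checkpoint itself is not shown in this diff, so substitute it to inspect distil-large-v3-fr directly.

```python
from transformers import AutoConfig

# Teacher and the English student used to initialise the French decoder (see the paragraph above).
teacher = AutoConfig.from_pretrained("openai/whisper-large-v3")
student = AutoConfig.from_pretrained("distil-whisper/distil-large-v3")  # substitute the distil-large-v3-fr repo id to inspect this model

print("encoder layers:", teacher.encoder_layers, "->", student.encoder_layers)  # 32 -> 32 (copied and frozen)
print("decoder layers:", teacher.decoder_layers, "->", student.decoder_layers)  # 32 -> 2
```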
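Illustrative note on the Data section: a minimal sketch of the pseudo-labelling step it describes, i.e. transcribing the training audio with Whisper large-v3 and keeping the predictions as target labels. The dataset id, column access, and generation options below are assumptions for illustration only; the diff does not show the three training datasets or the exact decoding setup used for distil-large-v3-fr.

```python
from datasets import load_dataset
from transformers import pipeline

# Teacher model used for pseudo-labelling, as stated in the Data section.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")

# Illustrative French corpus -- a stand-in, not necessarily one of the three training datasets.
ds = load_dataset("facebook/multilingual_librispeech", "french", split="train", streaming=True)

def pseudo_label(example):
    # Transcribe the utterance with the teacher; its prediction becomes the training target.
    audio = example["audio"]
    pred = asr(
        {"array": audio["array"], "sampling_rate": audio["sampling_rate"]},
        generate_kwargs={"language": "fr", "task": "transcribe"},
    )
    example["pseudo_label"] = pred["text"]
    return example

ds = ds.map(pseudo_label)
next(iter(ds))  # lazily pseudo-labels the first utterance
```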
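Illustrative note on the WER Filter section: a minimal sketch of the filtering heuristic, following the description in the [Distil-Whisper paper](https://arxiv.org/abs/2311.00430) of comparing normalised pseudo-labels against the normalised ground-truth transcriptions and discarding examples whose WER is too high. The crude normaliser, the jiwer dependency, and the 10% threshold are illustrative stand-ins, not the exact setup used in training.

```python
import re
import string

import jiwer  # pip install jiwer

def normalise(text: str) -> str:
    """Crude normalisation: lowercase, strip punctuation, collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def keep_example(ground_truth: str, pseudo_label: str, max_wer: float = 0.10) -> bool:
    """Keep a training example only if the normalised pseudo-label stays close to the
    normalised ground-truth transcription (WER at or below the threshold)."""
    ref, hyp = normalise(ground_truth), normalise(pseudo_label)
    if not ref:  # avoid division by zero on empty references
        return False
    return jiwer.wer(ref, hyp) <= max_wer

# A faithful pseudo-label passes; an obvious hallucination is filtered out.
print(keep_example("bonjour tout le monde", "Bonjour tout le monde !"))         # True
print(keep_example("bonjour tout le monde", "merci d'avoir regarde la video"))  # False
```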