{
"cells": [
{
"cell_type": "markdown",
"id": "75b58048-7d14-4fc6-8085-1fc08c81b4a6",
"metadata": {
"id": "75b58048-7d14-4fc6-8085-1fc08c81b4a6"
},
"source": [
"# Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers"
]
},
{
"cell_type": "markdown",
"id": "fbfa8ad5-4cdc-4512-9058-836cbbf65e1a",
"metadata": {
"id": "fbfa8ad5-4cdc-4512-9058-836cbbf65e1a"
},
"source": [
"In this Colab, we present a step-by-step guide on how to fine-tune Whisper \n",
"for any multilingual ASR dataset using Hugging Face 🤗 Transformers. This is a \n",
"more \"hands-on\" version of the accompanying [blog post](https://huggingface.co/blog/fine-tune-whisper). \n",
"For a more in-depth explanation of Whisper, the Common Voice dataset and the theory behind fine-tuning, the reader is advised to refer to the blog post."
]
},
{
"cell_type": "markdown",
"id": "afe0d503-ae4e-4aa7-9af4-dbcba52db41e",
"metadata": {
"id": "afe0d503-ae4e-4aa7-9af4-dbcba52db41e"
},
"source": [
"## Introduction"
]
},
{
"cell_type": "markdown",
"id": "9ae91ed4-9c3e-4ade-938e-f4c2dcfbfdc0",
"metadata": {
"id": "9ae91ed4-9c3e-4ade-938e-f4c2dcfbfdc0"
},
"source": [
"Whisper is a pre-trained model for automatic speech recognition (ASR) \n",
"published in [September 2022](https://openai.com/blog/whisper/) by the authors \n",
"Alec Radford et al. from OpenAI. Unlike many of its predecessors, such as \n",
"[Wav2Vec 2.0](https://arxiv.org/abs/2006.11477), which are pre-trained \n",
"on un-labelled audio data, Whisper is pre-trained on a vast quantity of \n",
"**labelled** audio-transcription data, 680,000 hours to be precise. \n",
"This is an order of magnitude more data than the un-labelled audio data used \n",
"to train Wav2Vec 2.0 (60,000 hours). What is more, 117,000 hours of this \n",
"pre-training data is multilingual ASR data. This results in checkpoints \n",
"that can be applied to over 96 languages, many of which are considered \n",
"_low-resource_.\n",
"\n",
"When scaled to 680,000 hours of labelled pre-training data, Whisper models \n",
"demonstrate a strong ability to generalise to many datasets and domains.\n",
"The pre-trained checkpoints achieve results competitive with state-of-the-art \n",
"ASR systems, with a word error rate (WER) of around 3% on the test-clean subset of \n",
"LibriSpeech ASR and a new state-of-the-art on TED-LIUM with 4.7% WER (_cf._ \n",
"Table 8 of the [Whisper paper](https://cdn.openai.com/papers/whisper.pdf)).\n",
"The extensive multilingual ASR knowledge acquired by Whisper during pre-training \n",
"can be leveraged for other low-resource languages; through fine-tuning, the \n",
"pre-trained checkpoints can be adapted for specific datasets and languages \n",
"to further improve upon these results. We'll show just how Whisper can be fine-tuned \n",
"for low-resource languages in this Colab."
]
},
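{
"cell_type": "markdown",
"id": "wer-definition-note",
"metadata": {
"id": "wer-definition-note"
},
"source": [
"For reference, the word error rate (WER) quoted above is conventionally defined as the word-level edit distance between the predicted and reference transcriptions, normalised by the reference length. The symbols below are the usual textbook notation, not names taken from this notebook:\n",
"\n",
"$$\n",
"\\mathrm{WER} = \\frac{S + D + I}{N}\n",
"$$\n",
"\n",
"where $S$, $D$ and $I$ are the numbers of substituted, deleted and inserted words, and $N$ is the number of words in the reference. As a small made-up illustration, a 20-word reference transcribed with one substitution and one deletion gives $(1 + 1 + 0) / 20 = 0.10$, i.e. 10% WER."
]
},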
{
"cell_type": "markdown",
"id": "e59b91d6-be24-4b5e-bb38-4977ea143a72",
"metadata": {
"id": "e59b91d6-be24-4b5e-bb38-4977ea143a72"
},
"source": [
"\n",
"| Step | Training Loss | Validation Loss | WER |\n",
"|---|---|---|---|\n",
"| 4000 | 0.147600 | 0.322550 | 44.976586 |"
],
"text/plain": [
"