{ "cells": [ { "cell_type": "markdown", "id": "UyjAb5ynxdt6", "metadata": { "id": "UyjAb5ynxdt6" }, "source": [ "# Usage Guide for Fine-Tuned Whisper ASR Model on Chichewa\n", "\n", "This notebook provides a step-by-step guide to using a fine-tuned Whisper model for Automatic Speech Recognition (ASR) in **Chichewa**. The model has been fine-tuned on a Chichewa dataset consisting of approximately **24 hours of speech data**, and it has undergone several iterations, resulting in multiple versions with differing levels of performance. This guide directs you to the **best-performing model** to ensure optimal transcription accuracy.\n", "\n", "### Key Highlights:\n", "1. **Fine-Tuned on Chichewa**: The model has been specifically trained on 24 hours of Chichewa speech, making it well-suited for transcription tasks in this language. The fine-tuning process has significantly improved its performance for Chichewa ASR tasks.\n", " \n", "2. **No Tokenizer Required**: Unlike other Whisper models, this fine-tuned version does not use a tokenizer. Instead, the model works directly with the processor, simplifying the inference process.\n", "\n", "3. **Best Performance via Commit Hash**: To ensure you are using the best-performing model, you will load the model by passing the **`revision` parameter** along with the exact **commit hash** corresponding to the version that achieved the lowest Word Error Rate (WER). This ensures you are working with the most accurate version of the model.\n", "\n", "4. **Best Performance**: This notebook demonstrates how to load and use the model version with the best WER for Chichewa ASR tasks. Following this guide will allow you to achieve the highest transcription quality.\n", "\n", "### What You’ll Learn:\n", "- How to load and use the fine-tuned Whisper model specifically for Chichewa ASR.\n", "- How to process and transcribe audio files data using this model.\n", "- How to ensure you are using the version of the model that delivers the best transcription performance for Chichewa by utilizing the `revision` parameter with the commit hash.\n", "\n", "By following this guide, you’ll be equipped to leverage this specialized ASR model to produce high-quality transcriptions of Chichewa speech with minimal setup and effort.\n" ] }, { "cell_type": "code", "execution_count": 1, "id": "bf03a6b8-ad07-4116-b436-479042a84014", "metadata": { "id": "bf03a6b8-ad07-4116-b436-479042a84014" }, "outputs": [], "source": [ "import librosa\n", "from datasets import Audio, load_dataset, DatasetDict, load_from_disk\n", "\n", "import torch\n", "\n", "from transformers import (WhisperFeatureExtractor, WhisperTokenizer, \n", " WhisperProcessor, logging)\n", "from transformers import WhisperForConditionalGeneration\n", "from transformers import (pipeline, AutoModel, AutoTokenizer, \n", "AutoProcessor, AutoModelForSpeechSeq2Seq)\n", "\n", "# Suppress Warnings\n", "import warnings\n", "# Suppress all warnings\n", "warnings.filterwarnings('ignore')\n", "\n", "# Suppress Hugging Face logs except errors\n", "logging.set_verbosity_error()\n" ] }, { "cell_type": "markdown", "id": "6c767de0", "metadata": {}, "source": [ "## 1. 
Best-Performing Model and WER Information\n", "\n", "In this section, we provide the commit hash for the model version that achieved the lowest Word Error Rate (WER) during evaluation.\n", "\n", "### Best-Performing Model:\n", "\n", "The model version with the lowest WER can be loaded using the following **commit hash**:\n", "\n", "```python\n", "# Best-performing model hash\n", "HUGGINGFACE_MODEL_ID = \"dmatekenya/whisper-large-chichewa\"\n", "BEST_MODEL_COMMIT_HASH = \"bff60fb08ba9f294e05bfcab4306f30b6a0cfc0a\"\n", "```\n", "\n", "### Note/Warning\n", "While the model endpoint remains the same (i.e., dmatekenya/whisper-large-chichewa), it is crucial to pass the specific commit hash (`BEST_MODEL_COMMIT_HASH`) provided above when loading the model so that you get the best-performing version. You can also browse this commit directly via its [full URL](https://huggingface.co/dmatekenya/whisper-large-v3-chichewa/commit/bff60fb08ba9f294e05bfcab4306f30b6a0cfc0a).\n", "Without the commit hash, the latest version may be loaded, which could have a higher Word Error Rate (WER) than the version evaluated as best. To ensure you get the most accurate results, always include the commit hash.\n",
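"\n", "### Example: Loading the Pinned Revision\n", "The snippet below is only a minimal sketch of how the `revision` argument can be passed to `from_pretrained` to pin the processor and model to this commit; it mirrors the constants shown above, and the notebook's own loading and transcription workflow follows in Section 2.\n", "\n", "```python\n", "from transformers import WhisperForConditionalGeneration, WhisperProcessor\n", "\n", "# Constants from the block above\n", "HUGGINGFACE_MODEL_ID = \"dmatekenya/whisper-large-chichewa\"\n", "BEST_MODEL_COMMIT_HASH = \"bff60fb08ba9f294e05bfcab4306f30b6a0cfc0a\"\n", "\n", "# revision pins the download to the evaluated commit rather than the latest version\n", "processor = WhisperProcessor.from_pretrained(\n", "    HUGGINGFACE_MODEL_ID, revision=BEST_MODEL_COMMIT_HASH\n", ")\n", "model = WhisperForConditionalGeneration.from_pretrained(\n", "    HUGGINGFACE_MODEL_ID, revision=BEST_MODEL_COMMIT_HASH\n", ")\n", "```\n" ] }, { "cell_type": "markdown", "id": "9a1ce662", "metadata": {}, "source": [ "## 2. Performing Speech-to-Text Inference on Audio Files \n", "In this section, I will demonstrate how to transcribe an individual audio file directly using the fine-tuned Whisper model, instead of loading a dataset through the HuggingFace `datasets` package. This approach is particularly useful when deploying the model within an application." ] }, { "cell_type": "code", "execution_count": 2, "id": "0bb57bec", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "99c5da0bca8e48569a1c505ce7435505", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Loading checkpoint shards: 0%| | 0/2 [00:00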