|
--- |
|
license: apache-2.0 |
|
base_model: openai/whisper-medium |
|
tags: |
|
- generated_from_trainer |
|
metrics: |
|
- bleu |
|
model-index: |
|
- name: whisper-medium-english-2-wolof |
|
results: [] |
|
datasets: |
|
- bilalfaye/english-wolof-french-dataset |
|
language: |
|
- en |
|
- wo |
|
pipeline_tag: automatic-speech-recognition |
|
--- |
|
|
|
|
|
|
# whisper-medium-english-2-wolof |
|
|
|
This model is a fine-tuned version of [openai/whisper-medium](https://huggingface.co/openai/whisper-medium) on the [bilalfaye/english-wolof-french-dataset](https://huggingface.co/datasets/bilalfaye/english-wolof-french-dataset) dataset. It is designed to translate English audio into Wolof text; since the base Whisper model does not natively support Wolof, this fine-tuned version bridges that gap.
|
It achieves the following results on the evaluation set: |
|
|
|
- Loss: 1.1668 |
|
- Bleu: 34.6061 |
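
For a quick smoke test, the model can be called through the `pipeline` API; fuller examples, including manual preprocessing, are given in the Inference section below. The audio path here is a hypothetical placeholder:

```python
from transformers import pipeline

# Load the fine-tuned checkpoint as a speech pipeline
pipe = pipeline("automatic-speech-recognition", model="bilalfaye/whisper-medium-english-2-wolof")

# "english_clip.wav" is a placeholder; substitute any English audio file
print(pipe("english_clip.wav")["text"])
```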
|
|
|
## Model Description |
|
|
|
The model is based on OpenAI's Whisper architecture, fine-tuned to recognize and translate English speech to Wolof. It leverages the "medium" variant, offering a balance between accuracy and computational efficiency. |
|
|
|
## Intended Uses & Limitations |
|
|
|
**Intended uses:** |
|
- Automatic translation of English audio into Wolof text.
|
- Assisting researchers and language learners working with English audio content. |
|
|
|
**Limitations:** |
|
- May struggle with heavy accents or noisy environments. |
|
- Performance may vary depending on speaker pronunciation and recording quality. |
|
|
|
## Training and Evaluation Data |
|
|
|
The model was fine-tuned on the [bilalfaye/english-wolof-french-dataset](https://huggingface.co/datasets/bilalfaye/english-wolof-french-dataset), which consists of English audio paired with Wolof translations. |
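
As a rough sketch, the dataset can be streamed and inspected before training; the `en` and `en_audio` field names below come from the inference example later in this card, and any other fields should be checked against the dataset itself:

```python
from datasets import load_dataset

# Stream the dataset so it is not downloaded in full
dataset = load_dataset("bilalfaye/english-wolof-french-dataset", split="train", streaming=True)

sample = next(iter(dataset))
print(sample.keys())                                 # inspect the available fields
print(sample["en"])                                  # English source sentence
print(sample["en_audio"]["audio"]["sampling_rate"])  # sampling rate of the English audio
```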
|
|
|
## Training Procedure |
|
|
|
### Training Hyperparameters
|
|
|
The following hyperparameters were used during training: |
|
- learning_rate: 1e-05 |
|
- train_batch_size: 32 |
|
- eval_batch_size: 16 |
|
- seed: 42 |
|
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08 |
|
- lr_scheduler_type: linear |
|
- lr_scheduler_warmup_steps: 500 |
|
- training_steps: 20000 |
|
- mixed_precision_training: Native AMP |
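
As a reference point, these values map onto `Seq2SeqTrainingArguments` from `transformers` roughly as sketched below; the `output_dir` and any settings not listed above are illustrative assumptions, not the exact configuration used:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-medium-english-2-wolof",  # hypothetical output directory
    learning_rate=1e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=16,
    seed=42,
    lr_scheduler_type="linear",
    warmup_steps=500,
    max_steps=20000,
    fp16=True,  # Native AMP mixed-precision training
)
```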
|
|
|
### Training Results
|
|
|
| Training Loss | Epoch | Step | Validation Loss | Bleu | |
|
|:-------------:|:------:|:-----:|:---------------:|:-------:| |
|
| 0.9771 | 0.8941 | 2000 | 0.9736 | 22.8506 | |
|
| 0.6832 | 1.7881 | 4000 | 0.8379 | 30.0113 | |
|
| 0.4568 | 2.6822 | 6000 | 0.8083 | 33.4759 | |
|
| 0.2623 | 3.5762 | 8000 | 0.8506 | 33.4723 | |
|
| 0.1608 | 4.4703 | 10000 | 0.9128 | 33.6342 | |
|
| 0.0758 | 5.3643 | 12000 | 0.9808 | 33.7770 | |
|
| 0.0315 | 6.2584 | 14000 | 1.0546 | 34.0842 | |
|
| 0.0133 | 7.1524 | 16000 | 1.1085 | 34.2531 | |
|
| 0.0057 | 8.0465 | 18000 | 1.1455 | 34.5325 | |
|
| 0.0046 | 8.9405 | 20000 | 1.1668 | 34.6061 | |
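
The BLEU column is computed on decoded predictions against the reference Wolof text. Below is a minimal sketch of such a metric function, using the `evaluate` library (an assumption; the exact evaluation code is not part of this card):

```python
import evaluate

bleu = evaluate.load("sacrebleu")

def compute_bleu(pred_texts, ref_texts):
    # sacrebleu expects one list of references per prediction
    result = bleu.compute(predictions=pred_texts, references=[[ref] for ref in ref_texts])
    return {"bleu": result["score"]}
```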
|
|
|
|
|
### Framework Versions
|
|
|
- Transformers 4.41.2 |
|
- Pytorch 2.4.0+cu121 |
|
- Datasets 3.2.0 |
|
- Tokenizers 0.19.1 |
|
|
|
## Inference |
|
|
|
### Using Python Code |
|
|
|
```python |
|
# pip install transformers datasets torch
|
|
|
import torch |
|
from transformers import WhisperForConditionalGeneration, WhisperProcessor |
|
from datasets import load_dataset |
|
|
|
# Load model and processor |
|
device = "cuda:0" if torch.cuda.is_available() else "cpu" |
|
model = WhisperForConditionalGeneration.from_pretrained("bilalfaye/whisper-medium-english-2-wolof").to(device) |
|
processor = WhisperProcessor.from_pretrained("bilalfaye/whisper-medium-english-2-wolof") |
|
|
|
# Load dataset |
|
streaming_dataset = load_dataset("bilalfaye/english-wolof-french-dataset", split="train", streaming=True) |
|
iterator = iter(streaming_dataset)
for _ in range(3):  # skip ahead to the third sample for illustration
    sample = next(iterator)
|
|
|
|
|
# Preprocess audio |
|
input_features = processor(
    sample["en_audio"]["audio"]["array"],
    sampling_rate=sample["en_audio"]["audio"]["sampling_rate"],
    return_tensors="pt",
).input_features.to(device)
|
|
|
# Generate the Wolof translation
|
predicted_ids = model.generate(input_features) |
|
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True) |
|
|
|
print("Correct sentence:", sample["en"]) |
|
print("Transcription:", transcription[0]) |
|
``` |
|
|
|
### Using a Gradio Interface
|
|
|
```python |
|
# pip install gradio transformers torch torchaudio
|
|
|
import torch
import torchaudio
import numpy as np
import gradio as gr
from transformers import pipeline
|
|
|
|
|
# Load model pipeline |
|
device = "cuda:0" if torch.cuda.is_available() else "cpu" |
|
pipe = pipeline(task="automatic-speech-recognition", model="bilalfaye/whisper-medium-english-2-wolof", device=device) |
|
|
|
# Function for transcription
def transcribe(audio):
    if audio is None:
        return "No audio provided. Please try again."

    if isinstance(audio, str):  # File upload: Gradio passes a file path
        waveform, sample_rate = torchaudio.load(audio)
    elif isinstance(audio, tuple):  # Microphone case (Gradio may return a (file, sample_rate) tuple)
        waveform, sample_rate = torchaudio.load(audio[0])
    else:
        return "Invalid audio input format."

    # Downmix multi-channel audio to mono
    if waveform.shape[0] > 1:
        mono_audio = waveform.mean(dim=0, keepdim=True)
    else:
        mono_audio = waveform

    # Resample to the 16 kHz rate Whisper expects
    target_sample_rate = 16000
    if sample_rate != target_sample_rate:
        resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=target_sample_rate)
        mono_audio = resampler(mono_audio)
        sample_rate = target_sample_rate

    mono_audio = mono_audio.squeeze(0).numpy().astype(np.float32)

    result = pipe({"array": mono_audio, "sampling_rate": sample_rate})
    return result["text"]
|
|
|
|
|
# Create Gradio interfaces |
|
interface = gr.Interface(
    fn=transcribe,
    inputs=gr.Audio(sources=["upload", "microphone"], type="filepath"),
    outputs="text",
    title="Whisper Medium English-to-Wolof Translation",
    description="Record audio in English and translate it to Wolof using a fine-tuned Whisper medium model.",
)
|
|
|
|
|
app = gr.TabbedInterface(
    [interface],
    ["Use Uploaded File or Microphone"]
)
|
|
|
app.launch(debug=True, share=True) |
|
``` |
|
|
|
**Author** |
|
- Bilal FAYE |