|
--- |
|
license: apache-2.0 |
|
base_model: openai/whisper-medium |
|
tags: |
|
- generated_from_trainer |
|
metrics: |
|
- bleu |
|
model-index: |
|
- name: whisper-medium-english-2-wolof |
|
results: [] |
|
datasets: |
|
- bilalfaye/english-wolof-french-dataset |
|
language: |
|
- en |
|
- wo |
|
pipeline_tag: automatic-speech-recognition |
|
--- |
|
|
|
|
|
|
# whisper-medium-english-2-wolof |
|
|
|
This model is a fine-tuned version of [openai/whisper-medium](https://huggingface.co/openai/whisper-medium) on the [bilalfaye/english-wolof-french-dataset](https://huggingface.co/datasets/bilalfaye/english-wolof-french-dataset) dataset. It is designed to translate English audio into Wolof text; since the base Whisper model does not natively support Wolof, this fine-tuned version bridges that gap.
|
It achieves the following results on the evaluation set: |
|
|
|
- Loss: 1.1668 |
|
- Bleu: 34.6061 |
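
For a quick smoke test, the model can be called through the `pipeline` API; fuller examples, including manual preprocessing, are given in the Inference section below. The audio path here is a hypothetical placeholder:

```python
from transformers import pipeline

# Load the fine-tuned checkpoint as a speech pipeline
pipe = pipeline("automatic-speech-recognition", model="bilalfaye/whisper-medium-english-2-wolof")

# "english_clip.wav" is a placeholder; substitute any English audio file
print(pipe("english_clip.wav")["text"])
```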
|
|
|
## Model Description |
|
|
|
The model is based on OpenAI's Whisper architecture, fine-tuned to recognize and translate English speech to Wolof. It leverages the "medium" variant, offering a balance between accuracy and computational efficiency. |
|
|
|
## Intended Uses & Limitations |
|
|
|
**Intended uses:** |
|
- Automatic translation of English audio into Wolof text.
|
- Assisting researchers and language learners working with English audio content. |
|
|
|
**Limitations:** |
|
- May struggle with heavy accents or noisy environments. |
|
- Performance may vary depending on speaker pronunciation and recording quality. |
|
|
|
## Training and Evaluation Data |
|
|
|
The model was fine-tuned on the [bilalfaye/english-wolof-french-dataset](https://huggingface.co/datasets/bilalfaye/english-wolof-french-dataset), which consists of English audio paired with Wolof translations. |
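
As a rough sketch, the dataset can be streamed and inspected before training; the `en` and `en_audio` field names below come from the inference example later in this card, and any other fields should be checked against the dataset itself:

```python
from datasets import load_dataset

# Stream the dataset so it is not downloaded in full
dataset = load_dataset("bilalfaye/english-wolof-french-dataset", split="train", streaming=True)

sample = next(iter(dataset))
print(sample.keys())                                 # inspect the available fields
print(sample["en"])                                  # English source sentence
print(sample["en_audio"]["audio"]["sampling_rate"])  # sampling rate of the English audio
```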
|
|
|
## Training Procedure |
|
|
|
### Training Hyperparameters
|
|
|
The following hyperparameters were used during training: |
|
- learning_rate: 1e-05 |
|
- train_batch_size: 32 |
|
- eval_batch_size: 16 |
|
- seed: 42 |
|
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08 |
|
- lr_scheduler_type: linear |
|
- lr_scheduler_warmup_steps: 500 |
|
- training_steps: 20000 |
|
- mixed_precision_training: Native AMP |
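
As a reference point, these values map onto `Seq2SeqTrainingArguments` from `transformers` roughly as sketched below; the `output_dir` and any settings not listed above are illustrative assumptions, not the exact configuration used:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-medium-english-2-wolof",  # hypothetical output directory
    learning_rate=1e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=16,
    seed=42,
    lr_scheduler_type="linear",
    warmup_steps=500,
    max_steps=20000,
    fp16=True,  # Native AMP mixed-precision training
)
```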
|
|
|
### Training Results
|
|
|
| Training Loss | Epoch | Step | Validation Loss | Bleu | |
|
|:-------------:|:------:|:-----:|:---------------:|:-------:| |
|
| 0.9771 | 0.8941 | 2000 | 0.9736 | 22.8506 | |
|
| 0.6832 | 1.7881 | 4000 | 0.8379 | 30.0113 | |
|
| 0.4568 | 2.6822 | 6000 | 0.8083 | 33.4759 | |
|
| 0.2623 | 3.5762 | 8000 | 0.8506 | 33.4723 | |
|
| 0.1608 | 4.4703 | 10000 | 0.9128 | 33.6342 | |
|
| 0.0758 | 5.3643 | 12000 | 0.9808 | 33.7770 | |
|
| 0.0315 | 6.2584 | 14000 | 1.0546 | 34.0842 | |
|
| 0.0133 | 7.1524 | 16000 | 1.1085 | 34.2531 | |
|
| 0.0057 | 8.0465 | 18000 | 1.1455 | 34.5325 | |
|
| 0.0046 | 8.9405 | 20000 | 1.1668 | 34.6061 | |
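
The BLEU column is computed on decoded predictions against the reference Wolof text. Below is a minimal sketch of such a metric function, using the `evaluate` library (an assumption; the exact evaluation code is not part of this card):

```python
import evaluate

bleu = evaluate.load("sacrebleu")

def compute_bleu(pred_texts, ref_texts):
    # sacrebleu expects one list of references per prediction
    result = bleu.compute(predictions=pred_texts, references=[[ref] for ref in ref_texts])
    return {"bleu": result["score"]}
```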
|
|
|
|
|
### Framework Versions
|
|
|
- Transformers 4.41.2 |
|
- Pytorch 2.4.0+cu121 |
|
- Datasets 3.2.0 |
|
- Tokenizers 0.19.1 |
|
|
|
## Inference |
|
|
|
### Using Python Code |
|
|
|
```python |
|
# pip install transformers datasets torch
|
|
|
import torch |
|
from transformers import WhisperForConditionalGeneration, WhisperProcessor |
|
from datasets import load_dataset |
|
|
|
# Load model and processor |
|
device = "cuda:0" if torch.cuda.is_available() else "cpu" |
|
model = WhisperForConditionalGeneration.from_pretrained("bilalfaye/whisper-medium-english-2-wolof").to(device) |
|
processor = WhisperProcessor.from_pretrained("bilalfaye/whisper-medium-english-2-wolof") |
|
|
|
# Load dataset |
|
streaming_dataset = load_dataset("bilalfaye/english-wolof-french-dataset", split="train", streaming=True) |
|
iterator = iter(streaming_dataset)
for _ in range(3):  # skip ahead to the third sample for illustration
    sample = next(iterator)
|
|
|
|
|
# Preprocess audio |
|
input_features = processor(
    sample["en_audio"]["audio"]["array"],
    sampling_rate=sample["en_audio"]["audio"]["sampling_rate"],
    return_tensors="pt",
).input_features.to(device)
|
|
|
# Generate the Wolof translation
|
predicted_ids = model.generate(input_features) |
|
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True) |
|
|
|
print("Correct sentence:", sample["en"]) |
|
print("Transcription:", transcription[0]) |
|
``` |
|
|
|
### Using a Gradio Interface
|
|
|
```python |
|
# pip install gradio transformers torch torchaudio
|
|
|
import torch
import torchaudio
import numpy as np
import gradio as gr
from transformers import pipeline
|
|
|
|
|
# Load model pipeline |
|
device = "cuda:0" if torch.cuda.is_available() else "cpu" |
|
pipe = pipeline(task="automatic-speech-recognition", model="bilalfaye/whisper-medium-english-2-wolof", device=device) |
|
|
|
# Function for transcription
def transcribe(audio):
    if audio is None:
        return "No audio provided. Please try again."

    if isinstance(audio, str):  # File upload: Gradio passes a file path
        waveform, sample_rate = torchaudio.load(audio)
    elif isinstance(audio, tuple):  # Microphone case (Gradio may return a (file, sample_rate) tuple)
        waveform, sample_rate = torchaudio.load(audio[0])
    else:
        return "Invalid audio input format."

    # Downmix multi-channel audio to mono
    if waveform.shape[0] > 1:
        mono_audio = waveform.mean(dim=0, keepdim=True)
    else:
        mono_audio = waveform

    # Resample to the 16 kHz rate Whisper expects
    target_sample_rate = 16000
    if sample_rate != target_sample_rate:
        resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=target_sample_rate)
        mono_audio = resampler(mono_audio)
        sample_rate = target_sample_rate

    mono_audio = mono_audio.squeeze(0).numpy().astype(np.float32)

    result = pipe({"array": mono_audio, "sampling_rate": sample_rate})
    return result["text"]
|
|
|
|
|
# Create Gradio interfaces |
|
interface = gr.Interface(
    fn=transcribe,
    inputs=gr.Audio(sources=["upload", "microphone"], type="filepath"),
    outputs="text",
    title="Whisper Medium English-to-Wolof Translation",
    description="Record audio in English and translate it to Wolof using a fine-tuned Whisper medium model.",
)
|
|
|
|
|
app = gr.TabbedInterface(
    [interface],
    ["Use Uploaded File or Microphone"]
)
|
|
|
app.launch(debug=True, share=True) |
|
``` |
|
|
|
**Author** |
|
- Bilal FAYE |