	---
	license: apache-2.0
	language: fr
	library_name: transformers
	thumbnail: null
	tags:
	- automatic-speech-recognition
	- hf-asr-leaderboard
	- whisper-event
	datasets:
	- mozilla-foundation/common_voice_11_0
	metrics:
	- wer
	model-index:
	- name: Fine-tuned whisper-medium model for ASR in French
	  results:
	  - task:
	      name: Automatic Speech Recognition
	      type: automatic-speech-recognition
	    dataset:
	      name: Common Voice 11.0
	      type: mozilla-foundation/common_voice_11_0
	      config: fr
	      split: test
	      args: fr
	    metrics:
	    - name: WER (Greedy)
	      type: wer
	      value: 9.03
	    - name: WER (Beam 5)
	      type: wer
	      value: 8.54
	  - task:
	      name: Automatic Speech Recognition
	      type: automatic-speech-recognition
	    dataset:
	      name: Multilingual LibriSpeech (MLS)
	      type: facebook/multilingual_librispeech
	      config: french
	      split: test
	      args: french
	    metrics:
	    - name: WER (Greedy)
	      type: wer
	      value: 6.34
	    - name: WER (Beam 5)
	      type: wer
	      value: 5.86
	  - task:
	      name: Automatic Speech Recognition
	      type: automatic-speech-recognition
	    dataset:
	      name: VoxPopuli
	      type: facebook/voxpopuli
	      config: fr
	      split: test
	      args: fr
	    metrics:
	    - name: WER (Greedy)
	      type: wer
	      value: 11.64
	    - name: WER (Beam 5)
	      type: wer
	      value: 11.35
	  - task:
	      name: Automatic Speech Recognition
	      type: automatic-speech-recognition
	    dataset:
	      name: Fleurs
	      type: google/fleurs
	      config: fr_fr
	      split: test
	      args: fr_fr
	    metrics:
	    - name: WER (Greedy)
	      type: wer
	      value: 7.13
	    - name: WER (Beam 5)
	      type: wer
	      value: 6.85
	  - task:
	      name: Automatic Speech Recognition
	      type: automatic-speech-recognition
	    dataset:
	      name: African Accented French
	      type: gigant/african_accented_french
	      config: fr
	      split: test
	      args: fr
	    metrics:
	    - name: WER (Greedy)
	      type: wer
	      value: 8.88
	    - name: WER (Beam 5)
	      type: wer
	      value: 7.02
	---

	<style>
	img {
	display: inline;
	}
	</style>

	![Model architecture](https://img.shields.io/badge/Model_Architecture-seq2seq-lightgrey)
	![Model size](https://img.shields.io/badge/Params-769M-lightgrey)
	![Language](https://img.shields.io/badge/Language-French-lightgrey)

	# Fine-tuned whisper-medium model for ASR in French

	This model is a fine-tuned version of [openai/whisper-medium](https://huggingface.co/openai/whisper-medium), trained on the French (`fr`) subset of the mozilla-foundation/common_voice_11_0 dataset. When using the model, make sure that your speech input is also sampled at 16 kHz. The model also predicts casing and punctuation.
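
As a minimal sketch of the 16 kHz requirement, assuming the input is an uncompressed PCM WAV file, the sampling rate can be read from the file header with Python's standard `wave` module before the audio is passed to the model. The helper names `wav_sample_rate` and `needs_resampling` are hypothetical and are not part of this model or of 🤗 Transformers.

```python
import os
import tempfile
import wave

TARGET_SAMPLE_RATE = 16_000  # rate the model expects, as stated above


def wav_sample_rate(path: str) -> int:
    """Read the sampling rate from a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path: str, target_rate: int = TARGET_SAMPLE_RATE) -> bool:
    """Return True when the file must be resampled before transcription."""
    return wav_sample_rate(path) != target_rate


# Demo: write 0.1 s of 16-bit mono silence at 48 kHz, then check it.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.wav")
    with wave.open(demo_path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(48_000)
        wav_file.writeframes(b"\x00\x00" * 4_800)
    demo_needs_resampling = needs_resampling(demo_path)

print(demo_needs_resampling)
```

A file flagged this way would need to be resampled to 16 kHz before transcription, for instance with `torchaudio.transforms.Resample` as in the low-level example in the Usage section.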

	## Performance

	Below are the WERs of the pre-trained models on [Common Voice 9.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_9_0), [Multilingual LibriSpeech](https://huggingface.co/datasets/facebook/multilingual_librispeech), [VoxPopuli](https://github.com/facebookresearch/voxpopuli), and [Fleurs](https://huggingface.co/datasets/google/fleurs). These results are reported in the original [paper](https://cdn.openai.com/papers/whisper.pdf).

	| Model | Common Voice 9.0 | MLS | VoxPopuli | Fleurs |
	| --- | :---: | :---: | :---: | :---: |
	| [openai/whisper-small](https://huggingface.co/openai/whisper-small) | 22.7 | 16.2 | 15.7 | 15.0 |
	| [openai/whisper-medium](https://huggingface.co/openai/whisper-medium) | 16.0 | 8.9 | 12.2 | 8.7 |
	| [openai/whisper-large](https://huggingface.co/openai/whisper-large) | 14.7 | 8.9 | 11.0 | 7.7 |
	| [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) | 13.9 | 7.3 | 11.4 | 8.3 |

	Below are the WERs of the fine-tuned models on [Common Voice 11.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0), [Multilingual LibriSpeech](https://huggingface.co/datasets/facebook/multilingual_librispeech), [VoxPopuli](https://github.com/facebookresearch/voxpopuli), and [Fleurs](https://huggingface.co/datasets/google/fleurs). Note that these evaluation datasets have been filtered and preprocessed to contain only French alphabet characters, with all punctuation removed except apostrophes. The results in the table are reported as `WER (greedy search) / WER (beam search with beam width 5)`.

	| Model | Common Voice 11.0 | MLS | VoxPopuli | Fleurs |
	| --- | :---: | :---: | :---: | :---: |
	| [bofenghuang/whisper-small-cv11-french](https://huggingface.co/bofenghuang/whisper-small-cv11-french) | 11.76 / 10.99 | 9.65 / 8.91 | 14.45 / 13.66 | 10.76 / 9.83 |
	| [bofenghuang/whisper-medium-cv11-french](https://huggingface.co/bofenghuang/whisper-medium-cv11-french) | 9.03 / 8.54 | 6.34 / 5.86 | 11.64 / 11.35 | 7.13 / 6.85 |
	| [bofenghuang/whisper-medium-french](https://huggingface.co/bofenghuang/whisper-medium-french) | 9.03 / 8.73 | 4.60 / 4.44 | 9.53 / 9.46 | 6.33 / 5.94 |
	| [bofenghuang/whisper-large-v2-cv11-french](https://huggingface.co/bofenghuang/whisper-large-v2-cv11-french) | 8.05 / 7.67 | 5.56 / 5.28 | 11.50 / 10.69 | 5.42 / 5.05 |
	| [bofenghuang/whisper-large-v2-french](https://huggingface.co/bofenghuang/whisper-large-v2-french) | 8.15 / 7.83 | 4.20 / 4.03 | 9.10 / 8.66 | 5.22 / 4.98 |
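
The preprocessing and scoring described above can be sketched in plain Python. This is an illustration under stated assumptions, not the evaluation script used to produce the tables: the exact character set, the lowercasing step, and the treatment of hyphens as word separators are all assumptions, and the function names `normalize_french` and `word_error_rate` are hypothetical.

```python
import re


def normalize_french(text: str) -> str:
    """Lowercase, keep French letters and apostrophes, collapse whitespace."""
    # Assumption: lowercase the text and map typographic apostrophes to ASCII.
    text = text.lower().replace("’", "'")
    # Assumption: this set approximates "French alphabet characters".
    # Everything else, hyphens included, becomes a space.
    text = re.sub(r"[^a-zàâäçéèêëîïôöùûüÿœæ' ]", " ", text)
    return " ".join(text.split())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] = edit distance between the ref words seen so far and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


reference = normalize_french("Bonjour, comment allez-vous aujourd'hui ?")
hypothesis = normalize_french("bonjour comment allez vous aujourd'hui")
wer_percent = 100 * word_error_rate(reference, hypothesis)
print(f"WER: {wer_percent:.2f}")
```

Because the model predicts casing and punctuation, applying the same normalization to both the reference and the prediction keeps formatting differences from being counted as word errors.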

	## Usage

	Inference with 🤗 Pipeline

	```python
	import torch

	from datasets import load_dataset
	from transformers import pipeline

	device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

	# Load pipeline
	pipe = pipeline("automatic-speech-recognition", model="bofenghuang/whisper-medium-cv11-french", device=device)

	# NB: set forced_decoder_ids for generation utils
	pipe.model.config.forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(language="fr", task="transcribe")

	# Load data
	ds_mcv_test = load_dataset("mozilla-foundation/common_voice_11_0", "fr", split="test", streaming=True)
	test_segment = next(iter(ds_mcv_test))
	waveform = test_segment["audio"]

	# Run
	generated_sentences = pipe(waveform, max_new_tokens=225)["text"] # greedy
	# generated_sentences = pipe(waveform, max_new_tokens=225, generate_kwargs={"num_beams": 5})["text"] # beam search

	# Normalise predicted sentences if necessary
	```

	Inference with 🤗 low-level APIs

	```python
	import torch
	import torchaudio

	from datasets import load_dataset
	from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

	device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

	# Load model
	model = AutoModelForSpeechSeq2Seq.from_pretrained("bofenghuang/whisper-medium-cv11-french").to(device)
	processor = AutoProcessor.from_pretrained("bofenghuang/whisper-medium-cv11-french", language="french", task="transcribe")

	# NB: set forced_decoder_ids for generation utils
	model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="fr", task="transcribe")

	# 16_000
	model_sample_rate = processor.feature_extractor.sampling_rate

	# Load data
	ds_mcv_test = load_dataset("mozilla-foundation/common_voice_11_0", "fr", split="test", streaming=True)
	test_segment = next(iter(ds_mcv_test))
	waveform = torch.from_numpy(test_segment["audio"]["array"]).float()
	sample_rate = test_segment["audio"]["sampling_rate"]

	# Resample
	if sample_rate != model_sample_rate:
	    resampler = torchaudio.transforms.Resample(sample_rate, model_sample_rate)
	    waveform = resampler(waveform)

	# Get feat
	inputs = processor(waveform, sampling_rate=model_sample_rate, return_tensors="pt")
	input_features = inputs.input_features
	input_features = input_features.to(device)

	# Generate
	generated_ids = model.generate(inputs=input_features, max_new_tokens=225) # greedy
	# generated_ids = model.generate(inputs=input_features, max_new_tokens=225, num_beams=5) # beam search

	# Detokenize
	generated_sentences = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

	# Normalise predicted sentences if necessary
	```