---
license: apache-2.0
language: fr
library_name: transformers
thumbnail: null
tags:
- automatic-speech-recognition
- hf-asr-leaderboard
- whisper-event
datasets:
- mozilla-foundation/common_voice_11_0
metrics:
- wer
model-index:
- name: Fine-tuned whisper-medium model for ASR in French
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 11.0
      type: mozilla-foundation/common_voice_11_0
      config: fr
      split: test
      args: fr
    metrics:
    - name: WER (Greedy)
      type: wer
      value: 9.03
    - name: WER (Beam 5)
      type: wer
      value: 8.54
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Multilingual LibriSpeech (MLS)
      type: facebook/multilingual_librispeech
      config: french
      split: test
      args: french
    metrics:
    - name: WER (Greedy)
      type: wer
      value: 6.34
    - name: WER (Beam 5)
      type: wer
      value: 5.86
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: VoxPopuli
      type: facebook/voxpopuli
      config: fr
      split: test
      args: fr
    metrics:
    - name: WER (Greedy)
      type: wer
      value: 11.64
    - name: WER (Beam 5)
      type: wer
      value: 11.35
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Fleurs
      type: google/fleurs
      config: fr_fr
      split: test
      args: fr_fr
    metrics:
    - name: WER (Greedy)
      type: wer
      value: 7.13
    - name: WER (Beam 5)
      type: wer
      value: 6.85
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: African Accented French
      type: gigant/african_accented_french
      config: fr
      split: test
      args: fr
    metrics:
    - name: WER (Greedy)
      type: wer
      value: 8.88
    - name: WER (Beam 5)
      type: wer
      value: 7.02
---

<style>
img {
 display: inline;
}
</style>

![Model architecture](https://img.shields.io/badge/Model_Architecture-seq2seq-lightgrey)
![Model size](https://img.shields.io/badge/Params-769M-lightgrey)
![Language](https://img.shields.io/badge/Language-French-lightgrey)

# Fine-tuned whisper-medium model for ASR in French

This model is a fine-tuned version of [openai/whisper-medium](https://huggingface.co/openai/whisper-medium), trained on the mozilla-foundation/common_voice_11_0 fr dataset. When using the model, make sure that your speech input is also sampled at 16 kHz. **This model also predicts casing and punctuation.**

## Performance

*Below are the WERs of the pre-trained models on the [Common Voice 9.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_9_0), [Multilingual LibriSpeech](https://huggingface.co/datasets/facebook/multilingual_librispeech), [VoxPopuli](https://github.com/facebookresearch/voxpopuli), and [Fleurs](https://huggingface.co/datasets/google/fleurs) datasets. These results are reported in the original [paper](https://cdn.openai.com/papers/whisper.pdf).*

| Model | Common Voice 9.0 | MLS | VoxPopuli | Fleurs |
| --- | :---: | :---: | :---: | :---: |
| [openai/whisper-small](https://huggingface.co/openai/whisper-small) | 22.7 | 16.2 | 15.7 | 15.0 |
| [openai/whisper-medium](https://huggingface.co/openai/whisper-medium) | 16.0 | 8.9 | 12.2 | 8.7 |
| [openai/whisper-large](https://huggingface.co/openai/whisper-large) | 14.7 | 8.9 | **11.0** | **7.7** |
| [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) | **13.9** | **7.3** | 11.4 | 8.3 |

*Below are the WERs of the fine-tuned models on the [Common Voice 11.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0), [Multilingual LibriSpeech](https://huggingface.co/datasets/facebook/multilingual_librispeech), [VoxPopuli](https://github.com/facebookresearch/voxpopuli), and [Fleurs](https://huggingface.co/datasets/google/fleurs) datasets. Note that these evaluation datasets have been filtered and preprocessed to contain only French alphabet characters, with all punctuation removed except apostrophes. The results in the table are reported as `WER (greedy search) / WER (beam search with beam width 5)`.*

| Model | Common Voice 11.0 | MLS | VoxPopuli | Fleurs |
| --- | :---: | :---: | :---: | :---: |
| [bofenghuang/whisper-small-cv11-french](https://huggingface.co/bofenghuang/whisper-small-cv11-french) | 11.76 / 10.99 | 9.65 / 8.91 | 14.45 / 13.66 | 10.76 / 9.83 |
| [bofenghuang/whisper-medium-cv11-french](https://huggingface.co/bofenghuang/whisper-medium-cv11-french) | 9.03 / 8.54 | 6.34 / 5.86 | 11.64 / 11.35 | 7.13 / 6.85 |
| [bofenghuang/whisper-medium-french](https://huggingface.co/bofenghuang/whisper-medium-french) | 9.03 / 8.73 | 4.60 / 4.44 | 9.53 / 9.46 | 6.33 / 5.94 |
| [bofenghuang/whisper-large-v2-cv11-french](https://huggingface.co/bofenghuang/whisper-large-v2-cv11-french) | **8.05** / **7.67** | 5.56 / 5.28 | 11.50 / 10.69 | 5.42 / 5.05 |
| [bofenghuang/whisper-large-v2-french](https://huggingface.co/bofenghuang/whisper-large-v2-french) | 8.15 / 7.83 | **4.20** / **4.03** | **9.10** / **8.66** | **5.22** / **4.98** |

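
The filtering described above can be approximated with a small text normaliser. This is a minimal sketch, not the script used to produce the numbers in the table: the function name `normalize_french`, the exact set of accented letters, the lowercasing, and the choice to replace hyphens and digits with spaces are all assumptions made for illustration.

```python
import re
import unicodedata

# Assumption: "French alphabet characters" means a-z plus these accented
# letters and ligatures. The set used for the reported scores may differ.
FRENCH_LETTERS = "a-zàâäçéèêëîïôöùûüÿœæ"

# Anything that is not an allowed letter, an apostrophe, or a space.
_DISALLOWED = re.compile(rf"[^{FRENCH_LETTERS}' ]")
_WHITESPACE = re.compile(r"\s+")


def normalize_french(text: str) -> str:
    """Lowercase, keep French letters and apostrophes, drop everything else."""
    # Compose accents so that "e" + combining acute becomes a single "é".
    text = unicodedata.normalize("NFC", text).lower()
    # Map the typographic apostrophe to the ASCII one before filtering.
    text = text.replace("\u2019", "'")
    # Replace disallowed characters (punctuation, digits, hyphens) with spaces.
    text = _DISALLOWED.sub(" ", text)
    # Collapse the runs of spaces left behind by the substitution.
    return _WHITESPACE.sub(" ", text).strip()


print(normalize_french("Bonjour, l\u2019été est là !"))
```

Applying the same function to both the reference transcript and the model output keeps the comparison fair, since this model predicts casing and punctuation that a normalised reference no longer contains.
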
## Usage

Inference with 🤗 Pipeline

```python
import torch

from datasets import load_dataset
from transformers import pipeline

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Load pipeline
pipe = pipeline("automatic-speech-recognition", model="bofenghuang/whisper-medium-cv11-french", device=device)

# NB: set forced_decoder_ids so that generation is forced to transcribe in French
pipe.model.config.forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(language="fr", task="transcribe")

# Load data: stream a single sample from the Common Voice 11.0 French test split
ds_mcv_test = load_dataset("mozilla-foundation/common_voice_11_0", "fr", split="test", streaming=True)
test_segment = next(iter(ds_mcv_test))
waveform = test_segment["audio"]

# Run
generated_sentences = pipe(waveform, max_new_tokens=225)["text"]  # greedy
# generated_sentences = pipe(waveform, max_new_tokens=225, generate_kwargs={"num_beams": 5})["text"]  # beam search

# Normalise predicted sentences if necessary
```

Inference with 🤗 low-level APIs

```python
import torch
import torchaudio

from datasets import load_dataset
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Load model
model = AutoModelForSpeechSeq2Seq.from_pretrained("bofenghuang/whisper-medium-cv11-french").to(device)
processor = AutoProcessor.from_pretrained("bofenghuang/whisper-medium-cv11-french", language="french", task="transcribe")

# NB: set forced_decoder_ids so that generation is forced to transcribe in French
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="fr", task="transcribe")

# Sampling rate expected by the model (16_000 Hz)
model_sample_rate = processor.feature_extractor.sampling_rate

# Load data: stream a single sample from the Common Voice 11.0 French test split
ds_mcv_test = load_dataset("mozilla-foundation/common_voice_11_0", "fr", split="test", streaming=True)
test_segment = next(iter(ds_mcv_test))
# Cast to float32 so the dtype matches the resampling kernel
waveform = torch.from_numpy(test_segment["audio"]["array"]).float()
sample_rate = test_segment["audio"]["sampling_rate"]

# Resample to the model's sampling rate if needed
if sample_rate != model_sample_rate:
    resampler = torchaudio.transforms.Resample(sample_rate, model_sample_rate)
    waveform = resampler(waveform)

# Extract input features
inputs = processor(waveform.numpy(), sampling_rate=model_sample_rate, return_tensors="pt")
input_features = inputs.input_features
input_features = input_features.to(device)

# Generate
generated_ids = model.generate(inputs=input_features, max_new_tokens=225)  # greedy
# generated_ids = model.generate(inputs=input_features, max_new_tokens=225, num_beams=5)  # beam search

# Detokenize
generated_sentences = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

# Normalise predicted sentences if necessary
```
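
Both snippets end by leaving normalisation and scoring to the reader. To illustrate how a WER figure is obtained once the reference and the prediction have been normalised in the same way, here is a dependency-free sketch. `word_error_rate` is a hypothetical helper, not part of `transformers`, and this card does not state which tool produced its scores; in practice, libraries such as `jiwer` or 🤗 `evaluate` are commonly used for this.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j] is the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row

    return previous_row[-1] / len(ref_words)


# Illustrative strings only, not taken from any evaluation set.
reference = "le chat dort sur le canapé"
hypothesis = "le chien dort sur canapé"
print(f"WER: {100 * word_error_rate(reference, hypothesis):.2f}%")
```

The function returns a fraction; multiply by 100 to compare against the percentages reported in the tables above. Corpus-level WER is usually computed by summing edit counts and reference lengths over all utterances rather than averaging per-utterance scores.
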