---
license: apache-2.0
language: fr
library_name: transformers
thumbnail: null
tags:
- automatic-speech-recognition
- hf-asr-leaderboard
- whisper-event
datasets:
- mozilla-foundation/common_voice_11_0
metrics:
- wer
model-index:
- name: Fine-tuned whisper-large-v2 model for ASR in French
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 11.0
      type: mozilla-foundation/common_voice_11_0
      config: fr
      split: test
      args: fr
    metrics:
    - name: WER (Greedy)
      type: wer
      value: 8.05
    - name: WER (Beam 5)
      type: wer
      value: 7.67
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Multilingual LibriSpeech (MLS)
      type: facebook/multilingual_librispeech
      config: french
      split: test
      args: french
    metrics:
    - name: WER (Greedy)
      type: wer
      value: 5.56
    - name: WER (Beam 5)
      type: wer
      value: 5.28
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: VoxPopuli
      type: facebook/voxpopuli
      config: fr
      split: test
      args: fr
    metrics:
    - name: WER (Greedy)
      type: wer
      value: 11.50
    - name: WER (Beam 5)
      type: wer
      value: 10.69
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Fleurs
      type: google/fleurs
      config: fr_fr
      split: test
      args: fr_fr
    metrics:
    - name: WER (Greedy)
      type: wer
      value: 5.42
    - name: WER (Beam 5)
      type: wer
      value: 5.05
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: African Accented French
      type: gigant/african_accented_french
      config: fr
      split: test
      args: fr
    metrics:
    - name: WER (Greedy)
      type: wer
      value: 6.47
    - name: WER (Beam 5)
      type: wer
      value: 5.95
---

![Model architecture](https://img.shields.io/badge/Model_Architecture-seq2seq-lightgrey)
![Model size](https://img.shields.io/badge/Params-1550M-lightgrey)
![Language](https://img.shields.io/badge/Language-French-lightgrey)

# Fine-tuned whisper-large-v2 model for ASR in French

This model is a fine-tuned version of [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2), trained on the French (`fr`) subset of the mozilla-foundation/common_voice_11_0 dataset. When using the model, make sure that your speech input is also sampled at 16 kHz. **This model also predicts casing and punctuation.**

## Usage

Inference with 🤗 Pipeline

```python
import torch

from datasets import load_dataset
from transformers import pipeline

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Load pipeline
pipe = pipeline("automatic-speech-recognition", model="bofenghuang/whisper-large-v2-cv11-french", device=device)

# NB: set forced_decoder_ids for generation utils
pipe.model.config.forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(language="fr", task="transcribe")

# Load data
ds_mcv_test = load_dataset("mozilla-foundation/common_voice_11_0", "fr", split="test", streaming=True)
test_segment = next(iter(ds_mcv_test))
waveform = test_segment["audio"]

# Run
generated_sentences = pipe(waveform, max_new_tokens=225)["text"]  # greedy
# generated_sentences = pipe(waveform, max_new_tokens=225, generate_kwargs={"num_beams": 5})["text"]  # beam search

# Normalise predicted sentences if necessary
```

Inference with 🤗 low-level APIs

```python
import torch
import torchaudio

from datasets import load_dataset
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Load model
model = AutoModelForSpeechSeq2Seq.from_pretrained("bofenghuang/whisper-large-v2-cv11-french").to(device)
processor = AutoProcessor.from_pretrained("bofenghuang/whisper-large-v2-cv11-french", language="french", task="transcribe")

# NB: set forced_decoder_ids for generation utils
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="fr", task="transcribe")

# 16_000
model_sample_rate = processor.feature_extractor.sampling_rate

# Load data
ds_mcv_test = load_dataset("mozilla-foundation/common_voice_11_0", "fr", split="test", streaming=True)
test_segment = next(iter(ds_mcv_test))
waveform = torch.from_numpy(test_segment["audio"]["array"])
sample_rate = test_segment["audio"]["sampling_rate"]

# Resample to the model's sampling rate if needed
if sample_rate != model_sample_rate:
    resampler = torchaudio.transforms.Resample(sample_rate, model_sample_rate)
    waveform = resampler(waveform)

# Extract input features
inputs = processor(waveform, sampling_rate=model_sample_rate, return_tensors="pt")
input_features = inputs.input_features
input_features = input_features.to(device)

# Generate
generated_ids = model.generate(inputs=input_features, max_new_tokens=225)  # greedy
# generated_ids = model.generate(inputs=input_features, max_new_tokens=225, num_beams=5)  # beam search

# Detokenize
generated_sentences = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

# Normalise predicted sentences if necessary
```
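
Both snippets end with a reminder to normalise the predicted sentences if necessary. Because the model predicts casing and punctuation, this mainly matters when comparing its output against lowercased, unpunctuated references, for example when computing WER. The block below is a minimal sketch of that step under stated assumptions: `normalize_text` and `word_error_rate` are hypothetical helpers written for illustration, and the rules used here (lowercasing, stripping punctuation except apostrophes and hyphens, collapsing whitespace) are an assumption, not necessarily the normalisation behind the scores reported for this model.

```python
import re


def normalize_text(text: str) -> str:
    """Illustrative normaliser (assumed rules, not the model author's)."""
    text = text.lower()
    # Map the typographic apostrophe to the ASCII one so "l’eau" == "l'eau".
    text = text.replace("\u2019", "'")
    # Replace anything that is not a letter/digit, whitespace, apostrophe or hyphen.
    text = re.sub(r"[^\w\s'-]", " ", text)
    # Collapse runs of whitespace and trim both ends.
    return " ".join(text.split())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,  # deletion
                    curr[j - 1] + 1,  # insertion
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref_words)


# Usage: normalise both sides before scoring.
reference = normalize_text("Bonjour, le monde !")
hypothesis = normalize_text("bonjour le monde")
print(word_error_rate(reference, hypothesis))
```

In practice, `generated_sentences` from either snippet would take the place of the hypothesis string. For scores comparable to published ones, use the same normaliser as the benchmark you are comparing against.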