---
language:
- es
license: apache-2.0
tags:
- whisper-event
- generated_from_trainer
datasets:
- mozilla-foundation/common_voice_11_0
metrics:
- wer
- cer
base_model: openai/whisper-large-v2
model-index:
- name: Whisper Large Spanish
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: mozilla-foundation/common_voice_11_0 es
      type: mozilla-foundation/common_voice_11_0
      config: es
      split: test
      args: es
    metrics:
    - type: wer
      value: 4.673613637544826
      name: WER
    - type: cer
      value: 1.5573247819517182
      name: CER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: google/fleurs es_419
      type: google/fleurs
      config: es_419
      split: test
      args: es_419
    metrics:
    - type: wer
      value: 5.396216546072705
      name: WER
    - type: cer
      value: 3.450427960057061
      name: CER
---
# Whisper Large Spanish
This model is a fine-tuned version of [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) on Spanish using the train split of [Common Voice 11](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0).
## Usage
```python
from transformers import pipeline

# Load the fine-tuned checkpoint into an ASR pipeline.
transcriber = pipeline(
    "automatic-speech-recognition",
    model="jonatasgrosman/whisper-large-es-cv11"
)

# Force the decoder to transcribe in Spanish.
transcriber.model.config.forced_decoder_ids = (
    transcriber.tokenizer.get_decoder_prompt_ids(
        language="es",
        task="transcribe"
    )
)

transcription = transcriber("path/to/my_audio.wav")
```
## Evaluation
I evaluated the model on the test splits of two datasets: [Common Voice 11](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) (the same dataset used for fine-tuning) and [Fleurs](https://huggingface.co/datasets/google/fleurs) (not seen during fine-tuning). Because Whisper can transcribe casing and punctuation, I evaluated the model in two scenarios: one using the raw text and one using normalized text (lowercased, with punctuation removed).

For Fleurs, I also evaluated the model on only those samples whose transcriptions contain no numerical values. Fleurs writes numerical values differently from Common Voice, the dataset used for fine-tuning, so this difference is expected to hurt the model's performance on that type of transcription in Fleurs.
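The scoring scenarios described above can be sketched in plain Python. This is only an illustration, not the evaluation script used for this card: the `normalize`, `edit_distance`, `wer`, `cer`, and `has_digits` helpers are hypothetical names, the normalization is an assumed reading of "lowercase + removal of punctuation", and detecting numeric samples by looking for digits is likewise an assumption.

```python
import unicodedata


def normalize(text: str) -> str:
    """Lowercase and strip punctuation (assumed approximation of the card's normalization)."""
    text = unicodedata.normalize("NFC", text).lower()
    # Drop characters in any Unicode punctuation category, including "¿" and "¡".
    text = "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))
    return " ".join(text.split())


def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(ref: str, hyp: str) -> float:
    """Word error rate in percent for a single utterance."""
    ref_words = ref.split()
    return 100.0 * edit_distance(ref_words, hyp.split()) / len(ref_words)


def cer(ref: str, hyp: str) -> float:
    """Character error rate in percent for a single utterance."""
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def has_digits(text: str) -> bool:
    """Assumed filter for 'numeric' samples: the reference contains a digit."""
    return any(ch.isdigit() for ch in text)


# Made-up example pair, not taken from either dataset.
reference = "¿Dónde está la biblioteca?"
hypothesis = "donde está la biblioteca"

raw_wer = wer(reference, hypothesis)
normalized_wer = wer(normalize(reference), normalize(hypothesis))
print(f"raw WER: {raw_wer:.1f}, normalized WER: {normalized_wer:.1f}")
```

In this toy pair, normalization removes the casing and punctuation mismatches, leaving only the missing accent as an error. Note that the sketch scores a single utterance, whereas dataset-level WER and CER are typically computed over all samples together.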
### Common Voice 11
| Model | CER (%) | WER (%) |
| --- | --- | --- |
| [jonatasgrosman/whisper-large-es-cv11](https://huggingface.co/jonatasgrosman/whisper-large-es-cv11) | 2.43 | 8.85 |
| [jonatasgrosman/whisper-large-es-cv11](https://huggingface.co/jonatasgrosman/whisper-large-es-cv11) + text normalization | 1.56 | 4.67 |
| [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) | 3.71 | 12.34 |
| [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) + text normalization | 2.45 | 6.30 |
### Fleurs
| Model | CER (%) | WER (%) |
| --- | --- | --- |
| [jonatasgrosman/whisper-large-es-cv11](https://huggingface.co/jonatasgrosman/whisper-large-es-cv11) | 3.06 | 9.11 |
| [jonatasgrosman/whisper-large-es-cv11](https://huggingface.co/jonatasgrosman/whisper-large-es-cv11) + text normalization | 3.45 | 5.40 |
| [jonatasgrosman/whisper-large-es-cv11](https://huggingface.co/jonatasgrosman/whisper-large-es-cv11) + keep only non-numeric samples | 1.83 | 7.57 |
| [jonatasgrosman/whisper-large-es-cv11](https://huggingface.co/jonatasgrosman/whisper-large-es-cv11) + text normalization + keep only non-numeric samples | 2.36 | 4.14 |
| [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) | 2.30 | 8.50 |
| [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) + text normalization | 2.76 | 4.79 |
| [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) + keep only non-numeric samples | 1.93 | 7.33 |
| [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) + text normalization + keep only non-numeric samples | 2.50 | 4.28 |
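
To make the tables easier to compare, the following sketch computes the relative change in normalized-text WER between the base model and the fine-tuned model. The `relative_reduction` helper is a hypothetical name introduced here; the input values are copied from the tables above.

```python
def relative_reduction(baseline: float, finetuned: float) -> float:
    """Relative error reduction in percent; negative means the fine-tuned model is worse."""
    return 100.0 * (baseline - finetuned) / baseline


# Normalized-text WER from the tables above: (openai/whisper-large-v2, fine-tuned model).
wer_normalized = {
    "Common Voice 11": (6.30, 4.67),
    "Fleurs": (4.79, 5.40),
    "Fleurs, non-numeric samples only": (4.28, 4.14),
}

for name, (base, tuned) in wer_normalized.items():
    print(f"{name}: {relative_reduction(base, tuned):+.1f}%")
```

On these numbers, the fine-tuned model has a lower WER on Common Voice 11 and on the non-numeric Fleurs subset, and a higher WER on the full Fleurs test set, which is consistent with the numeric-formatting mismatch described in the Evaluation section.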