---
license: cc-by-4.0
tags:
- audio
- automatic-speech-recognition
- hf-asr-leaderboard
language: et
model-index:
- name: TalTechNLP/whisper-medium-et
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 11
      type: mozilla-foundation/common_voice_11_0
      config: et
      split: test
    metrics:
    - name: Test WER
      type: wer
      value: 14.66
    - name: Test CER
      type: cer
      value: 3.76
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 8
      type: mozilla-foundation/common_voice_8_0
      config: et
      split: test
    metrics:
    - name: Test WER
      type: wer
      value: 13.793
    - name: Test CER
      type: cer
      value: 3.194
---
# Whisper-medium-et
This is [openai/whisper-medium](https://huggingface.co/openai/whisper-medium) fine-tuned on around 800 hours of diverse Estonian speech data.
## Model description
This is a general-purpose Estonian ASR model trained in the Lab of Language Technology at TalTech.
## Intended uses & limitations
This model is intended for general-purpose speech recognition, such as broadcast conversations, interviews, talks, etc.
## How to use
Use it like any other Whisper model via HF transformers, or with a faster decoder such as [faster-whisper](https://github.com/guillaumekln/faster-whisper). A minimal transcription sketch with transformers is shown below; the model id comes from this card, while the audio file name and decoding options are placeholder assumptions.
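```python
# Minimal sketch: transcribe an Estonian audio file with HF transformers.
# "audio.wav" is a placeholder; any audio file readable by the pipeline works.
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="TalTechNLP/whisper-medium-et",
    chunk_length_s=30,  # Whisper processes audio in 30-second windows
    device="cuda:0" if torch.cuda.is_available() else "cpu",
)

# Force Estonian transcription instead of Whisper's automatic language detection.
result = asr("audio.wav", generate_kwargs={"language": "et", "task": "transcribe"})
print(result["text"])
```
To use faster-whisper instead, the checkpoint first has to be converted to CTranslate2 format (for example with `ct2-transformers-converter`), since faster-whisper does not load HF transformers checkpoints directly.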
## Limitations and bias
Since this model was trained mostly on broadcast speech and texts from the web, it might have problems correctly decoding the following:
* Speech containing technical and other domain-specific terms
* Children's speech
* Non-native speech
* Speech recorded under very noisy conditions or with a microphone far from the speaker
* Very spontaneous and overlapping speech
## Training data
Acoustic training data:
| Type | Amount (h) |
|-----------------------|:------:|
| Broadcast speech | 591 |
| Spontaneous speech | 53 |
| Elderly speech corpus | 53 |
| Talks, lectures | 49 |
| Parliament speeches | 31 |
| *Total* | *761* |
## Training procedure
Fine-tuned using ESPnet and then converted to the transformers format using [this](https://gist.github.com/alumae/2dcf473b667cec9d513b80ea24e94672) script.
The fine-tuning procedure is similar to that of [this](https://huggingface.co/espnet/shihlun_asr_whisper_medium_finetuned_librispeech100) model.
## Evaluation results
### WER
WER results below are obtained using greedy decoding (i.e., beam size 1); a reproduction sketch follows the table.
|Dataset | WER |
|---|---|
| Common Voice 8.0 | 13.8 |
| Common Voice 11.0 | 14.7 |
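A hedged sketch of how such a greedy-decoding WER could be reproduced on the Common Voice 11 Estonian test split is given below. The dataset loading and the absence of extra text normalization are assumptions, not necessarily the authors' exact evaluation setup.
```python
# Sketch: greedy-decoding (beam size 1) WER on Common Voice 11 (Estonian).
# Note: this dataset requires accepting its terms on the Hugging Face Hub.
from datasets import Audio, load_dataset
from transformers import pipeline
import evaluate

asr = pipeline("automatic-speech-recognition",
               model="TalTechNLP/whisper-medium-et", device=0)

cv = load_dataset("mozilla-foundation/common_voice_11_0", "et", split="test")
cv = cv.cast_column("audio", Audio(sampling_rate=16_000))  # Whisper expects 16 kHz

wer = evaluate.load("wer")
refs, hyps = [], []
for sample in cv:
    out = asr(sample["audio"],
              generate_kwargs={"language": "et", "task": "transcribe",
                               "num_beams": 1})  # greedy decoding
    refs.append(sample["sentence"])
    hyps.append(out["text"])

print(f"WER: {100 * wer.compute(references=refs, predictions=hyps):.2f}")
```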