TalTechNLP/whisper-medium-et


Tanel committed on Mar 23, 2023

Commit 6759493 · 1 Parent(s): 15f933e

Update README.md

Files changed (1)

README.md +57 -0


@@ -39,3 +39,60 @@ model-index:
       type: cer
       value: 3.194
 ---

+# Whisper-medium-et
+This is the [openai/whisper-medium](https://huggingface.co/openai/whisper-medium) model fine-tuned on around 800 hours of diverse Estonian data.
+## Model description
+This is a general-purpose Estonian ASR model trained in the Lab of Language Technology at TalTech.
+## Intended uses & limitations
+This model is intended for general-purpose speech recognition, such as broadcast conversations, interviews, talks, etc.
+## How to use
+Use it like any other Whisper model via Hugging Face Transformers, or use a faster decoder such as [faster-whisper](https://github.com/guillaumekln/faster-whisper).
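
As a rough illustration of the Transformers route, the following is a minimal sketch and not an official usage recipe. It assumes the checkpoint is published under the repository name `TalTechNLP/whisper-medium-et`, that `transformers` and a PyTorch backend are installed, and that `ffmpeg` is available for decoding audio files. The helper names `build_asr_pipeline` and `transcribe` are hypothetical, and `num_beams=1` simply mirrors the greedy decoding setting mentioned for the WER results.

```python
# Assumed repository name on the Hugging Face Hub.
MODEL_ID = "TalTechNLP/whisper-medium-et"


def build_asr_pipeline(model_id: str = MODEL_ID, device: int = -1):
    """Create a Transformers speech recognition pipeline (hypothetical helper)."""
    # Imported lazily so the module can be loaded without pulling in the model.
    from transformers import pipeline

    return pipeline(
        "automatic-speech-recognition",
        model=model_id,
        chunk_length_s=30,  # split long recordings into 30-second chunks
        device=device,      # -1 runs on CPU, 0 selects the first GPU
    )


def transcribe(audio_path: str, asr=None, num_beams: int = 1) -> str:
    """Transcribe one audio file and return the recognized text."""
    if asr is None:
        asr = build_asr_pipeline()
    # num_beams=1 corresponds to greedy decoding.
    result = asr(audio_path, generate_kwargs={"num_beams": num_beams})
    return result["text"].strip()
```

Calling `transcribe("recording.wav")` (the file name is a placeholder) downloads the model on first use. For many files, build the pipeline once with `build_asr_pipeline()` and pass it in through the `asr` argument.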
+## Limitations and bias
+Since this model was trained mostly on broadcast speech and texts from the web, it may have trouble correctly transcribing the following:
+  * Speech containing technical and other domain-specific terms
+  * Children's speech
+  * Non-native speech
+  * Speech recorded under very noisy conditions or with a microphone far from the speaker
+  * Very spontaneous and overlapping speech
+## Training data
+Acoustic training data:
+| Type                  | Amount (h) |
+|-----------------------|:------:|
+| Broadcast speech      |   591  |
+| Spontaneous speech    |   53   |
+| Elderly speech corpus |   53   |
+| Talks, lectures       |   49   |
+| Parliament speeches   |   31   |
+| *Total*               |   *761*  |
+## Training procedure
+Fine-tuned using ESPnet, and then converted to the Transformers format using [this](https://gist.github.com/alumae/2dcf473b667cec9d513b80ea24e94672) script.
+The fine-tuning procedure is similar to that of [this](https://huggingface.co/espnet/shihlun_asr_whisper_medium_finetuned_librispeech100) model.
+## Evaluation results
+### WER
+WER results below are obtained using greedy decoding (i.e., beam size 1).
+| Dataset | WER (%) |
+|---|---|
+| Common Voice 8.0 | 13.8 |
+| Common Voice 11.0 | 14.7 |
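
For readers unfamiliar with the metric, WER is conventionally the word-level edit distance (substitutions, deletions, and insertions) between the reference and the hypothesis, divided by the number of reference words. The sketch below illustrates that standard definition only; it is not the scoring script used to produce the table above, and it assumes both strings have already been normalized (casing, punctuation) in whatever way the evaluation requires. The function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent between two whitespace-tokenized strings."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return 100.0 * prev[-1] / len(ref)


# One substitution and one deletion against four reference words.
print(word_error_rate("a b c d", "a x c"))  # prints 50.0
```

Corpus-level WER is normally computed by summing the edit distances and the reference lengths over all utterances before dividing, rather than by averaging per-utterance scores.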