---
language:
- pt
license: apache-2.0
tags:
- whisper-event
- generated_from_trainer
datasets:
- mozilla-foundation/common_voice_11_0
metrics:
- wer
model-index:
- name: Whisper Medium Portuguese
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: mozilla-foundation/common_voice_11_0 pt
      type: mozilla-foundation/common_voice_11_0
      config: pt
      split: test
      args: pt
    metrics:
    - name: Wer
      type: wer
      value: 6.5785713084850626
---

# Whisper Medium Portuguese 🇧🇷🇵🇹

Welcome to Whisper Medium for Portuguese transcription 👋🏻

If you are looking to **quickly** and **reliably** transcribe Portuguese audio to text, you are in the right place!

With a state-of-the-art [Word Error Rate](https://huggingface.co/spaces/evaluate-metric/wer) (WER) of just **6.579** on Common Voice 11, this model cuts the error rate of prior state-of-the-art [wav2vec2](https://huggingface.co/Edresson/wav2vec2-large-xlsr-coraa-portuguese) models roughly in half (a **2x** reduction), and delivers a **1.2x** improvement over the original [whisper-medium](https://huggingface.co/openai/whisper-medium) model 🚀.

This model is a fine-tuned version of [openai/whisper-medium](https://huggingface.co/openai/whisper-medium) on the [mozilla-foundation/common_voice_11_0](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) dataset.
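
To transcribe Portuguese audio with this checkpoint, the standard 🤗 Transformers ASR pipeline should work out of the box. A minimal sketch (the file name `audio.mp3` is a placeholder for your own recording):

```python
from transformers import pipeline

# Load the fine-tuned checkpoint from the Hub.
transcriber = pipeline(
    "automatic-speech-recognition",
    model="jlondonobo/whisper-medium-pt",
    chunk_length_s=30,  # split long recordings into 30s windows
)

# Pin the decoder to Portuguese transcription, as is usual for Whisper fine-tunes.
transcriber.model.config.forced_decoder_ids = (
    transcriber.tokenizer.get_decoder_prompt_ids(language="pt", task="transcribe")
)

# "audio.mp3" is a placeholder; any format ffmpeg can decode works.
print(transcriber("audio.mp3")["text"])
```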

The following table shows a **comparison** between our model's results and those achieved by the most downloaded models on the Hub for [Portuguese Automatic Speech Recognition](https://huggingface.co/models?language=pt&pipeline_tag=automatic-speech-recognition&sort=downloads) 🗣:

| Model                                            | WER    | Parameters |
|--------------------------------------------------|:--------:|:------------:|
| [openai/whisper-medium](https://huggingface.co/openai/whisper-medium)                            | 8.100   | 769M       |
| [jlondonobo/whisper-medium-pt](https://huggingface.co/jlondonobo/whisper-medium-pt)                     | **6.579** 🤗  | 769M       |
| [jonatasgrosman/wav2vec2-large-xlsr-53-portuguese](https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-portuguese) | 11.310  | 317M       |
| [Edresson/wav2vec2-large-xlsr-coraa-portuguese](https://huggingface.co/Edresson/wav2vec2-large-xlsr-coraa-portuguese)    | 20.080 | 317M       |
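
The WER figures above are percentages computed with the 🤗 `evaluate` implementation of the metric. A minimal sketch with a made-up reference/prediction pair (not the actual evaluation data):

```python
import evaluate

wer_metric = evaluate.load("wer")

# Toy example, purely illustrative: one substitution over five words.
references = ["o gato dorme no sofá"]
predictions = ["o gato dorme no sofa"]

# evaluate returns a fraction; multiply by 100 for percentages like the table's.
wer = 100 * wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer:.3f}")  # -> WER: 20.000
```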


### Training hyperparameters
We used the following hyperparameters for training (sketched as `Seq2SeqTrainingArguments` after the list):
- `learning_rate`: 1e-05
- `train_batch_size`: 32
- `eval_batch_size`: 16
- `seed`: 42
- `optimizer`: Adam with betas=(0.9,0.999) and epsilon=1e-08
- `lr_scheduler_type`: linear
- `lr_scheduler_warmup_steps`: 500
- `training_steps`: 5000
- `mixed_precision_training`: Native AMP
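
For reference, these settings map roughly onto 🤗 Transformers `Seq2SeqTrainingArguments` as sketched below. This is an assumed reconstruction from the list above, not the exact training script:

```python
from transformers import Seq2SeqTrainingArguments

# Hypothetical reconstruction of the hyperparameters listed above.
# Adam betas=(0.9, 0.999) and epsilon=1e-8 are already the Trainer defaults.
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-medium-pt",  # placeholder path
    learning_rate=1e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=16,
    seed=42,
    lr_scheduler_type="linear",
    warmup_steps=500,
    max_steps=5000,
    fp16=True,  # "Native AMP" mixed precision
)
```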

### Training results

| Training Loss | Epoch | Step | Validation Loss | WER    |
|:-------------:|:-----:|:----:|:---------------:|:------:|
| 0.0698        | 1.09  | 1000 | 0.1876          | 7.189 |
| 0.0218        | 3.07  | 2000 | 0.2254          | 7.110 |
| 0.0053        | 5.06  | 3000 | 0.2711          | 6.969 |
| 0.0017        | 7.04  | 4000 | 0.3030          | 6.686 |
| 0.0005        | 9.02  | 5000 | 0.3205          | **6.579** 🤗 |


### Framework versions

- Transformers 4.26.0.dev0
- Pytorch 1.13.0+cu117
- Datasets 2.7.1.dev0
- Tokenizers 0.13.2