---
language:
- pt
license: apache-2.0
tags:
- whisper-event
- generated_from_trainer
datasets:
- mozilla-foundation/common_voice_11_0
metrics:
- wer
model-index:
- name: Whisper Medium Portuguese
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: mozilla-foundation/common_voice_11_0 pt
      type: mozilla-foundation/common_voice_11_0
      config: pt
      split: test
      args: pt
    metrics:
    - name: Wer
      type: wer
      value: 6.5785713084850626
---

# Whisper Medium Portuguese 🇧🇷🇵🇹

Welcome to Whisper Medium for Portuguese transcription 👋🏻

If you are looking to transcribe Portuguese audio to text **quickly** and **reliably**, you are in the right place!

With a state-of-the-art [Word Error Rate](https://huggingface.co/spaces/evaluate-metric/wer) (WER) of just **6.579** on Common Voice 11, this model makes roughly **2x** fewer errors than prior state-of-the-art [wav2vec2](https://huggingface.co/Edresson/wav2vec2-large-xlsr-coraa-portuguese) models. Compared to the original [whisper-medium](https://huggingface.co/openai/whisper-medium) model, it delivers a **1.2x** improvement 🚀.
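For readers new to the metric, WER is the number of word-level substitutions, deletions, and insertions needed to turn the transcription into the reference, divided by the number of reference words. The snippet below is a minimal, dependency-free sketch of that definition. It is not the implementation behind the scores in this card, and `word_error_rate` and the example sentences are illustrative names and data only. It also skips the text normalization (casing, punctuation) that evaluation scripts usually apply.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# One substitution ("portugues") and one deletion ("os") over 6 reference words.
score = word_error_rate(
    "eu falo português todos os dias",
    "eu falo portugues todos dias",
)
print(f"WER: {100 * score:.2f}%")  # WER: 33.33%
```

Lower is better: a score of 0 means the transcription matches the reference word for word.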

This model is a fine-tuned version of [openai/whisper-medium](https://huggingface.co/openai/whisper-medium) on the [mozilla-foundation/common_voice_11_0](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) dataset.

The following table **compares** our model's results with those of the most downloaded models on the Hub for [Portuguese Automatic Speech Recognition](https://huggingface.co/models?language=pt&pipeline_tag=automatic-speech-recognition&sort=downloads) 🗣:

| Model                                            | WER    | Parameters |
|--------------------------------------------------|:--------:|:------------:|
| [openai/whisper-medium](https://huggingface.co/openai/whisper-medium)                            | 8.100   | 769M       |
| [jlondonobo/whisper-medium-pt](https://huggingface.co/jlondonobo/whisper-medium-pt)                     | **6.579** 🤗  | 769M       |
| [jonatasgrosman/wav2vec2-large-xlsr-53-portuguese](https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-portuguese) | 11.310  | 317M       |
| [Edresson/wav2vec2-large-xlsr-coraa-portuguese](https://huggingface.co/Edresson/wav2vec2-large-xlsr-coraa-portuguese)    | 20.080 | 317M       |
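
As a quick sanity check, the relative improvement of this model over each baseline can be derived from the table by dividing the baseline WER by ours. The sketch below only uses the WER values listed above; the variable names are illustrative.

```python
# WER values copied from the comparison table above.
wer = {
    "openai/whisper-medium": 8.100,
    "jonatasgrosman/wav2vec2-large-xlsr-53-portuguese": 11.310,
    "Edresson/wav2vec2-large-xlsr-coraa-portuguese": 20.080,
}
ours = 6.579  # jlondonobo/whisper-medium-pt

# A ratio above 1 means the baseline makes that many times more word errors.
ratios = {name: value / ours for name, value in wer.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
# openai/whisper-medium: 1.23x
# jonatasgrosman/wav2vec2-large-xlsr-53-portuguese: 1.72x
# Edresson/wav2vec2-large-xlsr-coraa-portuguese: 3.05x
```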


### How to use
You can use this model directly with a pipeline, which is especially convenient for short audio. For **long-form** transcriptions, please use the code in the [Long-form transcription](#long-form-transcription) section.

```bash
pip install git+https://github.com/huggingface/transformers --force-reinstall
pip install torch
```

```python
>>> from transformers import pipeline
>>> import torch

>>> device = 0 if torch.cuda.is_available() else "cpu"

# Load the pipeline
>>> transcribe = pipeline(
...     task="automatic-speech-recognition",
...     model="jlondonobo/whisper-medium-pt",
...     chunk_length_s=30,
...     device=device,
... )

# Force model to transcribe in Portuguese
>>> transcribe.model.config.forced_decoder_ids = transcribe.tokenizer.get_decoder_prompt_ids(language="pt", task="transcribe")

# Transcribe your audio file
>>> transcribe("audio.m4a")["text"]
'Eu falo português.'
```

#### Long-form transcription
To improve the performance of long-form transcription, you can convert the Hugging Face model into a `whisper` model and use the original paper's matching algorithm. To do this, you must install `whisper` and a set of tools developed by [@bayartsogt](https://huggingface.co/bayartsogt).
```bash
pip install git+https://github.com/openai/whisper.git
pip install git+https://github.com/bayartsogt-ya/whisper-multiple-hf-datasets
```

Then convert the Hugging Face model and transcribe:
```python
>>> import torch
>>> import whisper
>>> from multiple_datasets.hub_default_utils import convert_hf_whisper

>>> device = "cuda" if torch.cuda.is_available() else "cpu"

# Write HF model to local whisper model
>>> convert_hf_whisper("jlondonobo/whisper-medium-pt", "local_whisper_model.pt")

# Load the whisper model
>>> model = whisper.load_model("local_whisper_model.pt", device=device)

# Transcribe arbitrarily long audio
>>> model.transcribe("long_audio.m4a", language="pt")["text"]
'Olá eu sou o José. Tenho 23 anos e trabalho...'
```


### Training hyperparameters
We used the following hyperparameters for training:
- `learning_rate`: 1e-05
- `train_batch_size`: 32
- `eval_batch_size`: 16
- `seed`: 42
- `optimizer`: Adam with betas=(0.9,0.999) and epsilon=1e-08
- `lr_scheduler_type`: linear
- `lr_scheduler_warmup_steps`: 500
- `training_steps`: 5000
- `mixed_precision_training`: Native AMP
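
To make the schedule concrete, here is a minimal sketch of how a linear scheduler with warmup maps a training step to a learning rate using the values listed above. It assumes the usual behavior of a linear schedule (ramp up from 0 over the warmup steps, then decay linearly to 0 at the final step). `linear_schedule_lr` is an illustrative helper, not the code used for training.

```python
def linear_schedule_lr(
    step: int,
    base_lr: float = 1e-5,
    warmup_steps: int = 500,
    total_steps: int = 5000,
) -> float:
    """Learning rate at `step` for linear warmup followed by linear decay."""
    if step < warmup_steps:
        # Warmup: scale linearly from 0 up to base_lr.
        return base_lr * step / warmup_steps
    # Decay: scale linearly from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


for step in (0, 250, 500, 2750, 5000):
    print(f"step {step:>4}: lr = {linear_schedule_lr(step):.2e}")
# step    0: lr = 0.00e+00
# step  250: lr = 5.00e-06
# step  500: lr = 1.00e-05
# step 2750: lr = 5.00e-06
# step 5000: lr = 0.00e+00
```

Under these assumptions the peak learning rate of 1e-05 is reached at step 500, and the checkpoints evaluated at steps 1000 through 5000 in the results table all fall in the decay phase.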

### Training results

| Training Loss | Epoch | Step | Validation Loss | WER    |
|:-------------:|:-----:|:----:|:---------------:|:------:|
| 0.0698        | 1.09  | 1000 | 0.1876          | 7.189 |
| 0.0218        | 3.07  | 2000 | 0.2254          | 7.110 |
| 0.0053        | 5.06  | 3000 | 0.2711          | 6.969 |
| 0.0017        | 7.04  | 4000 | 0.3030          | 6.686 |
| 0.0005        | 9.02  | 5000 | 0.3205          | **6.579** 🤗 |


### Framework versions

- Transformers 4.26.0.dev0
- PyTorch 1.13.0+cu117
- Datasets 2.7.1.dev0
- Tokenizers 0.13.2