---
language:
- pt
license: apache-2.0
tags:
- whisper-event
- generated_from_trainer
datasets:
- mozilla-foundation/common_voice_11_0
metrics:
- wer
model-index:
- name: Whisper Medium Portuguese
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: mozilla-foundation/common_voice_11_0 pt
      type: mozilla-foundation/common_voice_11_0
      config: pt
      split: test
      args: pt
    metrics:
    - name: Wer
      type: wer
      value: 6.5785713084850626
---

# Whisper Medium Portuguese 🇧🇷🇵🇹

Welcome to Whisper Medium for Portuguese transcription 👋🏻

If you are looking to transcribe Portuguese audio to text **quickly** and **reliably**, you are in the right place!

With a state-of-the-art [Word Error Rate](https://huggingface.co/spaces/evaluate-metric/wer) (WER) of just **6.579** on Common Voice 11, this model offers a **2x** improvement in WER over prior state-of-the-art [wav2vec2](https://huggingface.co/Edresson/wav2vec2-large-xlsr-coraa-portuguese) models. Compared to the original [whisper-medium](https://huggingface.co/openai/whisper-medium) model, it delivers a **1.2x** improvement 🚀.

This model is a fine-tuned version of [openai/whisper-medium](https://huggingface.co/openai/whisper-medium) on the [mozilla-foundation/common_voice_11_0](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) dataset.

The following table **compares** the results of our model with those of the most downloaded models on the Hub for [Portuguese Automatic Speech Recognition](https://huggingface.co/models?language=pt&pipeline_tag=automatic-speech-recognition&sort=downloads) 🗣:

| Model | WER (%) | Parameters |
|-------|:-------:|:----------:|
| [openai/whisper-medium](https://huggingface.co/openai/whisper-medium) | 8.100 | 769M |
| [jlondonobo/whisper-medium-pt](https://huggingface.co/jlondonobo/whisper-medium-pt) | **6.579** 🤗 | 769M |
| [jonatasgrosman/wav2vec2-large-xlsr-53-portuguese](https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-portuguese) | 11.310 | 317M |
| [Edresson/wav2vec2-large-xlsr-coraa-portuguese](https://huggingface.co/Edresson/wav2vec2-large-xlsr-coraa-portuguese) | 20.080 | 317M |
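
As a rough check on the multipliers quoted earlier, the sketch below divides each model's WER by ours. It assumes the multipliers are meant as plain WER ratios, which the card does not state explicitly; the helper name `wer_ratio` is made up for illustration.

```python
# WER values (in %) copied from the comparison table above.
WER = {
    "jlondonobo/whisper-medium-pt": 6.579,
    "openai/whisper-medium": 8.100,
    "jonatasgrosman/wav2vec2-large-xlsr-53-portuguese": 11.310,
    "Edresson/wav2vec2-large-xlsr-coraa-portuguese": 20.080,
}

OURS = "jlondonobo/whisper-medium-pt"


def wer_ratio(baseline: str, ours: str = OURS) -> float:
    """Return how many times larger the baseline WER is than ours.

    Hypothetical helper: assumes "Nx improvement" means a ratio of WERs.
    """
    return WER[baseline] / WER[ours]


for name in WER:
    if name != OURS:
        print(f"{name}: {wer_ratio(name):.2f}x")
```

Under this reading, the ratio against `openai/whisper-medium` comes out close to the quoted 1.2x, while the two wav2vec2 rows fall on either side of 2x.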


### How to use
You can use this model directly with a pipeline. This is especially useful for short audio. For **long-form** transcriptions, please use the code in the [Long-form transcription](#long-form-transcription) section.

```bash
pip install git+https://github.com/huggingface/transformers --force-reinstall
pip install torch
```

```python
>>> from transformers import pipeline
>>> import torch

>>> device = 0 if torch.cuda.is_available() else "cpu"

# Load the pipeline
>>> transcribe = pipeline(
...     task="automatic-speech-recognition",
...     model="jlondonobo/whisper-medium-pt",
...     chunk_length_s=30,
...     device=device,
... )

# Force model to transcribe in Portuguese
>>> transcribe.model.config.forced_decoder_ids = transcribe.tokenizer.get_decoder_prompt_ids(language="pt", task="transcribe")

# Transcribe your audio file
>>> transcribe("audio.m4a")["text"]
'Eu falo português.'
```

#### Long-form transcription
To improve the performance of long-form transcription, you can convert the HF model into a `whisper` model and use the original paper's matching algorithm. To do this, you must install `whisper` and a set of tools developed by [@bayartsogt](https://huggingface.co/bayartsogt).
```bash
pip install git+https://github.com/openai/whisper.git
pip install git+https://github.com/bayartsogt-ya/whisper-multiple-hf-datasets
```

Then convert the Hugging Face model and transcribe:
```python
>>> import torch
>>> import whisper
>>> from multiple_datasets.hub_default_utils import convert_hf_whisper

>>> device = "cuda" if torch.cuda.is_available() else "cpu"

# Write HF model to local whisper model
>>> convert_hf_whisper("jlondonobo/whisper-medium-pt", "local_whisper_model.pt")

# Load the whisper model
>>> model = whisper.load_model("local_whisper_model.pt", device=device)

# Transcribe arbitrarily long audio
>>> model.transcribe("long_audio.m4a", language="pt")["text"]
'Olá eu sou o José. Tenho 23 anos e trabalho...'
```


### Training hyperparameters
We used the following hyperparameters for training:
- `learning_rate`: 1e-05
- `train_batch_size`: 32
- `eval_batch_size`: 16
- `seed`: 42
- `optimizer`: Adam with betas=(0.9,0.999) and epsilon=1e-08
- `lr_scheduler_type`: linear
- `lr_scheduler_warmup_steps`: 500
- `training_steps`: 5000
- `mixed_precision_training`: Native AMP
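
To make the schedule concrete, here is a minimal sketch of a linear learning-rate schedule with warmup using the values listed above (peak `1e-05`, 500 warmup steps, 5000 training steps). It is a plain-Python illustration, not the trainer's actual code, and the function name `linear_schedule_lr` is hypothetical.

```python
def linear_schedule_lr(
    step: int,
    peak_lr: float = 1e-5,
    warmup_steps: int = 500,
    total_steps: int = 5000,
) -> float:
    """Learning rate at a given optimizer step (illustrative sketch).

    Ramps linearly from 0 to peak_lr over warmup_steps, then decays
    linearly back to 0 at total_steps.
    """
    if step < warmup_steps:
        # Warmup phase: fraction of the peak proportional to progress.
        return peak_lr * step / warmup_steps
    # Decay phase: fraction of the peak proportional to steps remaining.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


# Print the learning rate at each evaluation checkpoint.
for step in (1000, 2000, 3000, 4000, 5000):
    print(step, linear_schedule_lr(step))
```

With these settings the learning rate peaks at step 500 and reaches zero at the final step.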

### Training results

| Training Loss | Epoch | Step | Validation Loss | WER (%)      |
|:-------------:|:-----:|:----:|:---------------:|:------------:|
| 0.0698        | 1.09  | 1000 | 0.1876          | 7.189        |
| 0.0218        | 3.07  | 2000 | 0.2254          | 7.110        |
| 0.0053        | 5.06  | 3000 | 0.2711          | 6.969        |
| 0.0017        | 7.04  | 4000 | 0.3030          | 6.686        |
| 0.0005        | 9.02  | 5000 | 0.3205          | **6.579** 🤗 |
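
For readers unfamiliar with the metric in the last column, the sketch below computes a word error rate for a single sentence pair as word-level edit distance divided by the number of reference words. It is a simplified illustration: the function name `word_error_rate` is hypothetical, it applies no text normalization, and the evaluation used for this model may preprocess text and aggregate errors over the whole test set differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent for one reference/hypothesis pair.

    WER = (substitutions + deletions + insertions) / reference words,
    computed with a word-level Levenshtein distance.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion
                    curr[j - 1] + 1,     # insertion
                    prev[j - 1] + cost,  # match or substitution
                )
            )
        prev = curr
    return 100 * prev[-1] / len(ref)


print(word_error_rate("eu falo português", "eu falo portugues"))
```

Here one of three reference words is substituted, so the result is about 33.3.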


### Framework versions

- Transformers 4.26.0.dev0
- PyTorch 1.13.0+cu117
- Datasets 2.7.1.dev0
- Tokenizers 0.13.2