metadata
language: pt
datasets:
- common_voice
- mls
- cetuc
- lapsbm
- voxforge
- tedx
- sid
metrics:
- wer
tags:
- audio
- speech
- wav2vec2
- pt
- portuguese-speech-corpus
- automatic-speech-recognition
- PyTorch
license: apache-2.0
model-index:
- name: bp400-xlsr
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
bp500-base100k_voxpopuli: Wav2vec 2.0 with Brazilian Portuguese (BP) Dataset
This is a demonstration of a fine-tuned Wav2vec 2.0 model for Brazilian Portuguese, trained on the following datasets:
- CETUC: contains approximately 145 hours of Brazilian Portuguese speech distributed among 50 male and 50 female speakers, each pronouncing approximately 1,000 phonetically balanced sentences selected from the CETEN-Folha corpus.
- Common Voice 7.0: a project proposed by the Mozilla Foundation with the goal of creating a wide open dataset in different languages. In this project, volunteers donate and validate speech using the official site.
- Lapsbm: "Falabrasil - UFPA" is a dataset used by the Fala Brasil group to benchmark ASR systems in Brazilian Portuguese. It contains 35 speakers (10 female), each pronouncing 20 unique sentences, totalling 700 utterances in Brazilian Portuguese. The audio was recorded at 22.05 kHz without environment control.
- Multilingual Librispeech (MLS): a massive dataset available in many languages. MLS is based on public-domain audiobook recordings such as those from LibriVox. The dataset contains a total of 6k hours of transcribed data in many languages. The Portuguese set used in this work (mostly the Brazilian variant) has approximately 284 hours of speech, obtained from 55 audiobooks read by 62 speakers.
- Multilingual TEDx: a collection of audio recordings from TEDx talks in 8 source languages. The Portuguese set (mostly Brazilian Portuguese variant) contains 164 hours of transcribed speech.
- Sidney (SID): contains 5,777 utterances recorded by 72 speakers (20 women) aged 17 to 59, with metadata fields such as place of birth, age, gender, education, and occupation.
- VoxForge: a project with the goal of building open datasets for acoustic models. The corpus contains approximately 100 speakers and 4,130 utterances of Brazilian Portuguese, with sample rates varying from 16 kHz to 44.1 kHz.
These datasets were combined to build a larger Brazilian Portuguese dataset. All data was used for training except the Common Voice dev and test sets, which were used for validation and testing, respectively. We also built test sets for all the gathered datasets.
Dataset | Train | Valid | Test |
---|---|---|---|
CETUC | 94.0h | -- | 5.4h |
Common Voice | 37.8h | 8.9h | 9.5h |
LaPS BM | 0.8h | -- | 0.1h |
MLS | 161.0h | -- | 3.7h |
Multilingual TEDx (Portuguese) | 148.9h | -- | 1.8h |
SID | 7.2h | -- | 1.0h |
VoxForge | 3.9h | -- | 0.1h |
Total | 453.6h | 8.9h | 21.6h |
The original model was fine-tuned using fairseq. This notebook uses a converted version of the original one. The link to the original fairseq model is available here.
Summary
Model | CETUC | CV | LaPS | MLS | SID | TEDx | VF | AVG |
---|---|---|---|---|---|---|---|---|
bp_500-base100k_voxpopuli (demonstration below) | 0.142 | 0.201 | 0.052 | 0.224 | 0.102 | 0.317 | 0.048 | 0.155 |
bp_500-base100k_voxpopuli + 4-gram (demonstration below) | 0.099 | 0.149 | 0.047 | 0.192 | 0.115 | 0.371 | 0.127 | 0.157 |
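The AVG column is consistent with an unweighted mean of the seven per-dataset WERs. The short check below recomputes it from the values in the table and lists where the 4-gram LM lowers or raises WER. The names `wer_no_lm`, `wer_4gram`, and `macro_average` are illustrative and not part of the original notebook.

```python
# WER values copied from the summary table (columns CETUC..VF).
wer_no_lm = {"CETUC": 0.142, "CV": 0.201, "LaPS": 0.052, "MLS": 0.224,
             "SID": 0.102, "TEDx": 0.317, "VF": 0.048}
wer_4gram = {"CETUC": 0.099, "CV": 0.149, "LaPS": 0.047, "MLS": 0.192,
             "SID": 0.115, "TEDx": 0.371, "VF": 0.127}


def macro_average(scores):
    """Unweighted mean over datasets, rounded to three decimals like the table."""
    return round(sum(scores.values()) / len(scores), 3)


avg_no_lm = macro_average(wer_no_lm)
avg_4gram = macro_average(wer_4gram)

# Datasets where the 4-gram LM lowers (improves) WER, and where it raises it.
improved = [name for name in wer_no_lm if wer_4gram[name] < wer_no_lm[name]]
degraded = [name for name in wer_no_lm if wer_4gram[name] > wer_no_lm[name]]

print(avg_no_lm, avg_4gram)   # 0.155 0.157
print("improved:", improved)  # CETUC, CV, LaPS, MLS
print("degraded:", degraded)  # SID, TEDx, VF
```

Because the average is unweighted, the small VoxForge and LaPS test sets count as much as the larger Common Voice one.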
Transcription examples
Text | Transcription |
---|---|
qual o instagram dele | qualo está gramedele |
o capitão foi expulso do exército porque era doido | o capitãl foi exposo do exército porque era doido |
também por que não | também porque não |
não existe tempo como o presente | não existe tempo como o presente |
eu pulei para salvar rachel | eu pulei para salvar haquel |
augusto cezar passos marinho | augusto cesa passoesmarinho |
Demonstration
MODEL_NAME = "lgris/bp500-base100k_voxpopuli"
Imports and dependencies
%%capture
!pip install torch==1.8.2+cu111 torchvision==0.9.2+cu111 torchaudio===0.8.2 -f https://download.pytorch.org/whl/lts/1.8/torch_lts.html
!pip install datasets
!pip install jiwer
!pip install transformers
!pip install soundfile
!pip install pyctcdecode
!pip install https://github.com/kpu/kenlm/archive/master.zip
import re

import jiwer
import torch
import torchaudio
from datasets import load_dataset
from pyctcdecode import build_ctcdecoder
from transformers import (
    Wav2Vec2ForCTC,
    Wav2Vec2Processor,
)
Helpers
# Punctuation stripped from reference sentences before scoring.
chars_to_ignore_regex = r'[,?.!;:"]'


def map_to_array(batch):
    # Load the waveform. No resampling is done here, so the test audio is
    # assumed to already be 16 kHz mono.
    speech, _ = torchaudio.load(batch["path"])
    batch["speech"] = speech.squeeze(0).numpy()
    batch["sampling_rate"] = 16_000
    # Normalize the reference: drop punctuation, lowercase, unify apostrophes.
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower().replace("’", "'")
    batch["target"] = batch["sentence"]
    return batch


def calc_metrics(truths, hypos):
    # Average the per-utterance WER, MER and WIL over all sentence pairs.
    wers = []
    mers = []
    wils = []
    for t, h in zip(truths, hypos):
        try:
            wers.append(jiwer.wer(t, h))
            mers.append(jiwer.mer(t, h))
            wils.append(jiwer.wil(t, h))
        except Exception:  # Empty string?
            pass
    wer = sum(wers)/len(wers)
    mer = sum(mers)/len(mers)
    wil = sum(wils)/len(wils)
    return wer, mer, wil


def load_data(dataset):
    # Each dataset directory is expected to contain a test.csv file.
    data_files = {'test': f'{dataset}/test.csv'}
    dataset = load_dataset('csv', data_files=data_files)["test"]
    return dataset.map(map_to_array)
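The reference text is normalized in `map_to_array` before scoring. The standalone sketch below isolates that step so it can be tried without audio files. The `normalize` helper and the punctuated input sentence are illustrative and not part of the original notebook.

```python
import re

# Same punctuation set as chars_to_ignore_regex in the helpers.
CHARS_TO_IGNORE = r'[,?.!;:"]'


def normalize(sentence):
    """Strip punctuation, lowercase, and unify the apostrophe character."""
    return re.sub(CHARS_TO_IGNORE, '', sentence).lower().replace("’", "'")


print(normalize('Também, por que não?'))  # também por que não
```

Only the reference is normalized this way; the model output is compared as decoded.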
Model
class STT:

    def __init__(self,
                 model_name,
                 device='cuda' if torch.cuda.is_available() else 'cpu',
                 lm=None):
        self.model_name = model_name
        self.model = Wav2Vec2ForCTC.from_pretrained(model_name).to(device)
        self.processor = Wav2Vec2Processor.from_pretrained(model_name)
        # Vocabulary tokens ordered by id, used as the label list for pyctcdecode.
        self.vocab_dict = self.processor.tokenizer.get_vocab()
        self.sorted_dict = {
            k.lower(): v for k, v in sorted(self.vocab_dict.items(),
                                            key=lambda item: item[1])
        }
        self.device = device
        self.lm = lm
        if self.lm:
            # Beam-search decoder backed by the n-gram model file at path `lm`.
            self.lm_decoder = build_ctcdecoder(
                list(self.sorted_dict.keys()),
                self.lm
            )

    def batch_predict(self, batch):
        features = self.processor(batch["speech"],
                                  sampling_rate=batch["sampling_rate"][0],
                                  padding=True,
                                  return_tensors="pt")
        input_values = features.input_values.to(self.device)
        with torch.no_grad():
            logits = self.model(input_values).logits
        if self.lm:
            # Decode each utterance with the LM-backed beam search.
            logits = logits.cpu().numpy()
            batch["predicted"] = []
            for sample_logits in logits:
                batch["predicted"].append(self.lm_decoder.decode(sample_logits))
        else:
            # Greedy decoding: most likely token per frame, collapsed by the tokenizer.
            pred_ids = torch.argmax(logits, dim=-1)
            batch["predicted"] = self.processor.batch_decode(pred_ids)
        return batch
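Without an LM, `batch_predict` takes the argmax token per frame and lets the processor collapse the result into text. The sketch below shows the usual greedy CTC rule in plain Python: merge consecutive repeats, then drop the blank token. The vocabulary, frame ids, and `blank_id=0` are hypothetical values chosen for illustration; this is not the Transformers implementation.

```python
def greedy_ctc_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse consecutive repeated ids, drop CTC blanks, join the characters."""
    chars = []
    previous = None
    for idx in frame_ids:
        # Emit a character only when the id changes and is not the blank.
        if idx != previous and idx != blank_id:
            chars.append(id_to_char[idx])
        previous = idx
    return "".join(chars)


# Hypothetical vocabulary; id 0 plays the role of the CTC blank.
id_to_char = {0: "<pad>", 1: "n", 2: "ã", 3: "o", 4: "v"}

print(greedy_ctc_decode([1, 1, 0, 2, 2, 2, 0, 3, 3], id_to_char))  # não
print(greedy_ctc_decode([4, 0, 3, 3, 0, 3, 3], id_to_char))        # voo
print(greedy_ctc_decode([4, 3, 3, 3], id_to_char))                 # vo
```

The last two calls show why the blank matters: a doubled letter survives only when a blank frame separates its two occurrences.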
Download datasets
%%capture
!gdown --id 1HFECzIizf-bmkQRLiQD0QVqcGtOG5upI
!mkdir bp_dataset
!unzip bp_dataset -d bp_dataset/
%cd bp_dataset
/content/bp_dataset
Tests
stt = STT(MODEL_NAME)
CETUC
ds = load_data('cetuc_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("CETUC WER:", wer)
CETUC WER: 0.1419179499917191
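The WER values printed in these tests are means of per-utterance scores, because `calc_metrics` averages `jiwer.wer` over sentence pairs. That can differ from a corpus-level WER (total word errors divided by total reference words). The sketch below illustrates the difference with a hand-written word-level edit distance, which is an illustrative stand-in and not jiwer's implementation, applied to three pairs from the transcription examples table.

```python
def word_errors(reference, hypothesis):
    """Word-level Levenshtein distance between two whitespace-tokenized strings."""
    ref, hyp = reference.split(), hypothesis.split()
    # previous[j] = edit distance between the processed ref prefix and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


# Reference/hypothesis pairs taken from the transcription examples table.
pairs = [
    ("qual o instagram dele", "qualo está gramedele"),
    ("também por que não", "também porque não"),
    ("não existe tempo como o presente", "não existe tempo como o presente"),
]

errors = [word_errors(ref, hyp) for ref, hyp in pairs]
lengths = [len(ref.split()) for ref, _ in pairs]

# Mean of per-utterance WERs (the aggregation used by calc_metrics).
mean_utterance_wer = sum(e / n for e, n in zip(errors, lengths)) / len(pairs)
# Corpus-level WER: total errors over total reference words.
corpus_wer = sum(errors) / sum(lengths)

print(errors, lengths)          # [4, 2, 0] [4, 4, 6]
print(mean_utterance_wer)       # 0.5
print(round(corpus_wer, 3))     # 0.429
```

Short utterances with many errors weigh more in the per-utterance mean, so the two aggregations should not be compared directly.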
Common Voice
ds = load_data('commonvoice_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("CV WER:", wer)
CV WER: 0.20079950312040154
LaPS
ds = load_data('lapsbm_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("Laps WER:", wer)
Laps WER: 0.052780934343434324
MLS
ds = load_data('mls_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("MLS WER:", wer)
MLS WER: 0.22413887199364113
SID
ds = load_data('sid_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("Sid WER:", wer)
Sid WER: 0.1019041538671034
TEDx
ds = load_data('tedx_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("TEDx WER:", wer)
TEDx WER: 0.31711268778273327
VoxForge
ds = load_data('voxforge_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("VoxForge WER:", wer)
VoxForge WER: 0.04826433982683982
Tests with LM
!rm -rf ~/.cache
!gdown --id 1GJIKseP5ZkTbllQVgOL98R4yYAcIySFP # trained with wikipedia
stt = STT(MODEL_NAME, lm='pt-BR-wiki.word.4-gram.arpa')
# !gdown --id 1dLFldy7eguPtyJj5OAlI4Emnx0BpFywg # trained with bp
# stt = STT(MODEL_NAME, lm='pt-BR.word.4-gram.arpa')
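The `STT` class passes only the label list and the ARPA file to `build_ctcdecoder`, so the LM weights are left at the library defaults rather than tuned per dataset. The toy sketch below shows the general idea of LM-fused scoring (often called shallow fusion): an acoustic score is combined with a weighted LM score and a per-word bonus. All scores are hypothetical numbers, `fused_score` and `best_candidate` are illustrative helpers rather than pyctcdecode functions, and the `alpha` and `beta` values are assumed for illustration.

```python
def fused_score(acoustic_logp, lm_logp, num_words, alpha=0.5, beta=1.5):
    """Acoustic score plus weighted LM score plus a per-word bonus."""
    return acoustic_logp + alpha * lm_logp + beta * num_words


def best_candidate(candidates, alpha=0.5, beta=1.5):
    """Return the text of the candidate with the highest fused score."""
    return max(
        candidates,
        key=lambda c: fused_score(c[1], c[2], len(c[0].split()), alpha, beta),
    )[0]


# Hypothetical candidates: (text, acoustic log-prob, LM log-prob).
candidates = [
    ("também porque não", -4.0, -9.0),
    ("também por que não", -4.6, -7.0),
]

# Acoustic score alone prefers the first candidate.
acoustic_choice = max(candidates, key=lambda c: c[1])[0]
# With the LM term and word bonus, the second candidate wins.
fused_choice = best_candidate(candidates)

print(acoustic_choice)  # também porque não
print(fused_choice)     # também por que não
```

The same mechanism can also push the decoder away from a correct acoustic hypothesis when the LM's training text does not match the test domain, which is one possible reading of the higher WER on SID, TEDx, and VoxForge in the summary table.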
CETUC
ds = load_data('cetuc_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("CETUC WER:", wer)
CETUC WER: 0.099518615112877
Common Voice
ds = load_data('commonvoice_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("CV WER:", wer)
CV WER: 0.1488912889506362
LaPS
ds = load_data('lapsbm_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("Laps WER:", wer)
Laps WER: 0.047080176767676764
MLS
ds = load_data('mls_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("MLS WER:", wer)
MLS WER: 0.19220291966887196
SID
ds = load_data('sid_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("Sid WER:", wer)
Sid WER: 0.11535498771650306
TEDx
ds = load_data('tedx_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("TEDx WER:", wer)
TEDx WER: 0.3707890073539895
VoxForge
ds = load_data('voxforge_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("VoxForge WER:", wer)
VoxForge WER: 0.12682088744588746