Wav2Vec2-Large-XLSR-53-Spanish-With-LM

This is a copy of the Wav2Vec2-Large-XLSR-53-Spanish model with added language model support.

This model card can be seen as a demo of the pyctcdecode integration with Transformers introduced by this PR. The PR explains in detail how the integration works.

In a nutshell: this PR adds a new Wav2Vec2ProcessorWithLM class as a drop-in replacement for Wav2Vec2Processor.

The only change from the existing ASR pipeline will be:

Changes

import torch
from datasets import load_dataset
from transformers import AutoModelForCTC, AutoProcessor
import torchaudio.functional as F


model_id = "patrickvonplaten/wav2vec2-large-xlsr-53-spanish-with-lm"

sample = next(iter(load_dataset("common_voice", "es", split="test", streaming=True)))
resampled_audio = F.resample(torch.tensor(sample["audio"]["array"]), 48_000, 16_000).numpy()

model = AutoModelForCTC.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

input_values = processor(resampled_audio, return_tensors="pt").input_values

with torch.no_grad():
    logits = model(input_values).logits

-prediction_ids = torch.argmax(logits, dim=-1)
-transcription = processor.batch_decode(prediction_ids)
+transcription = processor.batch_decode(logits.numpy()).text
# => 'bien y qué regalo vas a abrir primero'
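
To make the diff above more concrete, here is a minimal sketch of what the two removed lines amount to: greedy CTC decoding, which takes the argmax per frame, collapses repeated ids, and drops the blank token. The toy vocabulary, the scores, and the helper name `greedy_ctc_decode` are hypothetical and only serve as illustration; the real model uses its own tokenizer vocabulary. The language-model path instead receives the full logits, so the decoder can weigh alternative hypotheses rather than committing to the single best token per frame.

```python
from itertools import groupby


def greedy_ctc_decode(logits, vocab, blank_id=0):
    """Greedy CTC decoding: argmax per frame, collapse repeats, drop blanks.

    `logits` is a list of frames, each a list of per-token scores.
    `blank_id=0` is an assumption made for this toy example.
    """
    # 1. Pick the highest-scoring token id in every frame.
    best_ids = [max(range(len(frame)), key=frame.__getitem__) for frame in logits]
    # 2. Collapse consecutive duplicates (CTC emits a token over several frames).
    collapsed = [token_id for token_id, _ in groupby(best_ids)]
    # 3. Remove the blank token and map the remaining ids to characters.
    return "".join(vocab[i] for i in collapsed if i != blank_id)


# Hypothetical toy vocabulary and scores, not taken from the real model.
toy_vocab = ["<pad>", "h", "o", "l", "a"]
toy_logits = [
    [0.1, 0.9, 0.0, 0.0, 0.0],  # h
    [0.1, 0.8, 0.1, 0.0, 0.0],  # h again (collapsed with the previous frame)
    [0.9, 0.0, 0.1, 0.0, 0.0],  # blank
    [0.0, 0.0, 0.9, 0.1, 0.0],  # o
    [0.0, 0.0, 0.1, 0.8, 0.1],  # l
    [0.0, 0.1, 0.0, 0.1, 0.8],  # a
]

print(greedy_ctc_decode(toy_logits, toy_vocab))  # hola
```

Because the argmax discards every score except the best one per frame, a language model has nothing left to rescore after this step, which is why the processor with LM support is given the raw logits instead of `prediction_ids`.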

Improvement

This model was compared with the baseline on 512 speech samples from the Spanish Common Voice test set. It lowers WER from 10.20% to 8.44%, a relative reduction of about 17%:

Model                                                      WER      CER
patrickvonplaten/wav2vec2-large-xlsr-53-spanish-with-lm    8.44%    2.93%
jonatasgrosman/wav2vec2-large-xlsr-53-spanish              10.20%   3.24%

The results can be reproduced by running the following from this model repository:

python run_ngram_wav2vec2.py 1 512
python run_ngram_wav2vec2.py 0 512

with run_ngram_wav2vec2.py being https://huggingface.co/patrickvonplaten/wav2vec2-large-xlsr-53-spanish-with-lm/blob/main/run_ngram_wav2vec2.py
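
For readers unfamiliar with the metric, the following is a minimal sketch of how a word error rate can be computed and how the relative gain implied by the reported numbers is derived. It is not the evaluation code from run_ngram_wav2vec2.py; the helper names `word_error_rate` and `relative_reduction` and the example hypothesis string are hypothetical, and the sketch assumes a non-empty reference.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline, improved):
    """Relative error reduction of `improved` with respect to `baseline`."""
    return (baseline - improved) / baseline


reference = "bien y qué regalo vas a abrir primero"
hypothesis = "bien y que regalo vas abrir primero"  # made-up erroneous output
print(word_error_rate(reference, hypothesis))  # 0.25 (1 substitution + 1 deletion over 8 words)

# Relative gains computed from the WER and CER values reported in this card.
print(f"{relative_reduction(10.20, 8.44):.1%}")  # 17.3%
print(f"{relative_reduction(3.24, 2.93):.1%}")   # 9.6%
```

CER is computed the same way at the character level, that is, with characters instead of whitespace-separated words as the units of the edit distance.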
