metadata

language: el
datasets:
  - common_voice
  - CSS10
metrics:
  - wer
tags:
  - audio
  - automatic-speech-recognition
  - speech
  - xlsr-fine-tuning-week
license: apache-2.0
model-index:
  - name: Greek XLSR Wav2Vec2 Large 53 - CV + CSS10
    results:
      - task:
          name: Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: Common Voice el
          type: common_voice
          args: el
        metrics:
          - name: Test WER
            type: wer
            value: 20.89

Wav2Vec2-Large-XLSR-53-greek

Fine-tuned facebook/wav2vec2-large-xlsr-53 on Greek using the Common Voice and CSS10 datasets. When using this model, make sure that your speech input is sampled at 16 kHz.
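A quick way to confirm the sampling rate of an input before passing it to the model is sketched below. It is a minimal check that assumes the audio is stored as an uncompressed WAV file; the helper names `wav_sample_rate` and `needs_resampling` are hypothetical and not part of this repository.

```python
import wave

EXPECTED_RATE = 16_000  # sampling rate the model expects, in Hz


def wav_sample_rate(path):
    """Return the sampling rate (Hz) stored in a WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, expected_rate=EXPECTED_RATE):
    """Return True if the file must be resampled before inference."""
    return wav_sample_rate(path) != expected_rate
```

Files for which `needs_resampling` returns True should be resampled to 16 kHz first, for example with `torchaudio.transforms.Resample` as in the usage code.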

Usage

The model can be used directly (without a language model) as follows:

import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "el", split="test") 

processor = Wav2Vec2Processor.from_pretrained("PereLluis13/wav2vec2-large-xlsr-53-greek") 
model = Wav2Vec2ForCTC.from_pretrained("PereLluis13/wav2vec2-large-xlsr-53-greek")

resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
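Decoding without a language model amounts to greedy CTC decoding: take the most likely token per frame (the `argmax` above), merge consecutive repeats, and drop the blank token. The sketch below illustrates that collapsing step on a toy sequence. The vocabulary, the blank ID of 0, and the function name `ctc_greedy_collapse` are assumptions made for illustration; the real vocabulary and blank ID come from the processor's tokenizer.

```python
def ctc_greedy_collapse(ids, blank_id=0):
    """Collapse frame-level IDs: merge consecutive repeats, then drop blanks."""
    collapsed = []
    previous = None
    for token_id in ids:
        # Keep a token only if it differs from the previous frame and is not blank.
        if token_id != previous and token_id != blank_id:
            collapsed.append(token_id)
        previous = token_id
    return collapsed


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {0: "<pad>", 1: "α", 2: "λ"}

# A blank between the two runs of "λ" keeps the doubled letter.
frame_ids = [1, 1, 0, 2, 2, 0, 2, 1]
text = "".join(toy_vocab[i] for i in ctc_greedy_collapse(frame_ids))
print(text)  # αλλα
```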

Evaluation

The model can be evaluated as follows on the Greek test data of Common Voice.

import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import re

test_dataset = load_dataset("common_voice", "el", split="test") 
wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("PereLluis13/wav2vec2-large-xlsr-53-greek") 
model = Wav2Vec2ForCTC.from_pretrained("PereLluis13/wav2vec2-large-xlsr-53-greek") 
model.to("cuda")

chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"\“\%\‘\”\�]'


resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Running inference.
# We run the model on each batch and decode the predicted token IDs to text
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits

    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))

Test Result: 20.89 %
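For readers unfamiliar with the metric, word error rate is the word-level edit distance (substitutions, deletions, insertions) divided by the number of words in the reference. The function below is a minimal per-sentence sketch of that definition. It is not the implementation behind `load_metric("wer")`, which scores the whole test set at once, and the name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j]: edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row
    return previous_row[-1] / len(ref_words)


# One substituted word out of three reference words.
score = word_error_rate("καλημέρα σας κύριε", "καλημέρα σαν κύριε")
print("WER: {:.2f}".format(100 * score))  # WER: 33.33
```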

Training

The Common Voice train and validation splits and the CSS10 dataset were used for training, with CSS10 added as an extra split of the dataset. The sampling rate and format of the CSS10 files are different, so the function speech_file_to_array_fn was changed to:

    import librosa
    import soundfile as sf

    def speech_file_to_array_fn(batch):
        try:
            # Reuse the cached WAV copy if it was already written.
            speech_array, sampling_rate = sf.read(batch["path"] + ".wav")
        except Exception:
            # Otherwise load the original file at 16 kHz and cache it as WAV.
            speech_array, sampling_rate = librosa.load(batch["path"], sr=16000, res_type='zero_order_hold')
            sf.write(batch["path"] + ".wav", speech_array, sampling_rate, subtype='PCM_24')
        batch["speech"] = speech_array
        batch["sampling_rate"] = sampling_rate
        batch["target_text"] = batch["text"]
        return batch

As suggested by Florian Zimmermeister.

The script used for training can be found in run_common_voice.py, still pending a PR. The only changes are to speech_file_to_array_fn. The effective batch size was kept at 32 (using gradient_accumulation_steps) on one of the OVH machines with a V100 GPU (thank you very much, OVH). The model was trained for 40 epochs: the first 20 on the train+validation splits only, with the CSS10 data added as an extra split at the 20th epoch.
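The batch size and the two-phase schedule can be illustrated with a short sketch. The per-device batch size of 8 with 4 accumulation steps is a hypothetical combination, since the source only gives the total of 32. The split name "css10", the helper names, and 0-indexed epoch counting are likewise assumptions.

```python
def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices=1):
    """Number of examples contributing to each optimizer step."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


def splits_for_epoch(epoch, css10_start_epoch=20):
    """Splits used at a given epoch, assuming epochs are counted from 0."""
    splits = ["train", "validation"]
    if epoch >= css10_start_epoch:
        splits.append("css10")  # hypothetical name for the extra CSS10 split
    return splits


# Hypothetical values: 8 examples per step, accumulated over 4 steps, on one GPU.
print(effective_batch_size(per_device_batch_size=8, gradient_accumulation_steps=4))  # 32

print(splits_for_epoch(0))   # ['train', 'validation']
print(splits_for_epoch(20))  # ['train', 'validation', 'css10']
```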