Fine-tuned Japanese Wav2Vec2 model for speech recognition using XLSR-53 large

This model is facebook/wav2vec2-large-xlsr-53 fine-tuned on Japanese speech from the Common Voice, JVS, and JSUT datasets. When using this model, make sure your speech input is sampled at 16 kHz.
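
As a quick way to check the 16 kHz requirement before running inference, the sketch below reads the sample rate from a WAV header using only the Python standard library. The helper names `wav_sample_rate` and `needs_resampling` are hypothetical and not part of this model or of any library, and the `wave` module only handles uncompressed PCM WAV files, so other formats would need a different reader.

```python
import wave

TARGET_SAMPLE_RATE = 16_000  # the rate this model card asks for


def wav_sample_rate(path):
    """Return the sample rate (in Hz) stored in a PCM WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, target_rate=TARGET_SAMPLE_RATE):
    """Return True if the file's sample rate differs from the target rate."""
    return wav_sample_rate(path) != target_rate
```

If `needs_resampling` returns True, the audio has to be resampled before it is passed to the processor. The usage example in this card does that at load time by calling `librosa.load` with `sr=16_000`.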

Usage

The model can be used directly (without a language model) as follows.

import torch
import librosa
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = "ja"
MODEL_ID = "Ivydata/wav2vec2-large-xlsr-53-japanese"
SAMPLES = 10

test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# Preprocess the dataset:
# load each audio file as an array, resampled to 16 kHz on load.
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = batch["sentence"].upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentences = processor.batch_decode(predicted_ids)

for i, predicted_sentence in enumerate(predicted_sentences):
    print("-" * 100)
    print("Reference: ", test_dataset[i]["sentence"])
    print("Prediction:", predicted_sentence)
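
The argmax followed by `batch_decode` in the code above amounts to greedy CTC decoding: take the most likely token per frame, merge consecutive repeats, then drop the blank token. The sketch below illustrates only that collapsing rule on a toy example. The function name `ctc_greedy_collapse`, the blank id of 0, and the three-entry vocabulary are all assumptions made for illustration; the real token ids and blank token come from the model's own tokenizer.

```python
from itertools import groupby


def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Collapse per-frame argmax ids into a label sequence (greedy CTC).

    Step 1 merges runs of identical ids; step 2 removes the blank id.
    blank_id=0 is an assumption for this toy example.
    """
    merged = [token_id for token_id, _ in groupby(frame_ids)]
    return [token_id for token_id in merged if token_id != blank_id]


# Hypothetical toy vocabulary, not the model's real one.
toy_vocab = {0: "", 1: "あ", 2: "い"}

# Per-frame argmax ids; the blank between the last two 2s keeps them distinct.
frames = [0, 1, 1, 0, 2, 2, 0, 2]
labels = ctc_greedy_collapse(frames)
text = "".join(toy_vocab[token_id] for token_id in labels)
print(labels, text)
```

A blank between two identical tokens is what lets CTC emit a doubled character, which is why repeats are merged before blanks are removed and not the other way around.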

Test Result

In the table below, I report the Character Error Rate (CER) of this model on the TEDxJP-10K dataset, alongside two other models for comparison.

Model | CER
Ivydata/wav2vec2-large-xlsr-53-japanese | 27.87%
jonatasgrosman/wav2vec2-large-xlsr-53-japanese | 34.18%
vumichien/wav2vec2-large-xlsr-japanese | 37.72%
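
For readers unfamiliar with the metric, CER is conventionally the character-level edit distance between prediction and reference, divided by the reference length. The sketch below is a minimal pure-Python version under that conventional definition. The card does not say which tool or text normalization produced the numbers above, so this is not necessarily how they were computed, and the function names are hypothetical.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two strings, counted in characters."""
    # previous[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


# One reference/prediction pair taken from the inference examples in this card.
reference = "ただ選択するのではなくどう考えて選択をするのか"
prediction = "ただ洗濯するのではなくどう考えて洗択をするのか"
print(f"CER: {cer(reference, prediction):.2%}")
```

Working on characters instead of words suits Japanese, which is written without spaces between words.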

Test Inference Examples

Reference | Prediction
ただ選択するのではなくどう考えて選択をするのか | ただ洗濯するのではなくどう考えて洗択をするのか
この巨大な構造物を宇宙に作ることができた人間 | この巨大な構造物を宇宙に作ることができた人間
何かしら嫌いになっていってしまったわけですよね | 何にかしら気段になっっていってしまったおけどすね
そんな僕だからこそ言えることは筋肉を変えれば自分が変わってくるし | んな僕らからこスえることは筋肉を変えれば自分が変わってくし
そうするとその言葉を使って未来のイメージを形作っていくことができると | そうするとその言葉を使って未来のイメーージを形作っていことができると

Citation

To cite this model, you can use the following BibTeX entry:

@misc{Ivydata2023-wav2vec2-xlsr53-large-japanese,
  title={Fine-tuned Japanese Wav2Vec2 model for speech recognition using XLSR-53 large},
  author={Kosuke Suzuki},
  howpublished={\url{https://huggingface.co/Ivydata/wav2vec2-large-xlsr-53-japanese/}},
  year={2023}
}