Fine-tuned Japanese Whisper model for speech recognition using whisper-small

This model is openai/whisper-small fine-tuned on Japanese speech using the Common Voice, JVS, and JSUT datasets. When using this model, make sure that your speech input is sampled at 16 kHz.

Usage

The model can be used directly as follows.

from transformers import WhisperForConditionalGeneration, WhisperProcessor
from datasets import load_dataset
import librosa
import torch

LANG_ID = "ja"
MODEL_ID = "Ivydata/whisper-small-japanese"
SAMPLES = 10

test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")
processor = WhisperProcessor.from_pretrained(MODEL_ID)
model = WhisperForConditionalGeneration.from_pretrained(MODEL_ID)
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(
    language="ja", task="transcribe"
)
model.config.suppress_tokens = []

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = batch["sentence"].upper()
    batch["sampling_rate"] = sampling_rate
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
sample = test_dataset[0]
input_features = processor(sample["speech"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features
predicted_ids = model.generate(input_features)

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=False)
# ['<|startoftranscript|><|ja|><|transcribe|><|notimestamps|>木村さんに電話を貸してもらいました。<|endoftext|>']

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
# ['木村さんに電話を貸してもらいました。']

Test Result

In the table below I report the Character Error Rate (CER) of each model, evaluated on the TEDxJP-10K dataset. Lower is better.

Model                                            CER
Ivydata/whisper-small-japanese                   23.10%
Ivydata/wav2vec2-large-xlsr-53-japanese          27.87%
jonatasgrosman/wav2vec2-large-xlsr-53-japanese   34.18%
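
For readers who want to compute a comparable metric on their own data, here is a minimal sketch of corpus-level CER: total character edit distance divided by total reference length. This is an illustration and not necessarily the script that produced the numbers above. The function names `levenshtein` and `cer` are hypothetical, and any text normalization (punctuation, whitespace, character width) is left out and would need to match whatever evaluation setup you compare against.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Character edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(references, hypotheses) -> float:
    """Corpus-level CER: summed edit distance over summed reference length."""
    total_edits = sum(levenshtein(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    return total_edits / total_chars


# Illustrative strings only, not taken from TEDxJP-10K.
refs = ["木村さんに電話を貸してもらいました"]
hyps = ["木村さんに電話を貸してもらいます"]
print(f"CER: {cer(refs, hyps):.2%}")
```

Summing edits over the whole corpus before dividing weights long utterances more heavily than averaging per-utterance CER would, so the two conventions can give different numbers on the same data.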