---
language: ja
datasets:
- common_voice
metrics:
- wer
- cer
model-index:
- name: wav2vec2-xls-r-300m finetuned on Japanese Hiragana with no word boundaries by Hyungshin Ryu of SLPlab
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice Japanese
      type: common_voice
      args: ja
    metrics:
    - name: Test WER
      type: wer
      value: 90.66
    - name: Test CER
      type: cer
      value: 19.35
---

# Wav2Vec2-XLS-R-300M-Japanese-Hiragana

[facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) fine-tuned on Japanese Hiragana characters using the [Common Voice](https://huggingface.co/datasets/common_voice) and [JSUT](https://sites.google.com/site/shinnosuketakamichi/publication/jsut) datasets.
The model outputs sentences without word boundaries.
Audio input should be sampled at 16 kHz.

## Usage

The model can be used directly as follows:

```python3
# Install the dependencies first (in a notebook, prefix each command with "!"):
#   pip install mecab-python3
#   pip install unidic-lite
#   pip install pykakasi

import re

import MeCab
import pykakasi
import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# load datasets, processor, and model
test_dataset = load_dataset("common_voice", "ja", split="test")
wer = load_metric("wer")
cer = load_metric("cer")

PTM = "slplab/wav2vec2-xls-r-300m-japanese-hiragana"
print("PTM:", PTM)
processor = Wav2Vec2Processor.from_pretrained(PTM)
model = Wav2Vec2ForCTC.from_pretrained(PTM)
device = "cuda" if torch.cuda.is_available() else "cpu"  # fall back to CPU when no GPU is available
model.to(device)

# preprocess datasets
wakati = MeCab.Tagger("-Owakati")  # word segmenter; unused below because the targets have no word boundaries
kakasi = pykakasi.kakasi()
chars_to_ignore_regex = "[、,。]"

def speech_file_to_array_fn_hiragana_nospace(batch):
    # strip punctuation, then convert the sentence to hiragana without spaces
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).strip()
    batch["sentence"] = ''.join([d['hira'] for d in kakasi.convert(batch["sentence"])])
    # load the audio and resample it to 16 kHz
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    resampler = torchaudio.transforms.Resample(sampling_rate, 16000)
    batch["speech"] = resampler(speech_array).squeeze()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn_hiragana_nospace)

# evaluate
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values.to(device)).logits
    # greedy CTC decoding: take the most likely token at each frame
    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

# print the first ten predictions next to their references
for i in range(10):
    print("="*20)
    print("Prd:", result[i]["pred_strings"])
    print("Ref:", result[i]["sentence"])

print("WER: {:.2f}%".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
print("CER: {:.2f}%".format(100 * cer.compute(predictions=result["pred_strings"], references=result["sentence"])))
```

| Original Text | Prediction |
| ------------- | ------------- |
| この料理は家庭で作れます。 | このりょうりはかていでつくれます |
| 日本人は、決して、ユーモアと無縁な人種ではなかった。 | にっぽんじんはけしてゆうもあどむえんなじんしゅではなかった |
| 木村さんに電話を貸してもらいました。 | きむらさんにでんわおかしてもらいました |

## Test Results

**WER:** 90.66%, **CER:** 19.35%

Because neither the references nor the predictions contain spaces, each sentence is in effect scored as a single word, so the WER behaves much like a sentence error rate. CER is the more informative figure for this model.

## Training

Trained on JSUT and the train and validation sets of Common Voice Japanese.
Tested on the test set of Common Voice Japanese.
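
The gap between the reported WER and CER follows from the missing word boundaries, and a small worked example may make this concrete. The sketch below is a plain Levenshtein-based error rate, not the code behind the `wer` and `cer` metrics. The helper names `edit_distance` and `error_rate` are illustrative, and the reference string is an assumed hiragana rendering of the third row of the prediction table.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(refs, hyps, tokenize):
    """Total edit distance divided by total reference length, after tokenizing."""
    errors = sum(edit_distance(tokenize(r), tokenize(h)) for r, h in zip(refs, hyps))
    total = sum(len(tokenize(r)) for r in refs)
    return errors / total


# Assumption: hiragana reference for the third table row (not taken from the dataset).
refs = ["きむらさんにでんわをかしてもらいました"]
# Prediction copied from the table above.
hyps = ["きむらさんにでんわおかしてもらいました"]

cer = error_rate(refs, hyps, tokenize=list)       # characters as tokens
wer = error_rate(refs, hyps, tokenize=str.split)  # whitespace-separated words as tokens

print(f"CER: {100 * cer:.2f}%")
print(f"WER: {100 * wer:.2f}%")
```

Under these assumptions, a single substituted character (を predicted as お) out of 19 gives a CER of roughly 5.3%, while the WER for the same sentence is 100% because the whole sentence is treated as one word.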