slplab committed on
Commit
abe4abe
1 Parent(s): 816e8ce

added a model card

Files changed (1)
  1. README.md +105 -0
README.md ADDED
@@ -0,0 +1,105 @@
---
language: ja
datasets:
- common_voice
metrics:
- wer
- cer
model-index:
- name: wav2vec2-xls-r-300m finetuned on Japanese Hiragana with no word boundaries by Hyungshin Ryu of SLPlab
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice Japanese
      type: common_voice
      args: ja
    metrics:
    - name: Test WER
      type: wer
      value: 90.66
    - name: Test CER
      type: cer
      value: 19.35
---
# Wav2Vec2-XLS-R-300M-Japanese-Hiragana

Fine-tuned [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) on Japanese Hiragana characters using the [Common Voice](https://huggingface.co/datasets/common_voice) and [JSUT](https://sites.google.com/site/shinnosuketakamichi/publication/jsut) corpora.
The output transcriptions contain no word boundaries. Audio inputs should be sampled at 16 kHz.

## Usage

The model can be used directly as follows:

```python
# Install dependencies (notebook syntax; use `pip install ...` from a shell otherwise).
!pip install mecab-python3
!pip install unidic-lite
!pip install pykakasi

import re

import torch
import torchaudio
import MeCab
import pykakasi
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load the Common Voice Japanese test set, the metrics, the processor, and the model.
test_dataset = load_dataset("common_voice", "ja", split="test")
wer = load_metric("wer")
cer = load_metric("cer")
PTM = "slplab/wav2vec2-xls-r-300m-japanese-hiragana"
print("PTM:", PTM)
processor = Wav2Vec2Processor.from_pretrained(PTM)
model = Wav2Vec2ForCTC.from_pretrained(PTM)
device = "cuda"
model.to(device)

# Preprocessing: strip punctuation, convert references to hiragana with no spaces,
# and resample the audio to 16 kHz.
wakati = MeCab.Tagger("-Owakati")  # initialized but not needed for the no-space hiragana targets
kakasi = pykakasi.kakasi()
chars_to_ignore_regex = "[、,。]"

def speech_file_to_array_fn_hiragana_nospace(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, "", batch["sentence"]).strip()
    batch["sentence"] = "".join([d["hira"] for d in kakasi.convert(batch["sentence"])])
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    resampler = torchaudio.transforms.Resample(sampling_rate, 16000)
    batch["speech"] = resampler(speech_array).squeeze()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn_hiragana_nospace)

# Evaluation: greedy CTC decoding of the model outputs.
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values.to(device)).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

# Show a few predictions next to their references.
for i in range(10):
    print("=" * 20)
    print("Prd:", result[i]["pred_strings"])
    print("Ref:", result[i]["sentence"])

print("WER: {:.2f}%".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
print("CER: {:.2f}%".format(100 * cer.compute(predictions=result["pred_strings"], references=result["sentence"])))
```
| Original Text | Prediction |
| ------------- | ------------- |
| この料理は家庭で作れます。 | このりょうりはかていでつくれます |
| 日本人は、決して、ユーモアと無縁な人種ではなかった。 | にっぽんじんはけしてゆうもあどむえんなじんしゅではなかった |
| 木村さんに電話を貸してもらいました。 | きむらさんにでんわおかしてもらいました |

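For transcribing a single local recording instead of running the full Common Voice evaluation, a minimal sketch is shown below; the file path `sample.wav` is a placeholder, and CPU inference is assumed.

```python
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

MODEL_ID = "slplab/wav2vec2-xls-r-300m-japanese-hiragana"
processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID).eval()

# "sample.wav" is a placeholder; any recording works once resampled to 16 kHz mono.
speech, sr = torchaudio.load("sample.wav")
speech = torchaudio.transforms.Resample(sr, 16000)(speech).squeeze()

inputs = processor(speech, sampling_rate=16000, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(inputs.input_values).logits
pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids)[0])  # hiragana transcription, no word boundaries
```
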
## Test Results

**WER:** 90.66%, **CER:** 19.35%

Note that because the transcriptions contain no word boundaries, WER is effectively computed per utterance rather than per word, which inflates the score; CER is the more informative metric for this model.
## Training

Trained on JSUT and on the train+validation sets of Common Voice Japanese; tested on the Common Voice Japanese test set.
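The original training script is not part of this card. Purely as an illustration, the sketch below shows how kanji/kana transcripts can be converted into the spacing-free hiragana targets this model predicts, reusing the same pykakasi conversion as the evaluation code above; the example sentence is taken from the table above, not from the training data.

```python
import re

import pykakasi

kakasi = pykakasi.kakasi()
chars_to_ignore_regex = "[、,。]"

def to_hiragana_target(sentence):
    # Drop punctuation, then convert every segment to hiragana and join without spaces,
    # matching the label format this model was fine-tuned to predict.
    sentence = re.sub(chars_to_ignore_regex, "", sentence).strip()
    return "".join(item["hira"] for item in kakasi.convert(sentence))

print(to_hiragana_target("この料理は家庭で作れます。"))  # expected (approximately): このりょうりはかていでつくれます
```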