Automatic Speech Recognition
Transformers
Safetensors
Japanese
whisper
audio
hf-asr-leaderboard
Eval Results
Inference Endpoints
asahi417 committed
Commit 7e01ace
1 Parent(s): b168728

Update README.md

Files changed (1)
  1. README.md +20 -17
README.md CHANGED
@@ -67,32 +67,35 @@ we employ OpenAI's [Whisper large-v3](https://huggingface.co/openai/whisper-large-v3)
 teacher large-v3 model and the decoder with two layers initialized from the first and last layer of the large-v3 model.
 Kotoba-Whisper is **6.3x faster than large-v3**, while retaining an error rate as low as that of large-v3.
 
- As the initial version, we release ***kotoba-whisper-v1.0*** trained on the `large` subset of [ReazonSpeech](https://huggingface.co/datasets/reazon-research/reazonspeech)
+ As the successor to our first model, [kotoba-whisper-v1.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0), we release ***kotoba-whisper-v2.0***, trained on the `all` subset of [ReazonSpeech](https://huggingface.co/datasets/reazon-research/reazonspeech)
 (the largest speech-transcription paired dataset in Japanese, extracted from Japanese TV audio recordings),
- which amounts to 1,253 hours of audio with 16,861,235 characters of transcriptions (5 sec audio with 18 text tokens on average) after
+ which amounts to 7,203,957 audio clips (5 sec audio with 18 text tokens on average) after
 transcriptions with a WER above 10 are removed (see [WER Filter](https://huggingface.co/distil-whisper/distil-large-v3#wer-filter) for details).
 The model was trained for 8 epochs with batch size 256 at a sampling rate of 16kHz, and the training and evaluation code to reproduce kotoba-whisper is available at [https://github.com/kotoba-tech/kotoba-whisper](https://github.com/kotoba-tech/kotoba-whisper).
 
- Kotoba-whisper-v1.0 achieves better CER and WER than [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) on the in-domain held-out test set
+ Kotoba-whisper-v2.0 achieves better CER and WER than [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) on the in-domain held-out test set
 from ReazonSpeech, and achieves competitive CER and WER on the out-of-domain test sets, including [JSUT basic 5000](https://sites.google.com/site/shinnosuketakamichi/publication/jsut) and
 the Japanese subset of [CommonVoice 8.0](https://huggingface.co/datasets/common_voice) (see [Evaluation](#evaluation) for details).
 
 - ***CER***
 
- | Model | CommonVoice 8.0 (Japanese) | JSUT Basic 5000 | ReazonSpeech Test |
- |:------------------------------------------------------------------------------------------------|---------------------------:|----------------:|------------------:|
- | [**kotoba-tech/kotoba-whisper-v1.0**](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0) | 9.44 | 8.48 | **12.60** |
- | [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | **8.52** | **7.18** | 15.18 |
- | [openai/whisper-medium](https://huggingface.co/openai/whisper-medium) | 11.34 | 9.87 | 29.56 |
- | [openai/whisper-small](https://huggingface.co/openai/whisper-small) | 15.26 | 14.22 | 34.29 |
- | [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny) | 46.86 | 35.69 | 96.69 |
+ | Model | [CommonVoice 8.0](https://huggingface.co/datasets/japanese-asr/ja_asr.common_voice_8_0) | [JSUT basic5000](https://huggingface.co/datasets/japanese-asr/ja_asr.jsut_basic5000) | [ReazonSpeech Test](https://huggingface.co/datasets/japanese-asr/ja_asr.reazonspeech_test) |
+ |:---------------------------------------------------------------------------------------------|-------------------:|-----------------:|--------------------:|
+ | [**kotoba-tech/kotoba-whisper-v2.0**](https://huggingface.co/kotoba-tech/kotoba-whisper-v2.0)| 9.20 | 8.40 | **11.63** |
+ | [kotoba-tech/kotoba-whisper-v1.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0) | 9.44 | 8.48 | 12.60 |
+ | [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | **8.52** | **7.18** | 15.18 |
+ | [openai/whisper-medium](https://huggingface.co/openai/whisper-medium) | 11.34 | 9.87 | 29.56 |
+ | [openai/whisper-small](https://huggingface.co/openai/whisper-small) | 15.26 | 14.22 | 34.29 |
+ | [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny) | 46.86 | 35.69 | 96.69 |
+
 
 - ***WER***
 
- | Model | CommonVoice 8.0 (Japanese) | JSUT Basic 5000 | ReazonSpeech Test |
+ | Model | [CommonVoice 8.0](https://huggingface.co/datasets/japanese-asr/ja_asr.common_voice_8_0) | [JSUT basic5000](https://huggingface.co/datasets/japanese-asr/ja_asr.jsut_basic5000) | [ReazonSpeech Test](https://huggingface.co/datasets/japanese-asr/ja_asr.reazonspeech_test) |
 |:------------------------------------------------------------------------------------------------|---------------------------:|----------------:|------------------:|
- | [**kotoba-tech/kotoba-whisper-v1.0**](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0) | 59.27 | 64.36 | **56.62** |
- | [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | **55.41** | **59.34** | 60.23 |
+ | [**kotoba-tech/kotoba-whisper-v2.0**](https://huggingface.co/kotoba-tech/kotoba-whisper-v2.0) | 58.8 | 63.7 | **55.6** |
+ | [kotoba-tech/kotoba-whisper-v1.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0) | 59.27 | 64.36 | 56.62 |
+ | [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | **55.41** | **59.34** | 60.23 |
 | [openai/whisper-medium](https://huggingface.co/openai/whisper-medium) | 63.64 | 69.52 | 76.04 |
 | [openai/whisper-small](https://huggingface.co/openai/whisper-small) | 74.21 | 82.02 | 82.99 |
 | [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny) | 93.78 | 97.72 | 94.85 |
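
For context on the WER filter referenced in this hunk: following the distil-whisper recipe linked above, a training example is discarded when its reference transcription and the teacher's pseudo-label disagree by more than 10 WER. The sketch below is illustrative only, not kotoba-whisper training code; the function name is invented, and Japanese text is assumed to be pre-segmented into whitespace-separated tokens so that a word-level metric is meaningful.

```python
# Illustrative sketch of the WER filter described above (threshold = 10),
# not the actual kotoba-whisper training code.
from evaluate import load
from transformers.models.whisper.english_normalizer import BasicTextNormalizer

wer_metric = load("wer")
normalizer = BasicTextNormalizer()

def keep_example(reference: str, pseudo_label: str, threshold: float = 10.0) -> bool:
    """Keep a training pair only if WER(reference, teacher pseudo-label) <= threshold (%)."""
    ref, hyp = normalizer(reference), normalizer(pseudo_label)
    wer = 100 * wer_metric.compute(references=[ref], predictions=[hyp])
    return wer <= threshold
```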
@@ -103,7 +106,7 @@ it inherits the benefit of the improved latency compared to [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3)
 
 | Model | Params / M | Rel. Latency |
 |----------------------------------------------------------------------------------------------|------------|--------------|
- | **[kotoba-tech/kotoba-whisper-v1.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0)**| **756** | **6.3** |
+ | **[kotoba-tech/kotoba-whisper-v2.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v2.0)**| **756** | **6.3** |
 | [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | 1550 | 1.0 |
 
 
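
The relative-latency column compares wall-clock transcription time against whisper-large-v3 (higher is faster). A rough way to reproduce such a comparison with the `transformers` pipeline is sketched below; this is not the official benchmark script, and the audio path is a placeholder.

```python
# Rough latency comparison between kotoba-whisper-v2.0 and whisper-large-v3
# on the same audio file (not the official benchmark setup).
import time
import torch
from transformers import pipeline

def mean_latency(model_id: str, audio_path: str, n_runs: int = 5) -> float:
    pipe = pipeline(
        "automatic-speech-recognition",
        model=model_id,
        torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
        device="cuda:0" if torch.cuda.is_available() else "cpu",
    )
    pipe(audio_path)  # warm-up run
    start = time.time()
    for _ in range(n_runs):
        pipe(audio_path)
    return (time.time() - start) / n_runs

# "sample_ja.wav" is a placeholder path; relative latency = large-v3 time / kotoba time.
t_kotoba = mean_latency("kotoba-tech/kotoba-whisper-v2.0", "sample_ja.wav")
t_large = mean_latency("openai/whisper-large-v3", "sample_ja.wav")
print(f"relative latency: {t_large / t_kotoba:.1f}x")
```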
@@ -126,7 +129,7 @@ from transformers import pipeline
 from datasets import load_dataset
 
 # config
- model_id = "kotoba-tech/kotoba-whisper-v1.0"
+ model_id = "kotoba-tech/kotoba-whisper-v2.0"
 torch_dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32
 device = "cuda:0" if torch.cuda.is_available() else "cpu"
 model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
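
The hunk above only shows the configuration block of the README's transcription example. For orientation, a minimal end-to-end sketch of how that configuration typically feeds into the ASR pipeline is given below; the dataset name comes from the evaluation-table links above, while the split name and `generate_kwargs` are assumptions rather than the exact snippet from the README.

```python
import torch
from transformers import pipeline
from datasets import load_dataset

# config (as in the diff above)
model_id = "kotoba-tech/kotoba-whisper-v2.0"
torch_dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}

# build the ASR pipeline from the config
pipe = pipeline(
    "automatic-speech-recognition",
    model=model_id,
    torch_dtype=torch_dtype,
    device=device,
    model_kwargs=model_kwargs,
)

# load one sample from the in-domain test set (split name assumed to be "test")
dataset = load_dataset("japanese-asr/ja_asr.reazonspeech_test", split="test")
sample = dataset[0]["audio"]

# transcribe in Japanese
result = pipe(sample, generate_kwargs={"language": "japanese", "task": "transcribe"})
print(result["text"])
```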
@@ -188,7 +191,7 @@ from transformers import pipeline
 from datasets import load_dataset
 
 # config
- model_id = "kotoba-tech/kotoba-whisper-v1.0"
+ model_id = "kotoba-tech/kotoba-whisper-v2.0"
 torch_dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32
 device = "cuda:0" if torch.cuda.is_available() else "cpu"
 model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
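
This hunk shows only the changed `model_id` line of the README's second usage example. As a hypothetical illustration of one common variant of the same configuration, the sketch below enables chunked long-form transcription by passing `chunk_length_s` and `batch_size` to the pipeline; the chunk length, batch size, and audio path are assumptions, not values taken from the README.

```python
import torch
from transformers import pipeline

model_id = "kotoba-tech/kotoba-whisper-v2.0"
torch_dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}

# chunked long-form transcription: the pipeline splits long audio into
# fixed-length chunks and decodes them in parallel batches
pipe = pipeline(
    "automatic-speech-recognition",
    model=model_id,
    torch_dtype=torch_dtype,
    device=device,
    model_kwargs=model_kwargs,
    chunk_length_s=30,   # chunk length in seconds (assumed value)
    batch_size=16,       # number of chunks decoded at once (assumed value)
)

result = pipe("long_recording_ja.mp3", generate_kwargs={"language": "japanese", "task": "transcribe"})
print(result["text"])
```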
@@ -261,7 +264,7 @@ from evaluate import load
 from transformers.models.whisper.english_normalizer import BasicTextNormalizer
 
 # model config
- model_id = "kotoba-tech/kotoba-whisper-v1.0"
+ model_id = "kotoba-tech/kotoba-whisper-v2.0"
 torch_dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32
 device = "cuda:0" if torch.cuda.is_available() else "cpu"
 model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
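
This last hunk updates the configuration of the README's evaluation example. A condensed sketch of a CER evaluation loop built on that configuration is shown below; the dataset name is taken from the result-table links, while the split and column names are assumptions and the loop is simplified (no batching) relative to the README's script.

```python
import torch
from datasets import load_dataset
from evaluate import load
from transformers import pipeline
from transformers.models.whisper.english_normalizer import BasicTextNormalizer

# model config (as in the diff above)
model_id = "kotoba-tech/kotoba-whisper-v2.0"
torch_dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}

pipe = pipeline(
    "automatic-speech-recognition",
    model=model_id,
    torch_dtype=torch_dtype,
    device=device,
    model_kwargs=model_kwargs,
)

# evaluation data and metric (split and column names assumed)
dataset = load_dataset("japanese-asr/ja_asr.jsut_basic5000", split="test")
normalizer = BasicTextNormalizer()
cer_metric = load("cer")

predictions, references = [], []
for sample in dataset:
    out = pipe(sample["audio"], generate_kwargs={"language": "japanese", "task": "transcribe"})
    predictions.append(normalizer(out["text"]))
    references.append(normalizer(sample["transcription"]))

print(f"CER: {100 * cer_metric.compute(predictions=predictions, references=references):.2f}")
```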
 