Automatic Speech Recognition
Transformers
Safetensors
Japanese
whisper
audio
hf-asr-leaderboard
Eval Results
Inference Endpoints
asahi417 committed
Commit 7e01ace
1 Parent(s): b168728

Update README.md

Files changed (1)
  1. README.md +20 -17
README.md CHANGED
@@ -67,32 +67,35 @@ we employ OpenAI's [Whisper large-v3](https://huggingface.co/openai/whisper-large-v3)
 teacher large-v3 model and the decoder with two layers initialized from the first and last layer of the large-v3 model.
 Kotoba-Whisper is **6.3x faster than large-v3**, while retaining an error rate as low as that of large-v3.
 
- As the initial version, we release ***kotoba-whisper-v1.0*** trained on the `large` subset of [ReazonSpeech](https://huggingface.co/datasets/reazon-research/reazonspeech)
+ As the successor to our first model, [kotoba-whisper-v1.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0), we release ***kotoba-whisper-v2.0***, trained on the `all` subset of [ReazonSpeech](https://huggingface.co/datasets/reazon-research/reazonspeech)
 (the largest speech-transcription paired dataset in Japanese, extracted from Japanese TV audio recordings),
- which amounts to 1,253 hours of audio with 16,861,235 characters of transcriptions (5 sec audio with 18 text tokens on average) after
+ which amounts to 7,203,957 audio clips (5 sec audio with 18 text tokens on average) after
 transcriptions with a WER above 10 are removed (see [WER Filter](https://huggingface.co/distil-whisper/distil-large-v3#wer-filter) for details).
 The model was trained for 8 epochs with batch size 256 at a sampling rate of 16kHz, and the training and evaluation code to reproduce kotoba-whisper is available at [https://github.com/kotoba-tech/kotoba-whisper](https://github.com/kotoba-tech/kotoba-whisper).
 
- Kotoba-whisper-v1.0 achieves better CER and WER than [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) on the in-domain held-out test set
+ Kotoba-whisper-v2.0 achieves better CER and WER than [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) on the in-domain held-out test set
 from ReazonSpeech, and achieves competitive CER and WER on the out-of-domain test sets, including [JSUT basic 5000](https://sites.google.com/site/shinnosuketakamichi/publication/jsut) and
 the Japanese subset of [CommonVoice 8.0](https://huggingface.co/datasets/common_voice) (see [Evaluation](#evaluation) for details).
 
 - ***CER***
 
- | Model | CommonVoice 8.0 (Japanese) | JSUT Basic 5000 | ReazonSpeech Test |
- |:------------------------------------------------------------------------------------------------|---------------------------:|----------------:|------------------:|
- | [**kotoba-tech/kotoba-whisper-v1.0**](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0) | 9.44 | 8.48 | **12.60** |
- | [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | **8.52** | **7.18** | 15.18 |
- | [openai/whisper-medium](https://huggingface.co/openai/whisper-medium) | 11.34 | 9.87 | 29.56 |
- | [openai/whisper-small](https://huggingface.co/openai/whisper-small) | 15.26 | 14.22 | 34.29 |
- | [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny) | 46.86 | 35.69 | 96.69 |
+ | Model | [CommonVoice 8.0](https://huggingface.co/datasets/japanese-asr/ja_asr.common_voice_8_0) | [JSUT basic5000](https://huggingface.co/datasets/japanese-asr/ja_asr.jsut_basic5000) | [ReazonSpeech Test](https://huggingface.co/datasets/japanese-asr/ja_asr.reazonspeech_test) |
+ |:---------------------------------------------------------------------------------------------|-------------------:|-----------------:|--------------------:|
+ | [**kotoba-tech/kotoba-whisper-v2.0**](https://huggingface.co/kotoba-tech/kotoba-whisper-v2.0)| 9.20 | 8.40 | **11.63** |
+ | [kotoba-tech/kotoba-whisper-v1.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0) | 9.44 | 8.48 | 12.60 |
+ | [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | **8.52** | **7.18** | 15.18 |
+ | [openai/whisper-medium](https://huggingface.co/openai/whisper-medium) | 11.34 | 9.87 | 29.56 |
+ | [openai/whisper-small](https://huggingface.co/openai/whisper-small) | 15.26 | 14.22 | 34.29 |
+ | [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny) | 46.86 | 35.69 | 96.69 |
+
 
 - ***WER***
 
- | Model | CommonVoice 8.0 (Japanese) | JSUT Basic 5000 | ReazonSpeech Test |
+ | Model | [CommonVoice 8.0](https://huggingface.co/datasets/japanese-asr/ja_asr.common_voice_8_0) | [JSUT basic5000](https://huggingface.co/datasets/japanese-asr/ja_asr.jsut_basic5000) | [ReazonSpeech Test](https://huggingface.co/datasets/japanese-asr/ja_asr.reazonspeech_test) |
 |:------------------------------------------------------------------------------------------------|---------------------------:|----------------:|------------------:|
- | [**kotoba-tech/kotoba-whisper-v1.0**](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0) | 59.27 | 64.36 | **56.62** |
- | [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | **55.41** | **59.34** | 60.23 |
+ | [**kotoba-tech/kotoba-whisper-v2.0**](https://huggingface.co/kotoba-tech/kotoba-whisper-v2.0) | 58.8 | 63.7 | **55.6** |
+ | [kotoba-tech/kotoba-whisper-v1.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0) | 59.27 | 64.36 | 56.62 |
+ | [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | **55.41** | **59.34** | 60.23 |
 | [openai/whisper-medium](https://huggingface.co/openai/whisper-medium) | 63.64 | 69.52 | 76.04 |
 | [openai/whisper-small](https://huggingface.co/openai/whisper-small) | 74.21 | 82.02 | 82.99 |
 | [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny) | 93.78 | 97.72 | 94.85 |
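
For context on the WER filter referenced in this hunk: following the distil-whisper recipe linked above, a training example is discarded when its reference transcription and the teacher's pseudo-label disagree by more than 10 WER. The sketch below is illustrative only, not kotoba-whisper training code; the function name is invented, and Japanese text is assumed to be pre-segmented into whitespace-separated tokens so that a word-level metric is meaningful.

```python
# Illustrative sketch of the WER filter described above (threshold = 10),
# not the actual kotoba-whisper training code.
from evaluate import load
from transformers.models.whisper.english_normalizer import BasicTextNormalizer

wer_metric = load("wer")
normalizer = BasicTextNormalizer()

def keep_example(reference: str, pseudo_label: str, threshold: float = 10.0) -> bool:
    """Keep a training pair only if WER(reference, teacher pseudo-label) <= threshold (%)."""
    ref, hyp = normalizer(reference), normalizer(pseudo_label)
    wer = 100 * wer_metric.compute(references=[ref], predictions=[hyp])
    return wer <= threshold
```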
@@ -103,7 +106,7 @@ it inherits the benefit of the improved latency compared to [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3)
 
 | Model | Params / M | Rel. Latency |
 |----------------------------------------------------------------------------------------------|------------|--------------|
- | **[kotoba-tech/kotoba-whisper-v1.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v1.0)**| **756** | **6.3** |
+ | **[kotoba-tech/kotoba-whisper-v2.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v2.0)**| **756** | **6.3** |
 | [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) | 1550 | 1.0 |
 
 
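
The relative-latency column compares wall-clock transcription time against whisper-large-v3 (higher is faster). A rough way to reproduce such a comparison with the `transformers` pipeline is sketched below; this is not the official benchmark script, and the audio path is a placeholder.

```python
# Rough latency comparison between kotoba-whisper-v2.0 and whisper-large-v3
# on the same audio file (not the official benchmark setup).
import time
import torch
from transformers import pipeline

def mean_latency(model_id: str, audio_path: str, n_runs: int = 5) -> float:
    pipe = pipeline(
        "automatic-speech-recognition",
        model=model_id,
        torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
        device="cuda:0" if torch.cuda.is_available() else "cpu",
    )
    pipe(audio_path)  # warm-up run
    start = time.time()
    for _ in range(n_runs):
        pipe(audio_path)
    return (time.time() - start) / n_runs

# "sample_ja.wav" is a placeholder path; relative latency = large-v3 time / kotoba time.
t_kotoba = mean_latency("kotoba-tech/kotoba-whisper-v2.0", "sample_ja.wav")
t_large = mean_latency("openai/whisper-large-v3", "sample_ja.wav")
print(f"relative latency: {t_large / t_kotoba:.1f}x")
```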
@@ -126,7 +129,7 @@ from transformers import pipeline
 from datasets import load_dataset
 
 # config
- model_id = "kotoba-tech/kotoba-whisper-v1.0"
+ model_id = "kotoba-tech/kotoba-whisper-v2.0"
 torch_dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32
 device = "cuda:0" if torch.cuda.is_available() else "cpu"
 model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
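
The hunk above only shows the configuration block of the README's transcription example. For orientation, a minimal end-to-end sketch of how that configuration typically feeds into the ASR pipeline is given below; the dataset name comes from the evaluation-table links above, while the split name and `generate_kwargs` are assumptions rather than the exact snippet from the README.

```python
import torch
from transformers import pipeline
from datasets import load_dataset

# config (as in the diff above)
model_id = "kotoba-tech/kotoba-whisper-v2.0"
torch_dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}

# build the ASR pipeline from the config
pipe = pipeline(
    "automatic-speech-recognition",
    model=model_id,
    torch_dtype=torch_dtype,
    device=device,
    model_kwargs=model_kwargs,
)

# load one sample from the in-domain test set (split name assumed to be "test")
dataset = load_dataset("japanese-asr/ja_asr.reazonspeech_test", split="test")
sample = dataset[0]["audio"]

# transcribe in Japanese
result = pipe(sample, generate_kwargs={"language": "japanese", "task": "transcribe"})
print(result["text"])
```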
@@ -188,7 +191,7 @@ from transformers import pipeline
 from datasets import load_dataset
 
 # config
- model_id = "kotoba-tech/kotoba-whisper-v1.0"
+ model_id = "kotoba-tech/kotoba-whisper-v2.0"
 torch_dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32
 device = "cuda:0" if torch.cuda.is_available() else "cpu"
 model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
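
This hunk shows only the changed `model_id` line of the README's second usage example. As a hypothetical illustration of one common variant of the same configuration, the sketch below enables chunked long-form transcription by passing `chunk_length_s` and `batch_size` to the pipeline; the chunk length, batch size, and audio path are assumptions, not values taken from the README.

```python
import torch
from transformers import pipeline

model_id = "kotoba-tech/kotoba-whisper-v2.0"
torch_dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}

# chunked long-form transcription: the pipeline splits long audio into
# fixed-length chunks and decodes them in parallel batches
pipe = pipeline(
    "automatic-speech-recognition",
    model=model_id,
    torch_dtype=torch_dtype,
    device=device,
    model_kwargs=model_kwargs,
    chunk_length_s=30,   # chunk length in seconds (assumed value)
    batch_size=16,       # number of chunks decoded at once (assumed value)
)

result = pipe("long_recording_ja.mp3", generate_kwargs={"language": "japanese", "task": "transcribe"})
print(result["text"])
```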
@@ -261,7 +264,7 @@ from evaluate import load
 from transformers.models.whisper.english_normalizer import BasicTextNormalizer
 
 # model config
- model_id = "kotoba-tech/kotoba-whisper-v1.0"
+ model_id = "kotoba-tech/kotoba-whisper-v2.0"
 torch_dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32
 device = "cuda:0" if torch.cuda.is_available() else "cpu"
 model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}
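
This last hunk updates the configuration of the README's evaluation example. A condensed sketch of a CER evaluation loop built on that configuration is shown below; the dataset name is taken from the result-table links, while the split and column names are assumptions and the loop is simplified (no batching) relative to the README's script.

```python
import torch
from datasets import load_dataset
from evaluate import load
from transformers import pipeline
from transformers.models.whisper.english_normalizer import BasicTextNormalizer

# model config (as in the diff above)
model_id = "kotoba-tech/kotoba-whisper-v2.0"
torch_dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}

pipe = pipeline(
    "automatic-speech-recognition",
    model=model_id,
    torch_dtype=torch_dtype,
    device=device,
    model_kwargs=model_kwargs,
)

# evaluation data and metric (split and column names assumed)
dataset = load_dataset("japanese-asr/ja_asr.jsut_basic5000", split="test")
normalizer = BasicTextNormalizer()
cer_metric = load("cer")

predictions, references = [], []
for sample in dataset:
    out = pipe(sample["audio"], generate_kwargs={"language": "japanese", "task": "transcribe"})
    predictions.append(normalizer(out["text"]))
    references.append(normalizer(sample["transcription"]))

print(f"CER: {100 * cer_metric.compute(predictions=predictions, references=references):.2f}")
```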
 