---
license: apache-2.0
language:
- multilingual
tags:
- automatic-speech-recognition
---

# reazonspeech-k2-v2-ja-en

`reazonspeech-k2-v2-ja-en` is an automatic speech recognition (ASR) model trained on the [ReazonSpeech v2.0 corpus](https://huggingface.co/datasets/reazon-research/reazonspeech) and [LibriSpeech](https://www.openslr.org/12/).

This model provides end-to-end Japanese and English speech recognition based on [Next-gen Kaldi](https://k2-fsa.org/).

## Model Architecture

* Character-based RNN-T model.
* This model uses an enhanced Transformer architecture called [Zipformer](https://arxiv.org/abs/2310.11230).

## Usage

We recommend using this model through the [reazonspeech](https://github.com/reazon-research/reazonspeech) library.

```python
from reazonspeech.k2.asr import load_model, transcribe, audio_from_path

audio = audio_from_path("speech.wav")
model = load_model(device="cpu", precision="fp32", language="ja-en")
ret = transcribe(model, audio)
print(ret.text)
```

This model uses byte-level BPE (BBPE), so Japanese tokens are represented by byte sequences such as `▁ƊģŊ`. Timestamps are associated with each transcribed token, but because Japanese tokens are encoded at the byte level, an individual token cannot be read directly; the text only becomes readable once the byte sequence is decoded back to UTF-8 (see the decoding example at the end of this card). English tokens, by contrast, are subword units printed in regular alphabetical text and can be read directly.

## Performance

The model was validated after training with the following results.

Word error rates (WERs, %):

| Decoding Method      | ReazonSpeech dev | ReazonSpeech test | LibriSpeech test-clean | LibriSpeech test-other |
|----------------------|------------------|-------------------|------------------------|------------------------|
| greedy_search        | 5.9              | 4.07              | 3.46                   | 8.35                   |
| modified_beam_search | 4.87             | 3.61              | 3.28                   | 8.07                   |

Character error rates (CERs, %) for Japanese:

| Decoding Method      | In-Distribution CER | JSUT | CommonVoice | TEDx |
| :------------------: | :-----------------: | :--: | :---------: | :--: |
| greedy search        | 12.56               | 6.93 | 9.75        | 9.67 |
| modified beam search | 11.59               | 6.97 | 9.55        | 9.51 |

Additional tests were performed with manually procured audio files (see test_wavs/transcripts.txt). The model performs reasonably well as long as the input audio contains a single language. However, when multiple languages appear in the same input, the model struggles to produce an accurate transcription (see test_multi). This can be avoided by segmenting the audio into chunks at pauses in speech (see the segmentation sketch at the end of this card).

- test_ja_1: 57% (CER)
- test_ja_2: 26% (CER)
- test_multi: 99% (CER)
- test_en_1: 12% (WER)
- test_en_2: 27% (WER)

## License

[Apache License 2.0](https://choosealicense.com/licenses/apache-2.0/)
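
## Example: Decoding Byte-Level Tokens

As noted in the Usage section, an individual byte-level token is not human-readable on its own. The snippet below is a minimal illustration of why, using hypothetical byte values (the UTF-8 encoding of "こんにちは"); it does not use the model's actual token inventory, and the mapping from printable token symbols back to raw bytes is handled internally by the reazonspeech library.

```python
# Hypothetical byte values: the UTF-8 encoding of "こんにちは".
# Each Japanese character spans three bytes here, so a single
# byte-level token may hold only a fragment of a character.
token_bytes = bytes(
    [0xE3, 0x81, 0x93,   # こ
     0xE3, 0x82, 0x93,   # ん
     0xE3, 0x81, 0xAB,   # に
     0xE3, 0x81, 0xA1,   # ち
     0xE3, 0x81, 0xAF]   # は
)

# A prefix that ends mid-character fails to decode on its own ...
try:
    token_bytes[:4].decode("utf-8")
except UnicodeDecodeError as err:
    print("partial byte sequence is unreadable:", err)

# ... but the complete byte sequence decodes to readable text.
print(token_bytes.decode("utf-8"))  # => こんにちは
```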
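
## Example: Segmenting Audio at Pauses

The multilingual failure mode described in the Performance section can often be mitigated by splitting the input at pauses before transcription. Below is a minimal sketch of one way to do this with a simple energy-based pause detector. It assumes `numpy` and `soundfile` are installed; the `split_on_pauses` helper and its thresholds (`silence_db`, `min_pause_s`) are illustrative, not part of the reazonspeech library, and a dedicated voice activity detector would be more robust in practice. Only `load_model`, `transcribe`, and `audio_from_path` are taken from the library, as shown in the Usage section.

```python
import tempfile

import numpy as np
import soundfile as sf
from reazonspeech.k2.asr import load_model, transcribe, audio_from_path

def split_on_pauses(samples, sr, frame_ms=30, silence_db=-35.0, min_pause_s=0.5):
    """Split a mono signal into chunks wherever energy stays low for min_pause_s."""
    frame = int(sr * frame_ms / 1000)
    n = len(samples) // frame
    # Per-frame RMS energy in dB (small epsilon avoids log of zero).
    rms = np.sqrt(np.mean(samples[: n * frame].reshape(n, frame) ** 2, axis=1))
    silent = 20 * np.log10(rms + 1e-10) < silence_db
    chunks, start, run = [], 0, 0
    for i, is_silent in enumerate(silent):
        run = run + 1 if is_silent else 0
        if run * frame >= min_pause_s * sr:
            end = (i + 1 - run) * frame  # cut just before the pause began
            if end > start:
                chunks.append(samples[start:end])
            start, run = (i + 1) * frame, 0
    if start < len(samples):
        chunks.append(samples[start:])
    return chunks

model = load_model(device="cpu", precision="fp32", language="ja-en")
samples, sr = sf.read("speech.wav")
if samples.ndim > 1:
    samples = samples.mean(axis=1)  # downmix to mono

for chunk in split_on_pauses(samples, sr):
    # Round-trip each chunk through a temporary WAV file so we stay on
    # the documented audio_from_path() API.
    with tempfile.NamedTemporaryFile(suffix=".wav") as f:
        sf.write(f.name, chunk, sr)
        print(transcribe(model, audio_from_path(f.name)).text)
```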