---
license: apache-2.0
language:
- multilingual
tags:
- automatic-speech-recognition
---

# reazonspeech-k2-v2-ja-en

`reazonspeech-k2-v2-ja-en` is an automatic speech recognition (ASR) model trained on the [ReazonSpeech v2.0 corpus](https://huggingface.co/datasets/reazon-research/reazonspeech) and [LibriSpeech](https://www.openslr.org/12/).

This model provides end-to-end Japanese and English speech recognition based on [Next-gen Kaldi](https://k2-fsa.org/).

## Model Architecture

* Character-based RNN-T model.
* This model uses an enhanced Transformer architecture called [Zipformer](https://arxiv.org/abs/2310.11230).

## Usage

We recommend using this model through the [reazonspeech](https://github.com/reazon-research/reazonspeech) library.

```python
from reazonspeech.k2.asr import load_model, transcribe, audio_from_path

audio = audio_from_path("speech.wav")
model = load_model(device="cpu", precision="fp32", language="ja-en")
ret = transcribe(model, audio)
print(ret.text)
```

This model uses byte-level BPE (BBPE), so Japanese tokens are represented by byte sequences such as `▁ƊģŊ`. Timestamps are associated with each transcribed token, but because Japanese tokens are encoded at the byte level, an individual token cannot be read directly; the text only becomes readable once the byte sequence is decoded back to UTF-8 (see the decoding example at the end of this card). English tokens, by contrast, are subword units printed in regular alphabetical text and can be read directly.

## Performance

The model was validated after training with the following results.

Word error rates (WERs, %):

| Decoding Method      | ReazonSpeech dev | ReazonSpeech test | LibriSpeech test-clean | LibriSpeech test-other |
|----------------------|------------------|-------------------|------------------------|------------------------|
| greedy_search        | 5.9              | 4.07              | 3.46                   | 8.35                   |
| modified_beam_search | 4.87             | 3.61              | 3.28                   | 8.07                   |

Character error rates (CERs, %) for Japanese:

| Decoding Method      | In-Distribution CER | JSUT | CommonVoice | TEDx |
| :------------------: | :-----------------: | :--: | :---------: | :--: |
| greedy search        | 12.56               | 6.93 | 9.75        | 9.67 |
| modified beam search | 11.59               | 6.97 | 9.55        | 9.51 |

Additional tests were performed with manually procured audio files (see test_wavs/transcripts.txt). The model performs reasonably well as long as the input audio contains a single language. However, when multiple languages appear in the same input, the model struggles to produce an accurate transcription (see test_multi). This can be avoided by segmenting the audio into chunks at pauses in speech (see the segmentation sketch at the end of this card).

- test_ja_1: 57% (CER)
- test_ja_2: 26% (CER)
- test_multi: 99% (CER)
- test_en_1: 12% (WER)
- test_en_2: 27% (WER)

## License

[Apache License 2.0](https://choosealicense.com/licenses/apache-2.0/)
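
## Example: Decoding Byte-Level Tokens

As noted in the Usage section, an individual byte-level token is not human-readable on its own. The snippet below is a minimal illustration of why, using hypothetical byte values (the UTF-8 encoding of "こんにちは"); it does not use the model's actual token inventory, and the mapping from printable token symbols back to raw bytes is handled internally by the reazonspeech library.

```python
# Hypothetical byte values: the UTF-8 encoding of "こんにちは".
# Each Japanese character spans three bytes here, so a single
# byte-level token may hold only a fragment of a character.
token_bytes = bytes(
    [0xE3, 0x81, 0x93,   # こ
     0xE3, 0x82, 0x93,   # ん
     0xE3, 0x81, 0xAB,   # に
     0xE3, 0x81, 0xA1,   # ち
     0xE3, 0x81, 0xAF]   # は
)

# A prefix that ends mid-character fails to decode on its own ...
try:
    token_bytes[:4].decode("utf-8")
except UnicodeDecodeError as err:
    print("partial byte sequence is unreadable:", err)

# ... but the complete byte sequence decodes to readable text.
print(token_bytes.decode("utf-8"))  # => こんにちは
```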
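
## Example: Segmenting Audio at Pauses

The multilingual failure mode described in the Performance section can often be mitigated by splitting the input at pauses before transcription. Below is a minimal sketch of one way to do this with a simple energy-based pause detector. It assumes `numpy` and `soundfile` are installed; the `split_on_pauses` helper and its thresholds (`silence_db`, `min_pause_s`) are illustrative, not part of the reazonspeech library, and a dedicated voice activity detector would be more robust in practice. Only `load_model`, `transcribe`, and `audio_from_path` are taken from the library, as shown in the Usage section.

```python
import tempfile

import numpy as np
import soundfile as sf
from reazonspeech.k2.asr import load_model, transcribe, audio_from_path

def split_on_pauses(samples, sr, frame_ms=30, silence_db=-35.0, min_pause_s=0.5):
    """Split a mono signal into chunks wherever energy stays low for min_pause_s."""
    frame = int(sr * frame_ms / 1000)
    n = len(samples) // frame
    # Per-frame RMS energy in dB (small epsilon avoids log of zero).
    rms = np.sqrt(np.mean(samples[: n * frame].reshape(n, frame) ** 2, axis=1))
    silent = 20 * np.log10(rms + 1e-10) < silence_db
    chunks, start, run = [], 0, 0
    for i, is_silent in enumerate(silent):
        run = run + 1 if is_silent else 0
        if run * frame >= min_pause_s * sr:
            end = (i + 1 - run) * frame  # cut just before the pause began
            if end > start:
                chunks.append(samples[start:end])
            start, run = (i + 1) * frame, 0
    if start < len(samples):
        chunks.append(samples[start:])
    return chunks

model = load_model(device="cpu", precision="fp32", language="ja-en")
samples, sr = sf.read("speech.wav")
if samples.ndim > 1:
    samples = samples.mean(axis=1)  # downmix to mono

for chunk in split_on_pauses(samples, sr):
    # Round-trip each chunk through a temporary WAV file so we stay on
    # the documented audio_from_path() API.
    with tempfile.NamedTemporaryFile(suffix=".wav") as f:
        sf.write(f.name, chunk, sr)
        print(transcribe(model, audio_from_path(f.name)).text)
```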