---
tags:
  - espnet
  - audio
  - automatic-speech-recognition
  - speech-translation
  - language-identification
language: multilingual
datasets:
  - owsm_v3.2_ctc
license: cc-by-4.0
---

OWSM-CTC (Peng et al., ACL 2024) is an encoder-only speech foundation model based on hierarchical multi-task self-conditioned CTC. It is trained on 180k hours of public audio data for multilingual speech recognition, any-to-any speech translation, and language identification, following the design of the Open Whisper-style Speech Model (OWSM) project.

This model is initialized with OWSM-CTC v3.1 and then fine-tuned on v3.2 data for 225k steps.

To use the pre-trained model, please install `espnet` and `espnet_model_zoo`. The requirements are:

```
librosa
torch
espnet
espnet_model_zoo
```
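For example, `pip install librosa torch espnet espnet_model_zoo` installs all four in one step (assuming a pip-based environment; pin versions as needed for reproducibility).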

We use FlashAttention during training, but it is not required for inference. To install it (optional), run:

```
pip install flash-attn --no-build-isolation
```

Example usage can be found in ESPnet: https://github.com/espnet/espnet/tree/master/egs2/owsm_ctc_v3.1/s2t1
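As a rough sketch, short-form ASR with this model might look like the following. The `Speech2TextGreedySearch` interface and its arguments follow the OWSM-CTC examples in ESPnet, and the model ID `pyf98/owsm_ctc_v3.2_ft_1B` is assumed from this repo's name; consult the linked recipe for the exact, up-to-date API.

```python
# Minimal ASR sketch (model ID and argument names assumed; see the linked recipe).
import librosa
import soundfile as sf
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch

s2t = Speech2TextGreedySearch.from_pretrained(
    "pyf98/owsm_ctc_v3.2_ft_1B",  # assumed model ID for this repo
    device="cuda",                # use "cpu" if no GPU is available
    generate_interctc_outputs=False,
    lang_sym="<eng>",             # language token, e.g. <eng> for English
    task_sym="<asr>",             # task token, <asr> for speech recognition
)

# OWSM models consume fixed 30-second segments at 16 kHz,
# so pad (or trim) the waveform to 30 s before decoding.
speech, rate = sf.read("audio.wav")  # 16 kHz mono audio assumed
speech = librosa.util.fix_length(speech, size=16000 * 30)

# Each hypothesis is a tuple whose first element is the decoded text.
text = s2t(speech)[0][0]
print(text)
```

For speech translation, the task token would change (e.g. a target-language token instead of `<asr>`); the recipe linked above covers the full set of supported tasks and long-form decoding.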