OWSM-CTC (Peng et al., ACL 2024) is an encoder-only speech foundation model based on hierarchical multi-task self-conditioned CTC. It is trained on 180k hours of public audio data for multilingual speech recognition, any-to-any speech translation, and language identification, following the design of the Open Whisper-style Speech Model (OWSM) project.

This model is initialized with OWSM-CTC v3.1 and then fine-tuned on v3.2 data for 225k steps.

To use the pre-trained model, please install espnet and espnet_model_zoo. The requirements are:

librosa
torch
espnet
espnet_model_zoo
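
All of these are available from PyPI and can be installed in one step, for example:

pip install librosa torch espnet espnet_model_zoo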

We use FlashAttention during training, but it is not required for inference. If you want to install it, run:

pip install flash-attn --no-build-isolation

Example usage can be found in ESPnet: https://github.com/espnet/espnet/tree/master/egs2/owsm_ctc_v3.1/s2t1
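
As a quick illustration, below is a minimal sketch of short-form English ASR with this checkpoint. It assumes the Speech2TextGreedySearch interface from espnet2.bin.s2t_inference_ctc and the OWSM special tokens <eng> (language) and <asr> (task); refer to the recipe linked above for the authoritative usage, including long-form decoding.

import librosa
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch

# Load the pre-trained model from the Hugging Face Hub.
# The keyword arguments below assume the ESPnet s2t_inference_ctc interface.
s2t = Speech2TextGreedySearch.from_pretrained(
    "espnet/owsm_ctc_v3.2_ft_1B",
    device="cuda",
    lang_sym="<eng>",  # language token for English
    task_sym="<asr>",  # task token for speech recognition
)

# Load 16 kHz mono audio ("speech.wav" is a placeholder path).
speech, rate = librosa.load("speech.wav", sr=16000)

# OWSM models consume fixed 30-second windows; pad or trim the input.
speech = librosa.util.fix_length(speech, size=16000 * 30)

# Greedy CTC decoding; the first element is the best hypothesis.
result = s2t(speech)[0]
print(result)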
