---
tags:
- espnet
- audio
- automatic-speech-recognition
- speech-translation
- language-identification
language: multilingual
datasets:
- owsm_v3.1_ctc
license: cc-by-4.0
---

[OWSM-CTC](https://aclanthology.org/2024.acl-long.549/) (Peng et al., ACL 2024) is an encoder-only speech foundation model based on hierarchical multi-task self-conditioned CTC.
It is trained on 180k hours of public audio data for multilingual speech recognition, any-to-any speech translation, and language identification, following the design of the [Open Whisper-style Speech Model (OWSM)](https://arxiv.org/abs/2401.16658) project.

Due to time constraints, the model used in the paper was trained for 40 "epochs". A new model trained for 45 "epochs" (approximately three full passes over the data) has also been added to this repo to match the setup of the encoder-decoder OWSM; it can achieve better performance than the old one on many test sets.

The code for OWSM-CTC has not yet been merged into the ESPnet main branch. It is currently available in the following places:
- PR in ESPnet: https://github.com/espnet/espnet/pull/5933
- Code in my repo: https://github.com/pyf98/espnet/tree/owsm-ctc
- Current model on HF: https://huggingface.co/pyf98/owsm_ctc_v3.1_1B

To use the pre-trained model, you need to install `espnet` and `espnet_model_zoo`. The requirements are:
```
librosa
torch
espnet @ git+https://github.com/pyf98/espnet@owsm-ctc
espnet_model_zoo
```

We use FlashAttention during training, but it is not needed for inference. If you want to install it, run:
```bash
pip install flash-attn --no-build-isolation
```

### Example script for short-form ASR/ST

```python
import soundfile as sf
import librosa
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch


# Load the pre-trained model for English ASR with greedy CTC decoding
s2t = Speech2TextGreedySearch.from_pretrained(
    "pyf98/owsm_ctc_v3.1_1B",
    device="cuda",
    generate_interctc_outputs=False,
    lang_sym='<eng>',
    task_sym='<asr>',
)

# Read a 16 kHz speech file and pad/trim it to 30 seconds, as expected by the model
speech, rate = sf.read("xxx.wav")
speech = librosa.util.fix_length(speech, size=(16000 * 30))

res = s2t(speech)[0]
print(res)
```
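
The same greedy-search interface can be used for speech translation by changing the task token. Below is a hedged sketch: the target-language task token (here `<st_deu>` for English-to-German translation, following the OWSM `<st_{target}>` convention) is an assumption and should be checked against the model's token list.

```python
import soundfile as sf
import librosa
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch


# Sketch: speech translation with the same greedy CTC decoding interface.
# "<st_deu>" (translate English speech into German text) is an assumed token
# based on the OWSM convention; verify it against the model's vocabulary.
s2t_st = Speech2TextGreedySearch.from_pretrained(
    "pyf98/owsm_ctc_v3.1_1B",
    device="cuda",
    generate_interctc_outputs=False,
    lang_sym='<eng>',      # language of the input speech
    task_sym='<st_deu>',   # assumed task token for translation into German
)

speech, rate = sf.read("xxx.wav")
speech = librosa.util.fix_length(speech, size=(16000 * 30))
print(s2t_st(speech)[0])
```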

### Example script for long-form ASR/ST

```python
import soundfile as sf
import torch
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch


context_len_in_secs = 4   # left and right context when doing buffered inference
batch_size = 32   # depends on the GPU memory
s2t = Speech2TextGreedySearch.from_pretrained(
    "pyf98/owsm_ctc_v3.1_1B",
    device='cuda' if torch.cuda.is_available() else 'cpu',
    generate_interctc_outputs=False,
    lang_sym='<eng>',
    task_sym='<asr>',
)

speech, rate = sf.read(
    "xxx.wav"
)

text = s2t.decode_long_batched_buffered(
    speech,
    batch_size=batch_size,
    context_len_in_secs=context_len_in_secs,
    frames_per_sec=12.5,        # 80ms shift, model-dependent, don't change
)
print(text)
```

### Example for CTC forced alignment using `ctc-segmentation`

CTC forced alignment can be applied efficiently to audio of arbitrary length.
For model downloading, please refer to https://github.com/espnet/espnet?tab=readme-ov-file#ctc-segmentation-demo
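
As a sketch (the link above remains the authoritative reference), one way to fetch the checkpoint files locally so that the `s2t_model_file` path used below exists is via `huggingface_hub`:

```python
# Hypothetical download helper: pulls the whole model repo, which is assumed
# to contain the "exp/..." checkpoint directory referenced in the example below.
from huggingface_hub import snapshot_download

local_dir = snapshot_download("pyf98/owsm_ctc_v3.1_1B")
print(local_dir)  # point s2t_model_file at the checkpoint under this directory
```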

```python
import soundfile as sf
from espnet2.bin.s2t_ctc_align import CTCSegmentation


# Please download the model first (see the link above)
aligner = CTCSegmentation(
    s2t_model_file="exp/s2t_train_s2t_multitask-ctc_ebf27_conv2d8_size1024_raw_bpe50000/valid.total_count.ave_5best.till45epoch.pth",
    fs=16000,
    ngpu=1,
    batch_size=16,    # batched parallel decoding; reduce it if your GPU memory is smaller
    kaldi_style_text=True,
    time_stamps="fixed",
    samples_to_frames_ratio=1280,   # 80ms time shift; don't change as it depends on the pre-trained model
    lang_sym="<eng>",
    task_sym="<asr>",
    context_len_in_secs=2,  # left and right context in buffered decoding
    frames_per_sec=12.5,    # 80ms time shift; don't change as it depends on the pre-trained model
)

speech, rate = sf.read(
    "example.wav"
)
print(f"speech duration: {len(speech) / rate : .2f} seconds")
text = '''
utt1 hello there
utt2 welcome to this repo
'''

segments = aligner(speech, text)
print(segments)
```