Yifan Peng committed
Commit b0add0d
1 Parent(s): 37066b8
Files changed (27)
  1. README.md +132 -0
  2. data/token_list/bpe_unigram50000/bpe.model +3 -0
  3. exp/s2t_train_s2t_multitask-ctc_ebf27_conv2d8_size1024_init3.1_raw_bpe50000/RESULTS.md +0 -0
  4. exp/s2t_train_s2t_multitask-ctc_ebf27_conv2d8_size1024_init3.1_raw_bpe50000/config.yaml +0 -0
  5. exp/s2t_train_s2t_multitask-ctc_ebf27_conv2d8_size1024_init3.1_raw_bpe50000/images/backward_time.png +0 -0
  6. exp/s2t_train_s2t_multitask-ctc_ebf27_conv2d8_size1024_init3.1_raw_bpe50000/images/cer_ctc.png +0 -0
  7. exp/s2t_train_s2t_multitask-ctc_ebf27_conv2d8_size1024_init3.1_raw_bpe50000/images/cer_interctc_layer12.png +0 -0
  8. exp/s2t_train_s2t_multitask-ctc_ebf27_conv2d8_size1024_init3.1_raw_bpe50000/images/cer_interctc_layer15.png +0 -0
  9. exp/s2t_train_s2t_multitask-ctc_ebf27_conv2d8_size1024_init3.1_raw_bpe50000/images/cer_interctc_layer21.png +0 -0
  10. exp/s2t_train_s2t_multitask-ctc_ebf27_conv2d8_size1024_init3.1_raw_bpe50000/images/cer_interctc_layer6.png +0 -0
  11. exp/s2t_train_s2t_multitask-ctc_ebf27_conv2d8_size1024_init3.1_raw_bpe50000/images/clip.png +0 -0
  12. exp/s2t_train_s2t_multitask-ctc_ebf27_conv2d8_size1024_init3.1_raw_bpe50000/images/forward_time.png +0 -0
  13. exp/s2t_train_s2t_multitask-ctc_ebf27_conv2d8_size1024_init3.1_raw_bpe50000/images/gpu_max_cached_mem_GB.png +0 -0
  14. exp/s2t_train_s2t_multitask-ctc_ebf27_conv2d8_size1024_init3.1_raw_bpe50000/images/grad_norm.png +0 -0
  15. exp/s2t_train_s2t_multitask-ctc_ebf27_conv2d8_size1024_init3.1_raw_bpe50000/images/iter_time.png +0 -0
  16. exp/s2t_train_s2t_multitask-ctc_ebf27_conv2d8_size1024_init3.1_raw_bpe50000/images/loss.png +0 -0
  17. exp/s2t_train_s2t_multitask-ctc_ebf27_conv2d8_size1024_init3.1_raw_bpe50000/images/loss_ctc.png +0 -0
  18. exp/s2t_train_s2t_multitask-ctc_ebf27_conv2d8_size1024_init3.1_raw_bpe50000/images/loss_interctc_layer12.png +0 -0
  19. exp/s2t_train_s2t_multitask-ctc_ebf27_conv2d8_size1024_init3.1_raw_bpe50000/images/loss_interctc_layer15.png +0 -0
  20. exp/s2t_train_s2t_multitask-ctc_ebf27_conv2d8_size1024_init3.1_raw_bpe50000/images/loss_interctc_layer21.png +0 -0
  21. exp/s2t_train_s2t_multitask-ctc_ebf27_conv2d8_size1024_init3.1_raw_bpe50000/images/loss_interctc_layer6.png +0 -0
  22. exp/s2t_train_s2t_multitask-ctc_ebf27_conv2d8_size1024_init3.1_raw_bpe50000/images/loss_scale.png +0 -0
  23. exp/s2t_train_s2t_multitask-ctc_ebf27_conv2d8_size1024_init3.1_raw_bpe50000/images/optim0_lr0.png +0 -0
  24. exp/s2t_train_s2t_multitask-ctc_ebf27_conv2d8_size1024_init3.1_raw_bpe50000/images/optim_step_time.png +0 -0
  25. exp/s2t_train_s2t_multitask-ctc_ebf27_conv2d8_size1024_init3.1_raw_bpe50000/images/train_time.png +0 -0
  26. exp/s2t_train_s2t_multitask-ctc_ebf27_conv2d8_size1024_init3.1_raw_bpe50000/valid.total_count.ave_5best.pth +3 -0
  27. meta.yaml +8 -0
README.md ADDED
@@ -0,0 +1,132 @@
---
tags:
- espnet
- audio
- automatic-speech-recognition
- speech-translation
- language-identification
language: multilingual
datasets:
- owsm_v3.2_ctc
license: cc-by-4.0
---

[OWSM-CTC](https://aclanthology.org/2024.acl-long.549/) (Peng et al., ACL 2024) is an encoder-only speech foundation model based on hierarchical multi-task self-conditioned CTC.
It is trained on 180k hours of public audio data for multilingual speech recognition, any-to-any speech translation, and language identification, following the design of the [Open Whisper-style Speech Model (OWSM)](https://arxiv.org/abs/2401.16658) project.

This model is initialized with [OWSM-CTC v3.1](https://huggingface.co/pyf98/owsm_ctc_v3.1_1B) and then fine-tuned on [v3.2 data](https://arxiv.org/abs/2406.09282) for 225k steps.

Currently, the code for OWSM-CTC has not been merged into the ESPnet main branch. Instead, it is available here:
- Code in my repo: https://github.com/pyf98/espnet/tree/owsm-ctc
- Current model on HF: https://huggingface.co/pyf98/owsm_ctc_v3.1_1B

To use the pre-trained model, you need to install `espnet` and `espnet_model_zoo`. The requirements are:
```
librosa
torch
espnet @ git+https://github.com/pyf98/espnet@owsm-ctc
espnet_model_zoo
```
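As a quick sanity check of the environment, the snippet below reports whether the packages above are importable. This is only an illustrative sketch; `check_packages` is a hypothetical helper, not part of ESPnet:

```python
import importlib.util


def check_packages(pkgs):
    """Map each top-level package name to whether it is importable here."""
    return {p: importlib.util.find_spec(p) is not None for p in pkgs}


# Top-level import names corresponding to the requirements listed above.
print(check_packages(["librosa", "torch", "espnet2", "espnet_model_zoo"]))
```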

We use FlashAttention during training, but it is not required during inference. If needed, install it as follows:
```bash
pip install flash-attn --no-build-isolation
```

### Example script for short-form ASR/ST

```python
import soundfile as sf
import librosa
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch


s2t = Speech2TextGreedySearch.from_pretrained(
    "pyf98/owsm_ctc_v3.2_ft_1B",
    device="cuda",
    generate_interctc_outputs=False,
    lang_sym='<eng>',
    task_sym='<asr>',
)

speech, rate = sf.read(
    "xxx.wav"
)
# Pad or trim the waveform to a fixed 30-second input at 16 kHz.
speech = librosa.util.fix_length(speech, size=(16000 * 30))

res = s2t(speech)[0]
print(res)
```
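`librosa.util.fix_length` zero-pads or trims the waveform so that a short-form input is exactly 30 seconds at 16 kHz. A minimal NumPy equivalent, for illustration only (the hypothetical `fix_length` below is not the ESPnet or librosa implementation):

```python
import numpy as np


def fix_length(speech: np.ndarray, size: int) -> np.ndarray:
    """Zero-pad or trim a 1-D waveform to exactly `size` samples."""
    if len(speech) >= size:
        return speech[:size]
    return np.pad(speech, (0, size - len(speech)))


target = 16000 * 30                    # 30 seconds at 16 kHz
short = np.ones(16000 * 5)             # a 5-second input gets zero-padded
assert fix_length(short, target).shape == (target,)
too_long = np.ones(16000 * 40)         # a 40-second input gets trimmed
assert fix_length(too_long, target).shape == (target,)
```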

### Example script for long-form ASR/ST

```python
import soundfile as sf
import torch
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch


if __name__ == "__main__":
    context_len_in_secs = 4  # left and right context when doing buffered inference
    batch_size = 32  # depends on the GPU memory
    s2t = Speech2TextGreedySearch.from_pretrained(
        "pyf98/owsm_ctc_v3.2_ft_1B",
        device='cuda' if torch.cuda.is_available() else 'cpu',
        generate_interctc_outputs=False,
        lang_sym='<eng>',
        task_sym='<asr>',
    )

    speech, rate = sf.read(
        "xxx.wav"
    )

    text = s2t.decode_long_batched_buffered(
        speech,
        batch_size=batch_size,
        context_len_in_secs=context_len_in_secs,
        frames_per_sec=12.5,  # 80ms shift, model-dependent, don't change
    )
    print(text)
```
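To see how the buffering parameters relate: the encoder emits 12.5 output frames per second (an 80 ms shift), so `context_len_in_secs=4` corresponds to 50 encoder frames of overlap on each side of a chunk. The sketch below is illustrative arithmetic only; the actual chunking lives inside `decode_long_batched_buffered`, and the 30-second window size is an assumption based on the short-form example above:

```python
frames_per_sec = 12.5        # encoder output rate: one frame per 80 ms
context_len_in_secs = 4      # left/right context per chunk
chunk_len_in_secs = 30       # assumed window size, matching the 30 s short-form input

# Frames of context discarded on each side of every chunk.
context_frames = int(frames_per_sec * context_len_in_secs)

# Fresh audio covered per chunk after removing both contexts.
kept_secs = chunk_len_in_secs - 2 * context_len_in_secs

print(context_frames)  # 50
print(kept_secs)       # 22
```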

### Example for CTC forced alignment using `ctc-segmentation`

It can be efficiently applied to audio of an arbitrary length.
For model downloading, please refer to https://github.com/espnet/espnet?tab=readme-ov-file#ctc-segmentation-demo

```python
import soundfile as sf
from espnet2.bin.s2t_ctc_align import CTCSegmentation


if __name__ == "__main__":
    ## Please download the model first
    aligner = CTCSegmentation(
        s2t_model_file="exp/s2t_train_s2t_multitask-ctc_ebf27_conv2d8_size1024_raw_bpe50000/valid.total_count.ave_5best.till45epoch.pth",
        fs=16000,
        ngpu=1,
        batch_size=16,  # batched parallel decoding; reduce it if your GPU memory is smaller
        kaldi_style_text=True,
        time_stamps="fixed",
        samples_to_frames_ratio=1280,  # 80ms time shift; don't change as it depends on the pre-trained model
        lang_sym="<eng>",
        task_sym="<asr>",
        context_len_in_secs=2,  # left and right context in buffered decoding
        frames_per_sec=12.5,  # 80ms time shift; don't change as it depends on the pre-trained model
    )

    speech, rate = sf.read(
        "example.wav"
    )
    print(f"speech duration: {len(speech) / rate : .2f} seconds")
    text = '''
utt1 hello there
utt2 welcome to this repo
'''

    segments = aligner(speech, text)
    print(segments)
```
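With `kaldi_style_text=True`, each non-empty line of `text` is `<utterance-id> <transcript>`. A minimal parser sketch of that format (illustrative only; `parse_kaldi_text` is a hypothetical helper, and the real parsing happens inside `CTCSegmentation`), plus a check of the sample-to-frame arithmetic (`samples_to_frames_ratio=1280` is exactly 16000 Hz times the 80 ms shift):

```python
def parse_kaldi_text(text: str) -> list[tuple[str, str]]:
    """Split kaldi-style text into (utt_id, transcript) pairs."""
    pairs = []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        # The first whitespace separates the utterance id from the transcript.
        utt_id, _, transcript = line.partition(" ")
        pairs.append((utt_id, transcript))
    return pairs


text = """
utt1 hello there
utt2 welcome to this repo
"""
print(parse_kaldi_text(text))  # [('utt1', 'hello there'), ('utt2', 'welcome to this repo')]

# 80 ms shift at 16 kHz: 16000 * 0.08 = 1280 samples per encoder frame.
assert int(16000 * 0.08) == 1280
```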
data/token_list/bpe_unigram50000/bpe.model ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ff9e37e2ec3b9c6cd1a2b02672b40a17b8bc2e11ad865a44518835a199dfd890
size 1031801
exp/s2t_train_s2t_multitask-ctc_ebf27_conv2d8_size1024_init3.1_raw_bpe50000/RESULTS.md ADDED
File without changes
exp/s2t_train_s2t_multitask-ctc_ebf27_conv2d8_size1024_init3.1_raw_bpe50000/config.yaml ADDED
The diff for this file is too large to render. See raw diff
 
exp/s2t_train_s2t_multitask-ctc_ebf27_conv2d8_size1024_init3.1_raw_bpe50000/images/backward_time.png ADDED
exp/s2t_train_s2t_multitask-ctc_ebf27_conv2d8_size1024_init3.1_raw_bpe50000/images/cer_ctc.png ADDED
exp/s2t_train_s2t_multitask-ctc_ebf27_conv2d8_size1024_init3.1_raw_bpe50000/images/cer_interctc_layer12.png ADDED
exp/s2t_train_s2t_multitask-ctc_ebf27_conv2d8_size1024_init3.1_raw_bpe50000/images/cer_interctc_layer15.png ADDED
exp/s2t_train_s2t_multitask-ctc_ebf27_conv2d8_size1024_init3.1_raw_bpe50000/images/cer_interctc_layer21.png ADDED
exp/s2t_train_s2t_multitask-ctc_ebf27_conv2d8_size1024_init3.1_raw_bpe50000/images/cer_interctc_layer6.png ADDED
exp/s2t_train_s2t_multitask-ctc_ebf27_conv2d8_size1024_init3.1_raw_bpe50000/images/clip.png ADDED
exp/s2t_train_s2t_multitask-ctc_ebf27_conv2d8_size1024_init3.1_raw_bpe50000/images/forward_time.png ADDED
exp/s2t_train_s2t_multitask-ctc_ebf27_conv2d8_size1024_init3.1_raw_bpe50000/images/gpu_max_cached_mem_GB.png ADDED
exp/s2t_train_s2t_multitask-ctc_ebf27_conv2d8_size1024_init3.1_raw_bpe50000/images/grad_norm.png ADDED
exp/s2t_train_s2t_multitask-ctc_ebf27_conv2d8_size1024_init3.1_raw_bpe50000/images/iter_time.png ADDED
exp/s2t_train_s2t_multitask-ctc_ebf27_conv2d8_size1024_init3.1_raw_bpe50000/images/loss.png ADDED
exp/s2t_train_s2t_multitask-ctc_ebf27_conv2d8_size1024_init3.1_raw_bpe50000/images/loss_ctc.png ADDED
exp/s2t_train_s2t_multitask-ctc_ebf27_conv2d8_size1024_init3.1_raw_bpe50000/images/loss_interctc_layer12.png ADDED
exp/s2t_train_s2t_multitask-ctc_ebf27_conv2d8_size1024_init3.1_raw_bpe50000/images/loss_interctc_layer15.png ADDED
exp/s2t_train_s2t_multitask-ctc_ebf27_conv2d8_size1024_init3.1_raw_bpe50000/images/loss_interctc_layer21.png ADDED
exp/s2t_train_s2t_multitask-ctc_ebf27_conv2d8_size1024_init3.1_raw_bpe50000/images/loss_interctc_layer6.png ADDED
exp/s2t_train_s2t_multitask-ctc_ebf27_conv2d8_size1024_init3.1_raw_bpe50000/images/loss_scale.png ADDED
exp/s2t_train_s2t_multitask-ctc_ebf27_conv2d8_size1024_init3.1_raw_bpe50000/images/optim0_lr0.png ADDED
exp/s2t_train_s2t_multitask-ctc_ebf27_conv2d8_size1024_init3.1_raw_bpe50000/images/optim_step_time.png ADDED
exp/s2t_train_s2t_multitask-ctc_ebf27_conv2d8_size1024_init3.1_raw_bpe50000/images/train_time.png ADDED
exp/s2t_train_s2t_multitask-ctc_ebf27_conv2d8_size1024_init3.1_raw_bpe50000/valid.total_count.ave_5best.pth ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:17d91370a0d2f10f342a2f10fbad4acd830640beeb838751126addb22d082e9a
size 4020755920
meta.yaml ADDED
@@ -0,0 +1,8 @@
espnet: '202310'
files:
  s2t_model_file: exp/s2t_train_s2t_multitask-ctc_ebf27_conv2d8_size1024_init3.1_raw_bpe50000/valid.total_count.ave_5best.pth
python: 3.12.2 | packaged by conda-forge | (main, Feb 16 2024, 20:38:53) [GCC 12.3.0]
timestamp: 1727202259.438145
torch: 2.5.0.dev20240829+cu124
yaml_files:
  s2t_train_config: exp/s2t_train_s2t_multitask-ctc_ebf27_conv2d8_size1024_init3.1_raw_bpe50000/config.yaml