---
license: mit
language:
- en
metrics:
- wer
pipeline_tag: automatic-speech-recognition
datasets:
- MuST-C-en_ar
tags:
- audio
- automatic-speech-recognition
- speech
- speech2text
- ASR
- asr
- ASR-punctuation-sensitive
- encoder-decoder-for-asr
---

# speecht5-asr-punctuation-sensitive
This model is part of SotoMedia's Automatic Video Dubbing project, which aims to build the first open-source video dubbing technology across a
diverse range of languages. You can find more details about our project and our pipeline [here](https://github.com/ElsebaiyMohamed/Modablag).

## Description:
The **speecht5-asr-punctuation-sensitive** model is an advanced Automatic Speech Recognition (ASR) system designed to transcribe spoken English 
while maintaining a high level of awareness for punctuation. This model is trained to accurately recognize and preserve punctuation marks,
enhancing the fidelity of transcriptions in scenarios where punctuation is crucial for conveying meaning. 

- **Model type:** Transformer encoder-decoder
- **Language:** English
- **Base model:** SpeechT5-ASR [checkpoint](https://huggingface.co/microsoft/speecht5_asr)
- **Finetuning dataset:** [MuST-C-en_ar](https://www.kaggle.com/datasets/sebaeymohamed/must-c-en-ar)

## Key Features:
- **Punctuation sensitivity:** The model is specifically engineered to be highly sensitive to punctuation nuances in spoken English, ensuring an accurate representation of the speaker's intended meaning.
- **New vocabulary:** The vocabulary was changed from character-level to piece-level, with a vocabulary size of 500 pieces.
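
To illustrate what piece-level tokenization means, here is a toy greedy longest-match segmenter over a made-up piece inventory. This is purely illustrative; it is not the model's actual 500-piece vocabulary, which comes from a trained subword model:

```python
# Toy illustration of piece-level tokenization: greedy longest-match
# segmentation over a small, made-up piece inventory. This is NOT the
# model's real 500-piece vocabulary.
PIECES = {"He", "llo", ", ", "wor", "ld"}

def piece_tokenize(text, pieces, max_len=4):
    """Split `text` greedily into the longest known pieces, falling back
    to single characters for anything not in the inventory."""
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + length] in pieces:
                tokens.append(text[i:i + length])
                i += length
                break
        else:
            tokens.append(text[i])  # unknown character fallback
            i += 1
    return tokens

print(list("Hello, world"))                    # 12 character-level tokens
print(piece_tokenize("Hello, world", PIECES))  # ['He', 'llo', ', ', 'wor', 'ld']
```

A character-level vocabulary emits one token per symbol, while a 500-piece vocabulary shortens the decoder's output sequences, which generally speeds up decoding and lets punctuation-bearing units (such as `", "`) be learned as single pieces.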


## Usage

```py
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="seba3y/speecht5-asr-punctuation-sensitive")
```


```py
# Load model directly
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

processor = AutoProcessor.from_pretrained("seba3y/speecht5-asr-punctuation-sensitive")
model = AutoModelForSpeechSeq2Seq.from_pretrained("seba3y/speecht5-asr-punctuation-sensitive")
```

## Finetuning & Evaluation Details

### Dataset
MuST-C is a multilingual speech translation corpus whose size and quality facilitate the training of end-to-end systems
for spoken language translation (SLT) from English into several target languages. For each target language, MuST-C comprises several hundred hours of audio
recordings from English TED Talks, automatically aligned at the sentence level with their manual transcriptions and translations.

**Data splits:**

- set: dev

|Statistic|Value|
|-|-|
|talks|11|
|sentences|1073|
|words (src)|24274|
|words (tgt)|21387|
|time|2h28m34s|

- set: tst-COMMON

|Statistic|Value|
|-|-|
|talks|27|
|sentences|2019|
|words (src)|41955|
|words (tgt)|36443|
|time|4h04m39s|

- set: tst-HE

|Statistic|Value|
|-|-|
|talks|12|
|sentences|578|
|words (src)|13080|
|words (tgt)|10912|
|time|1h26m51s|

- set: train

|Statistic|Value|
|-|-|
|talks|2412|
|sentences|212085|
|words (src)|4520522|
|words (tgt)|4000457|
|time|463h15m44s|


#### Hyperparameters

|Parameter|Value|
|-|-|
|per_device_train_batch_size|6|
|per_device_eval_batch_size|16|
|gradient_accumulation_steps|12|
|eval_accumulation_steps|16|
|dataloader_num_workers|2|
|learning_rate|5e-5|
|adafactor|True|
|weight_decay|0.08989525|
|max_grad_norm|0.58585|
|num_train_epochs|5|
|warmup_ratio|0.7|
|lr_scheduler_type|constant_with_warmup|
|fp16|True|
|gradient_checkpointing|True|
|sortish_sampler|True|
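
The table above corresponds roughly to the following `transformers` configuration sketch; `output_dir` is a placeholder, and any argument not listed in the table is assumed to keep its library default:

```python
from transformers import Seq2SeqTrainingArguments

# Sketch of the finetuning configuration from the table above.
# output_dir is a placeholder, not the authors' actual path.
training_args = Seq2SeqTrainingArguments(
    output_dir="speecht5-asr-punctuation-sensitive",  # placeholder
    per_device_train_batch_size=6,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=12,  # effective train batch of 6 * 12 = 72 per device
    eval_accumulation_steps=16,
    dataloader_num_workers=2,
    learning_rate=5e-5,
    adafactor=True,
    weight_decay=0.08989525,
    max_grad_norm=0.58585,
    num_train_epochs=5,
    warmup_ratio=0.7,
    lr_scheduler_type="constant_with_warmup",
    fp16=True,
    gradient_checkpointing=True,
    sortish_sampler=True,
)
```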

##### Results
**Train loss:** 0.8925
|Split|Word Error Rate (%)|
|-|-|
|dev|44.8|
|tst-HE|39.1|
|tst-COMMON|43.2|
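
For reference, WER is the word-level edit distance divided by the reference length. Below is a minimal pure-Python sketch (in practice a library such as `jiwer` or `evaluate` is typically used). Note that for a punctuation-sensitive model, punctuation attaches to words, so punctuation mistakes count as word errors:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # edit-distance row for 0 reference words
    for i, r in enumerate(ref, 1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, 1):
            curr[j] = min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (free if words match)
            )
        prev = curr
    return prev[-1] / len(ref)

# Dropped punctuation shows up as substitutions ("so," != "so", "do?" != "do"):
print(wer("so, what do we do?", "so what do we do"))  # → 0.4
```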


## Citation

- MuST-C dataset
```
@inproceedings{mustc19,
    author = "Di Gangi, Mattia Antonino and Cattoni, Roldano and Bentivogli, Luisa and Negri, Matteo and Turchi, Marco",
    title = "{MuST-C: a Multilingual Speech Translation Corpus}",
    booktitle = "Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)",
    year = "2019",
    address = "Minneapolis, MN, USA",
    month = "June"
}
```
- SpeechT5-ASR
```
@inproceedings{ao-etal-2022-speecht5,
    title = {{S}peech{T}5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing},
    author = {Ao, Junyi and Wang, Rui and Zhou, Long and Wang, Chengyi and Ren, Shuo and Wu, Yu and Liu, Shujie and Ko, Tom and Li, Qing and Zhang, Yu and Wei, Zhihua and Qian, Yao and Li, Jinyu and Wei, Furu},
    booktitle = {Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
    month = {May},
    year = {2022},
    pages = {5723--5738},
}

```