fdschmidt93 committed
Commit b4b0962
1 Parent(s): ee8c043

docs: add README
Files changed (1): README.md (+195, -0)
---
license: cc-by-nc-4.0
language:
- af
- am
- ar
- as
- az
- be
- bn
- bs
- bg
- ca
- cs
- zh
- cy
- da
- de
- el
- en
- et
- fi
- fr
- or
- om
- ga
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- ig
- id
- is
- it
- jv
- ja
- kn
- ka
- kk
- mn
- km
- ky
- ko
- lo
- ln
- lt
- lb
- lg
- lv
- ml
- mr
- mk
- mt
- mi
- my
- nl
- nb
- ne
- ny
- oc
- pa
- ps
- fa
- pl
- pt
- ro
- ru
- sk
- sl
- sn
- sd
- so
- es
- sr
- sv
- sw
- ta
- te
- tg
- tl
- th
- tr
- uk
- ur
- uz
- vi
- wo
- xh
- yo
- ms
- zu
- ary
- arz
- yue
- kea
tags:
- audio-to-audio
- text-to-speech
multilinguality:
- multilingual
task_categories:
- audio-classification
library_name: transformers
pretty_name: SeamlessM4Tv2-Large Speech Encoder
---

# SeamlessM4Tv2-Large Speech Encoder

This repository carves out the speech encoder from [SeamlessM4Tv2-Large](https://huggingface.co/facebook/seamless-m4t-v2-large). The encoder performs strongly on cross-lingual and multilingual sequence-level audio classification tasks (cf. the SIB-Fleurs results [here](https://huggingface.co/datasets/WueNLP/sib-fleurs#asr-results)).

All credits go to the original SeamlessM4Tv2-Large team.

## Example Usage

You can use both `AutoModel` and `AutoModelForAudioClassification` (or `AutoModelForSequenceClassification`, if you prefer) with this repository. Note that `trust_remote_code=True` is required, as the encoder ships with custom modeling code:

```python
# best to use both the feature extractor and the model with a GPU!
from datasets import load_dataset
from transformers import (
    AutoModel,
    AutoModelForAudioClassification,
    AutoFeatureExtractor,
)
import torch
import torchaudio

device = "cuda:0"

feature_extractor = AutoFeatureExtractor.from_pretrained(
    "WueNLP/seamless-m4t-v2-large-speech-encoder", trust_remote_code=True
)
model = AutoModel.from_pretrained(
    "WueNLP/seamless-m4t-v2-large-speech-encoder",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).to(device)

audio, orig_freq = torchaudio.load(
    "https://www2.cs.uic.edu/~i101/SoundFiles/preamble10.wav"
)
# the model expects a 16 kHz waveform array
audio = torchaudio.functional.resample(audio, orig_freq=orig_freq, new_freq=16_000)
# return_attention_mask=True is required for batching
audio_inputs = feature_extractor(
    audio, return_attention_mask=True, return_tensors="pt", device=device
).to(device)
with torch.autocast(dtype=torch.bfloat16, device_type="cuda"):
    # .float() before .numpy(), since NumPy does not support bfloat16
    audio_hidden_states = model(**audio_inputs)[0].detach().cpu().float().numpy().squeeze()
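
# Not part of the original README: a minimal sketch of collapsing the
# frame-level hidden states into a single utterance embedding for
# sequence-level tasks, assuming `audio_hidden_states` has shape
# (num_frames, hidden_size) after .squeeze(). For padded batches you
# would average only over frames marked valid by the attention mask.
utterance_embedding = audio_hidden_states.mean(axis=0)  # (hidden_size,)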

# instantiate a model for audio classification
model = AutoModelForAudioClassification.from_pretrained(
    "WueNLP/seamless-m4t-v2-large-speech-encoder",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    num_labels=7,  # SIB-Fleurs has 7 labels
).to(device)
eng_Latn = load_dataset("wuenlp/sib-fleurs", "eng_Latn", split="train")
examples = [eng_Latn[i] for i in range(5)]
labels = torch.LongTensor([example["category"] for example in examples]).to(device)
batch = feature_extractor(
    # [0] indexing since instances typically contain multiple utterances; we only use the first
    [example["audio"][0]["array"] for example in examples],
    sampling_rate=16000,
    device=device,
    return_attention_mask=True,
    return_tensors="pt",
).to(device)
batch["labels"] = labels
with torch.autocast(dtype=torch.bfloat16, device_type="cuda"):
    # outputs comprise the loss & logits
    outputs = model(**batch)
```
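
Note that `AutoModelForAudioClassification` attaches a freshly initialized classification head, so the logits above are only meaningful after fine-tuning. As a rough sketch (not part of the original README), the head can be trained with a plain PyTorch loop, reusing `model` and `batch` from the example above; the optimizer, learning rate, and step count are placeholder choices:

```python
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=1e-5)
model.train()
for step in range(10):  # toy loop that repeatedly fits the 5-example batch
    with torch.autocast(dtype=torch.bfloat16, device_type="cuda"):
        outputs = model(**batch)  # loss is computed because batch contains labels
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

For real training you would iterate over a `DataLoader` and likely keep master weights in float32 rather than loading the model in bfloat16.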

## Citation

If you use this model, please cite the original SeamlessM4Tv2 paper:

```bibtex
@misc{communication2023seamlessmultilingualexpressivestreaming,
  title={Seamless: Multilingual Expressive and Streaming Speech Translation},
  author={Seamless Communication and Loïc Barrault and Yu-An Chung and Mariano Coria Meglioli and David Dale and Ning Dong and Mark Duppenthaler and Paul-Ambroise Duquenne and Brian Ellis and Hady Elsahar and Justin Haaheim and John Hoffman and Min-Jae Hwang and Hirofumi Inaguma and Christopher Klaiber and Ilia Kulikov and Pengwei Li and Daniel Licht and Jean Maillard and Ruslan Mavlyutov and Alice Rakotoarison and Kaushik Ram Sadagopan and Abinesh Ramakrishnan and Tuan Tran and Guillaume Wenzek and Yilin Yang and Ethan Ye and Ivan Evtimov and Pierre Fernandez and Cynthia Gao and Prangthip Hansanti and Elahe Kalbassi and Amanda Kallet and Artyom Kozhevnikov and Gabriel Mejia Gonzalez and Robin San Roman and Christophe Touret and Corinne Wong and Carleigh Wood and Bokai Yu and Pierre Andrews and Can Balioglu and Peng-Jen Chen and Marta R. Costa-jussà and Maha Elbayad and Hongyu Gong and Francisco Guzmán and Kevin Heffernan and Somya Jain and Justine Kao and Ann Lee and Xutai Ma and Alex Mourachko and Benjamin Peloquin and Juan Pino and Sravya Popuri and Christophe Ropers and Safiyyah Saleem and Holger Schwenk and Anna Sun and Paden Tomasello and Changhan Wang and Jeff Wang and Skyler Wang and Mary Williamson},
  year={2023},
  eprint={2312.05187},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2312.05187},
}
```