Update README.md (#12)
- Update README.md (8f88c5e3032c736a67d6ac696f21f3f25d2bc390)
- Update README.md (00acca2c94f2c6c8150d1cd1b858e6447b16102c)
- Update README.md (19f404348f18bbccdf8b9da95722d65cf6e27ddf)
Co-authored-by: Maha Elbayad <elbayadm@users.noreply.huggingface.co>
README.md
CHANGED
@@ -1,11 +1,116 @@
|
|
1 |
---
|
2 |
-
inference: false
|
3 |
-
tags:
|
4 |
-
- SeamlessM4T
|
5 |
license: cc-by-nc-4.0
|
6 |
library_name: fairseq2
|
7 |
---
|
8 |
-
|
9 |
# SeamlessM4T Medium
|
10 |
|
11 |
SeamlessM4T is a collection of models designed to provide high quality translation, allowing people from different
|
@@ -18,14 +123,15 @@ SeamlessM4T covers:
|
|
18 |
|
19 |
-------------------
|
20 |
|
21 |
-
**🌟 SeamlessM4T v2, an improved version of this model with a novel architecture, has been released [here](https://huggingface.co/facebook/seamless-m4t-v2-large)
|
22 |
-
|
|
|
23 |
|
24 |
**SeamlessM4T v2 is also supported by 🤗 Transformers, more on it [in the model card of this new version](https://huggingface.co/facebook/seamless-m4t-v2-large#transformers-usage) or directly in [🤗 Transformers docs](https://huggingface.co/docs/transformers/main/en/model_doc/seamless_m4t_v2).**
|
25 |
|
26 |
-------------------
|
27 |
|
28 |
-
This is the "medium" variant of
|
29 |
- Speech-to-speech translation (S2ST)
|
30 |
- Speech-to-text translation (S2TT)
|
31 |
- Text-to-speech translation (T2ST)
|
@@ -33,22 +139,23 @@ This is the "medium" variant of the unified model, which enables multiple tasks
|
|
33 |
- Automatic speech recognition (ASR)
|
34 |
|
35 |
## SeamlessM4T models
|
36 |
-
|
37 |
| Model Name | #params | checkpoint | metrics |
|
38 |
| ------------------ | ------- | --------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------ |
|
39 |
-
| SeamlessM4T-Large
|
40 |
-
| SeamlessM4T-
|
|
|
41 |
|
42 |
-
We provide extensive evaluation results of SeamlessM4T
|
43 |
|
44 |
## 🤗 Transformers Usage
|
45 |
|
46 |
First, load the processor and a checkpoint of the model:
|
47 |
|
48 |
```python
|
49 |
-
|
50 |
-
|
51 |
-
|
|
|
52 |
```
|
53 |
|
54 |
You can seamlessly use this model on text or on audio, to generate either translated text or translated audio.
|
@@ -56,110 +163,62 @@ We provide extensive evaluation results of SeamlessM4T-Medium and SeamlessM4T-La
|
|
56 |
Here is how to use the processor to process text and audio:
|
57 |
|
58 |
```python
|
59 |
-
|
60 |
-
|
61 |
-
|
62 |
-
|
63 |
-
>>> # now, process it
|
64 |
-
>>> audio_inputs = processor(audios=audio_sample["array"], return_tensors="pt")
|
65 |
-
>>> # now, process some English text as well
|
66 |
-
>>> text_inputs = processor(text = "Hello, my dog is cute", src_lang="eng", return_tensors="pt")
|
67 |
-
```
|
68 |
|
69 |
|
70 |
### Speech
|
71 |
|
72 |
-
|
73 |
|
74 |
```python
|
75 |
-
|
76 |
-
|
77 |
```
|
78 |
|
79 |
-
With basically the same code, I've translated English text and Arabic speech to Russian speech samples.
|
80 |
-
|
81 |
### Text
|
82 |
|
83 |
-
Similarly, you can generate translated text from audio files or from text with the same model.
|
84 |
-
|
85 |
|
86 |
```python
|
87 |
-
|
88 |
-
|
89 |
-
|
90 |
-
>>> # from text
|
91 |
-
>>> output_tokens = model.generate(**text_inputs, tgt_lang="fra", generate_speech=False)
|
92 |
-
>>> translated_text_from_text = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
|
93 |
-
```
|
94 |
-
|
95 |
-
|
96 |
-
|
97 |
-
## Instructions to run inference with SeamlessM4T models
|
98 |
-
|
99 |
-
The SeamlessM4T models are currently available through the `seamless_communication` package. The `seamless_communication`
|
100 |
-
package can be installed by following the instructions outlined here: [Installation](https://github.com/facebookresearch/seamless_communication/tree/main#installation).
|
101 |
-
|
102 |
-
Once installed, a [`Translator`](https://github.com/facebookresearch/seamless_communication/blob/590547965b343b590d15847a0aa25a6779fc3753/src/seamless_communication/models/inference/translator.py#L47)
|
103 |
-
object can be instantiated to perform all five of the spoken language tasks. The `Translator` is instantiated with three arguments:
|
104 |
-
1. **model_name_or_card**: SeamlessM4T checkpoint. Can be either `seamlessM4T_medium` for the medium model, or `seamlessM4T_large` for the large model
|
105 |
-
2. **vocoder_name_or_card**: vocoder checkpoint (`vocoder_36langs`)
|
106 |
-
3. **device**: Torch device
|
107 |
-
|
108 |
-
```python
|
109 |
-
import torch
|
110 |
-
from seamless_communication.models.inference import Translator
|
111 |
|
112 |
-
|
113 |
-
|
114 |
-
|
115 |
```
|
116 |
|
117 |
-
|
118 |
-
|
119 |
-
Given an input audio with `<path_to_input_audio>` or an input text `<input_text>` in `<src_lang>`, we can translate
|
120 |
-
into `<tgt_lang>` as follows.
|
121 |
-
|
122 |
-
### S2ST and T2ST:
|
123 |
|
124 |
-
|
125 |
-
|
126 |
-
|
127 |
-
|
128 |
-
# T2ST
|
129 |
-
translated_text, wav, sr = translator.predict(<input_text>, "t2st", <tgt_lang>, src_lang=<src_lang>)
|
130 |
```
|
131 |
|
132 |
-
|
133 |
-
|
134 |
-
|
135 |
-
|
136 |
-
|
137 |
-
|
138 |
-
|
139 |
-
|
140 |
-
torchaudio.save(
|
141 |
-
<path_to_save_audio>,
|
142 |
-
wav[0].cpu(),
|
143 |
-
sample_rate=sr,
|
144 |
)
|
145 |
```
|
146 |
|
147 |
-
### S2TT, T2TT and ASR:
|
148 |
-
|
149 |
-
```python
|
150 |
-
# S2TT
|
151 |
-
translated_text, _, _ = translator.predict(<path_to_input_audio>, "s2tt", <tgt_lang>)
|
152 |
-
|
153 |
-
# ASR
|
154 |
-
# This is equivalent to S2TT with `<tgt_lang>=<src_lang>`.
|
155 |
-
transcribed_text, _, _ = translator.predict(<path_to_input_audio>, "asr", <src_lang>)
|
156 |
-
|
157 |
-
# T2TT
|
158 |
-
translated_text, _, _ = translator.predict(<input_text>, "t2tt", <tgt_lang>, src_lang=<src_lang>)
|
159 |
-
|
160 |
-
```
|
161 |
-
Note that `<src_lang>` must be specified for T2TT.
|
162 |
-
|
163 |
## Citation
|
164 |
|
165 |
If you plan to use SeamlessM4T in your work or any models/datasets/artifacts published in SeamlessM4T, please cite:
|
|
|
1 |
---
|
2 |
license: cc-by-nc-4.0
|
3 |
+
language:
|
4 |
+
- af
|
5 |
+
- am
|
6 |
+
- ar
|
7 |
+
- as
|
8 |
+
- az
|
9 |
+
- be
|
10 |
+
- bn
|
11 |
+
- bs
|
12 |
+
- bg
|
13 |
+
- ca
|
14 |
+
- cs
|
15 |
+
- zh
|
16 |
+
- cy
|
17 |
+
- da
|
18 |
+
- de
|
19 |
+
- el
|
20 |
+
- en
|
21 |
+
- et
|
22 |
+
- fi
|
23 |
+
- fr
|
24 |
+
- or
|
25 |
+
- om
|
26 |
+
- ga
|
27 |
+
- gl
|
28 |
+
- gu
|
29 |
+
- ha
|
30 |
+
- he
|
31 |
+
- hi
|
32 |
+
- hr
|
33 |
+
- hu
|
34 |
+
- hy
|
35 |
+
- ig
|
36 |
+
- id
|
37 |
+
- is
|
38 |
+
- it
|
39 |
+
- jv
|
40 |
+
- ja
|
41 |
+
- kn
|
42 |
+
- ka
|
43 |
+
- kk
|
44 |
+
- mn
|
45 |
+
- km
|
46 |
+
- ky
|
47 |
+
- ko
|
48 |
+
- lo
|
49 |
+
- ln
|
50 |
+
- lt
|
51 |
+
- lb
|
52 |
+
- lg
|
53 |
+
- lv
|
54 |
+
- ml
|
55 |
+
- mr
|
56 |
+
- mk
|
57 |
+
- mt
|
58 |
+
- mi
|
59 |
+
- my
|
60 |
+
- nl
|
61 |
+
- nb
|
62 |
+
- ne
|
63 |
+
- ny
|
64 |
+
- oc
|
65 |
+
- pa
|
66 |
+
- ps
|
67 |
+
- fa
|
68 |
+
- pl
|
69 |
+
- pt
|
70 |
+
- ro
|
71 |
+
- ru
|
72 |
+
- sk
|
73 |
+
- sl
|
74 |
+
- sn
|
75 |
+
- sd
|
76 |
+
- so
|
77 |
+
- es
|
78 |
+
- sr
|
79 |
+
- sv
|
80 |
+
- sw
|
81 |
+
- ta
|
82 |
+
- te
|
83 |
+
- tg
|
84 |
+
- tl
|
85 |
+
- th
|
86 |
+
- tr
|
87 |
+
- uk
|
88 |
+
- ur
|
89 |
+
- uz
|
90 |
+
- vi
|
91 |
+
- wo
|
92 |
+
- xh
|
93 |
+
- yo
|
94 |
+
- ms
|
95 |
+
- zu
|
96 |
+
- ary
|
97 |
+
- arz
|
98 |
+
- yue
|
99 |
+
- kea
|
100 |
+
metrics:
|
101 |
+
- bleu
|
102 |
+
- wer
|
103 |
+
- chrf
|
104 |
+
inference: False
|
105 |
+
pipeline_tag: automatic-speech-recognition
|
106 |
+
tags:
|
107 |
+
- audio-to-audio
|
108 |
+
- text-to-speech
|
109 |
+
- speech-to-text
|
110 |
+
- text2text-generation
|
111 |
+
- seamless_communication
|
112 |
library_name: fairseq2
|
113 |
---
|
|
|
114 |
# SeamlessM4T Medium
|
115 |
|
116 |
SeamlessM4T is a collection of models designed to provide high quality translation, allowing people from different
|
|
|
123 |
|
124 |
-------------------
|
125 |
|
126 |
+
**🌟 SeamlessM4T v2, an improved version of this model with a novel architecture, has been released [here](https://huggingface.co/facebook/seamless-m4t-v2-large).**
|
127 |
+
|
128 |
+
**This new model improves over SeamlessM4T v1 in quality as well as inference speed in speech generation tasks.**
|
129 |
|
130 |
**SeamlessM4T v2 is also supported by 🤗 Transformers, more on it [in the model card of this new version](https://huggingface.co/facebook/seamless-m4t-v2-large#transformers-usage) or directly in [🤗 Transformers docs](https://huggingface.co/docs/transformers/main/en/model_doc/seamless_m4t_v2).**
|
131 |
|
132 |
-------------------
|
133 |
|
134 |
+
This is the "medium" variant of SeamlessM4T, which enables multiple tasks without relying on multiple separate models:
|
135 |
- Speech-to-speech translation (S2ST)
|
136 |
- Speech-to-text translation (S2TT)
|
137 |
- Text-to-speech translation (T2ST)
|
|
|
139 |
- Automatic speech recognition (ASR)
|
140 |
|
141 |
## SeamlessM4T models
|
|
|
142 |
| Model Name | #params | checkpoint | metrics |
|
143 |
| ------------------ | ------- | --------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------ |
|
144 |
+
| [SeamlessM4T-Large v2](https://huggingface.co/facebook/seamless-m4t-v2-large) | 2.3B | [checkpoint](https://huggingface.co/facebook/seamless-m4t-v2-large/blob/main/seamlessM4T_v2_large.pt) | [metrics](https://dl.fbaipublicfiles.com/seamless/metrics/seamlessM4T_large_v2.zip) |
|
145 |
+
| [SeamlessM4T-Large (v1)](https://huggingface.co/facebook/seamless-m4t-large) | 2.3B | [checkpoint](https://huggingface.co/facebook/seamless-m4t-large/blob/main/multitask_unity_large.pt) | [metrics](https://dl.fbaipublicfiles.com/seamless/metrics/seamlessM4T_large.zip) |
|
146 |
+
| [SeamlessM4T-Medium (v1)](https://huggingface.co/facebook/seamless-m4t-medium) | 1.2B | [checkpoint](https://huggingface.co/facebook/seamless-m4t-medium/blob/main/multitask_unity_medium.pt) | [metrics](https://dl.fbaipublicfiles.com/seamless/metrics/seamlessM4T_medium.zip) |
|
147 |
|
148 |
+
Extensive evaluation results for the SeamlessM4T models, reported as averages in the [SeamlessM4T](https://arxiv.org/abs/2308.11596) and [Seamless](https://arxiv.org/abs/2312.05187) papers, are available in the `metrics` files above.
|
149 |
|
150 |
## 🤗 Transformers Usage
|
151 |
|
152 |
First, load the processor and a checkpoint of the model:
|
153 |
|
154 |
```python
|
155 |
+
import torchaudio
|
156 |
+
from transformers import AutoProcessor, SeamlessM4TModel
|
157 |
+
processor = AutoProcessor.from_pretrained("facebook/hf-seamless-m4t-medium")
|
158 |
+
model = SeamlessM4TModel.from_pretrained("facebook/hf-seamless-m4t-medium")
|
159 |
```
|
160 |
|
161 |
You can seamlessly use this model on text or on audio, to generate either translated text or translated audio.
|
|
|
163 |
Here is how to use the processor to process text and audio:
|
164 |
|
165 |
```python
|
166 |
+
# Read an audio file and resample to 16kHz:
|
167 |
+
audio, orig_freq = torchaudio.load("https://www2.cs.uic.edu/~i101/SoundFiles/preamble10.wav")
|
168 |
+
audio = torchaudio.functional.resample(audio, orig_freq=orig_freq, new_freq=16_000) # must be a 16 kHz waveform array
|
169 |
+
audio_inputs = processor(audios=audio, return_tensors="pt")
|
170 |
|
171 |
+
# Process some input text as well:
|
172 |
+
text_inputs = processor(text = "Hello, my dog is cute", src_lang="eng", return_tensors="pt")
|
173 |
+
```
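The processor expects a 16 kHz waveform, as the comment in the block above notes. A minimal, stdlib-only sketch for checking the sample rate of a local PCM WAV file before deciding whether to resample; `needs_resampling` and `TARGET_RATE` are hypothetical names used for illustration and are not part of 🤗 Transformers or torchaudio:

```python
import wave

TARGET_RATE = 16_000  # sample rate the README says the processor expects


def needs_resampling(path: str, target_rate: int = TARGET_RATE) -> bool:
    """Return True if the PCM WAV file at `path` is not already at `target_rate`."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate() != target_rate
```

The `wave` module only reads uncompressed PCM WAV files, so other formats would still need to go through `torchaudio.load` as shown above.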
|
174 |
|
175 |
### Speech
|
176 |
|
177 |
+
Generate speech in Russian from either text (T2ST) or speech input (S2ST):
|
178 |
|
179 |
```python
|
180 |
+
audio_array_from_text = model.generate(**text_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()
|
181 |
+
audio_array_from_audio = model.generate(**audio_inputs, tgt_lang="rus")[0].cpu().numpy().squeeze()
|
182 |
```
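The two calls above return 1-D waveform arrays. A minimal sketch of writing such an array to disk with the standard library only, assuming float samples in the range [-1.0, 1.0] and a 16 kHz output rate (an assumption here; verify it against your model's configuration, for example `model.config.sampling_rate` if your Transformers version exposes it). `save_waveform` is a hypothetical helper, not part of any library:

```python
import array
import sys
import wave


def save_waveform(samples, path, sample_rate=16_000):
    """Write mono float samples in [-1.0, 1.0] to `path` as 16-bit PCM WAV."""
    pcm = array.array("h")  # signed 16-bit integers
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))  # guard against overshoot
        pcm.append(int(round(clipped * 32767)))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm.tobytes())
```

With the arrays from the block above, usage would look like `save_waveform(audio_array_from_text, "out_from_text.wav")`.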
|
183 |
|
184 |
### Text
|
185 |
|
186 |
+
Similarly, you can generate translated text from audio files (S2TT) or from text (T2TT, conventionally MT) with the same model.
|
187 |
+
You only have to pass `generate_speech=False` to [`SeamlessM4TModel.generate`](https://huggingface.co/docs/transformers/main/en/model_doc/seamless_m4t#transformers.SeamlessM4TModel.generate).
|
188 |
|
189 |
```python
|
190 |
+
# from audio
|
191 |
+
output_tokens = model.generate(**audio_inputs, tgt_lang="fra", generate_speech=False)
|
192 |
+
translated_text_from_audio = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
|
193 |
|
194 |
+
# from text
|
195 |
+
output_tokens = model.generate(**text_inputs, tgt_lang="fra", generate_speech=False)
|
196 |
+
translated_text_from_text = processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True)
|
197 |
```
|
198 |
|
199 |
+
## Seamless_communication
|
200 |
+
You can also run the SeamlessM4T models through the [`seamless_communication` library](https://github.com/facebookresearch/seamless_communication/blob/main/docs/m4t/README.md)
|
201 |
|
202 |
+
with either the CLI:
|
203 |
+
```bash
|
204 |
+
m4t_predict <path_to_input_audio> --task s2st --tgt_lang <tgt_lang> --output_path <path_to_save_audio> --model_name seamlessM4T_medium
|
205 |
```
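When translating several files, it can be convenient to assemble the `m4t_predict` invocation programmatically. A minimal sketch that only builds the argument list, using the flags shown in the command above; `build_m4t_predict_command` and the file names are hypothetical, and the sketch does not run the CLI itself:

```python
import shlex


def build_m4t_predict_command(input_audio, tgt_lang, output_path,
                              task="s2st", model_name="seamlessM4T_medium"):
    """Return the m4t_predict argument list for one input file."""
    return [
        "m4t_predict", str(input_audio),
        "--task", task,
        "--tgt_lang", tgt_lang,
        "--output_path", str(output_path),
        "--model_name", model_name,
    ]


# Placeholder file names; replace with real paths.
command = build_m4t_predict_command("input.wav", "fra", "output.wav")
print(shlex.join(command))  # shell-quoted form, handy for logging
```

Once `seamless_communication` is installed, the list could be passed to `subprocess.run(command, check=True)`.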
|
206 |
+
or the `Translator` API:
|
207 |
+
```py
|
208 |
+
import torch
|
209 |
+
from seamless_communication.inference import Translator
|
210 |
|
211 |
+
# Initialize a Translator object with a multitask model and vocoder on the GPU.
|
212 |
+
translator = Translator("seamlessM4T_medium", "vocoder_36langs", torch.device("cuda:0"), torch.float16)
|
213 |
+
text_output, speech_output = translator.predict(
|
214 |
+
input=<path_to_input_audio>,
|
215 |
+
task_str="S2ST",
|
216 |
+
tgt_lang=<tgt_lang>,
|
217 |
+
text_generation_opts=text_generation_opts,
|
218 |
+
unit_generation_opts=unit_generation_opts
|
219 |
)
|
220 |
```
|
221 |
|
222 |
## Citation
|
223 |
|
224 |
If you plan to use SeamlessM4T in your work or any models/datasets/artifacts published in SeamlessM4T, please cite:
|