Text-to-Speech
Transformers
Safetensors
English
parler_tts
text2text-generation
annotation
sanchit-gandhi (HF staff) committed on commit 01fcd6b (1 parent: 30a7503)

Update README.md

Files changed (1): README.md (+316 -150)

Updated README.md:

---
library_name: transformers
tags:
- text-to-speech
- annotation
license: apache-2.0
language:
- en
pipeline_tag: text-to-speech
inference: false
datasets:
- ylacombe/expresso
- reach-vb/jenny_tts_dataset
- blabble-io/libritts_r
---

<img src="https://huggingface.co/datasets/parler-tts/images/resolve/main/thumbnail.png" alt="Parler Logo" width="800" style="margin-left:'auto' margin-right:'auto' display:'block'"/>

# Parler-TTS Mini: Expresso v0.1

TODO: update link to space

<a target="_blank" href="https://huggingface.co/spaces/parler-tts/parler_tts_mini_expresso">
<img src="https://huggingface.co/datasets/huggingface/badges/raw/main/open-in-hf-spaces-sm.svg" alt="Open in HuggingFace"/>
</a>

**Parler-TTS Mini: Expresso v0.1** is a fine-tuned version of [Parler-TTS Mini v0.1](https://huggingface.co/parler-tts/parler_tts_mini_v0.1)
on the [Expresso](https://huggingface.co/datasets/ylacombe/expresso) dataset. It is a lightweight text-to-speech (TTS)
model that can generate high-quality, natural-sounding speech. Compared to the original model, Expresso v0.1 provides
superior control over **emotions** (happy, confused, laughing, sad) and **consistent voices** (Jerry, Thomas, Elisabeth, Talia).

It is part of the first release from the [Parler-TTS](https://github.com/huggingface/parler-tts) project, which aims to
provide the community with TTS training resources and dataset pre-processing code. Details for reproducing this entire
training run are provided in the section [Training Procedure](#training-procedure).

## Usage

Using Expresso v0.1 is as simple as "bonjour". Simply install the library from source:

```sh
pip install git+https://github.com/huggingface/parler-tts.git
```

You can then use the model with the following inference snippet:

```py
import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer, set_seed
import soundfile as sf

device = "cuda:0" if torch.cuda.is_available() else "cpu"

model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts-mini-expresso").to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-mini-expresso")

prompt = "My name's Thomas, one of four voices this model can produce."
description = "Thomas speaks moderately slowly in a happy tone with high quality audio."

input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

set_seed(42)
generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio_arr = generation.cpu().numpy().squeeze()
sf.write("parler_tts_out.wav", audio_arr, model.config.sampling_rate)
```

**Tips**:
* Specify the name of a male speaker (Jerry, Thomas) or female speaker (Talia, Elisabeth) for consistent voices
* The model can generate in a range of emotions, including: "happy", "confused", "default" (meaning no particular emotion conveyed), "laughing", "sad", "whisper", "emphasis"
* Include the term "high quality audio" to generate the highest quality audio, and "very noisy audio" for high levels of background noise
* Punctuation can be used to control the prosody of the generations, e.g. use commas to add small breaks in speech
* Wrap words in asterisks to emphasise them (e.g. `*you*`), as in the sketch below
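
For instance, here is a minimal sketch that reuses the `model`, `tokenizer` and `device` objects loaded in the snippet above; the prompt and description wording is purely illustrative, combining a speaker name, the "whisper" emotion, an emphasised word and the "high quality audio" phrase:

```py
# Illustrative follow-up to the snippet above: vary the description to steer
# speaker, emotion and emphasis. Reuses `model`, `tokenizer`, `device`, `sf`.
prompt = "Keep it down, *please*, everyone is still asleep."
description = "Elisabeth whispers slowly with high quality audio."

input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

set_seed(42)
generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
sf.write("parler_tts_whisper_out.wav", generation.cpu().numpy().squeeze(), model.config.sampling_rate)
```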

## Training Procedure

Expresso is a high-quality, expressive speech dataset that includes samples from four speakers (two male, two female).
By fine-tuning Parler-TTS Mini v0.1 on this dataset, we can train the model to follow emotion and speaker prompts.

To reproduce this fine-tuning run, we need to perform two steps:
1. Create text descriptions from the audio samples in the Expresso dataset
2. Train the model on the (text, audio) pairs

Step 1 is performed using the [DataSpeech](https://github.com/huggingface/dataspeech) library, and step 2 using
[Parler-TTS](https://github.com/huggingface/parler-tts). Should you wish to use the pre-annotated dataset from our
experiments, you can jump straight to [step 2](#step-2--fine-tune-the-model). In either case, follow step 0 to get set up.

### Step 0: Set-Up

We'll start by creating a fresh Python environment:

```sh
python3 -m venv parler-env
source parler-env/bin/activate
```

Next, install PyTorch according to the [official instructions](https://pytorch.org/get-started/locally/). We can then
install DataSpeech and Parler-TTS sequentially:

```sh
git clone git@github.com:huggingface/dataspeech.git && cd dataspeech && pip install -r requirements.txt
cd ..
git clone https://github.com/huggingface/parler-tts.git && cd parler-tts && pip install -e .[train]
cd ..
```

You can link your Hugging Face account so that you can push model repositories to the Hub and share your trained models
with the community. Simply run the command:

```sh
git config --global credential.helper store
huggingface-cli login
```

And then enter an authentication token from https://huggingface.co/settings/tokens. Create a new token if you do not
have one already. You should make sure that this token has "write" privileges.
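
If you prefer to stay in Python, the same login can be performed with the `huggingface_hub` library; a small equivalent sketch:

```py
from huggingface_hub import login

# Opens an interactive prompt for a Hub token (use one with "write" privileges).
login()
```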

You also have the option to configure Accelerate by running the following command. Note that you should set the number
of GPUs you wish to use for training/inference, and also the data type (dtype) based on your device (e.g. bfloat16 on
A100 GPUs, float16 on V100 GPUs, etc.):

```sh
accelerate config
```

Optionally, you can also log in to Weights and Biases for automatic logging:

```sh
wandb login
```

### Step 1: Create Text Descriptions

Creating text descriptions for the dataset comprises three sub-stages from DataSpeech, which we'll cover below.

#### 1.A. Annotate the Expresso dataset

We'll use the [`main.py`](https://github.com/huggingface/dataspeech/blob/main/main.py) file from DataSpeech to label
the following continuous variables:
- Speaking rate
- Signal-to-noise ratio (SNR)
- Reverberation
- Speech monotony

This can be done with the following command:

```sh
python ./dataspeech/main.py "ylacombe/expresso" \
    --configuration "default" \
    --text_column_name "text" \
    --audio_column_name "audio" \
    --cpu_num_workers 8 \
    --rename_column \
    --repo_id "expresso-tags"
```

Note that the script will be faster if you have GPUs at your disposal. It will automatically scale up to every GPU available in your environment.

The resulting dataset will be pushed to the Hugging Face Hub under your Hugging Face handle. Mine was pushed to [reach-vb/expresso-tags](https://huggingface.co/datasets/reach-vb/expresso-tags).
We can see that the dataset is annotated with continuous features like "speaking_rate" and "snr".
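
To sanity-check the annotations, you can load the tagged dataset and inspect a sample; a small sketch, assuming the dataset was pushed under the name above and exposes the "speaking_rate" and "snr" columns added by the annotation step:

```py
from datasets import load_dataset

# Swap in the repository pushed under your own Hub handle.
tagged = load_dataset("reach-vb/expresso-tags", split="train")
print(tagged[0]["speaking_rate"], tagged[0]["snr"])
```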

#### 1.B. Map annotations to text bins

The next step involves mapping the continuous variables to discrete ones. This is achieved by binning the continuous
variables into buckets and assigning each bucket a text label.
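
To illustrate the idea (this is not the actual DataSpeech implementation, and the bin edges and labels below are made up), mapping a continuous feature such as the speaking rate to a text bin could look like this:

```py
import numpy as np

# Hypothetical bin edges (in arbitrary speaking-rate units) and text labels.
bin_edges = [0.0, 3.0, 5.0, 8.0, float("inf")]
labels = ["very slowly", "slowly", "moderately", "quickly"]

def rate_to_text_bin(speaking_rate: float) -> str:
    # np.digitize returns the 1-based index of the bucket the value falls into.
    idx = np.digitize(speaking_rate, bin_edges) - 1
    return labels[min(idx, len(labels) - 1)]

print(rate_to_text_bin(4.2))  # -> "slowly"
```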

Since the ultimate goal here is to fine-tune the [Parler-TTS v0.1 checkpoint](https://huggingface.co/parler-tts/parler_tts_mini_v0.1)
on the Expresso dataset, we want to stay consistent with the text bins of the dataset on which the original model was trained.

To do this, we'll pass [`v01_bin_edges.json`](https://github.com/huggingface/dataspeech/blob/main/examples/tags_to_annotations/v01_bin_edges.json)
as an input argument to our script, which holds the bin edges from the original dataset:

```sh
python ./dataspeech/scripts/metadata_to_text.py \
    "reach-vb/expresso-tags" \
    --repo_id "expresso-tags" \
    --configuration "default" \
    --cpu_num_workers "8" \
    --path_to_bin_edges "./examples/tags_to_annotations/v01_bin_edges.json" \
    --avoid_pitch_computation
```

Since we leverage the bins from the original dataset, the above script only takes a few seconds. The resulting dataset
will be pushed to the Hugging Face Hub under your Hugging Face handle. Mine was pushed to [reach-vb/expresso-tags](https://huggingface.co/datasets/reach-vb/expresso-tags).

You will notice that text bins such as "slightly noisy" and "quite monotone" have been added to the samples.

#### 1.C. Create natural language descriptions from those text bins

Now that we have text bins associated with the Expresso dataset, the next step is to create natural language descriptions.
This involves passing the text bins to a large language model (LLM) and having it generate corresponding descriptions.

There is a template [prompt creation script](https://github.com/huggingface/dataspeech/blob/main/scripts/run_prompt_creation.py)
in DataSpeech which can be used to generate descriptions from the features tagged in [step 1.A](#1a-annotate-the-expresso-dataset) (reverberation, noise, speaking rate, etc.).

However, not all of these features are relevant for the Expresso dataset. For instance, Expresso was recorded in a
professional recording studio, so all the samples are high quality. Thus, we chose to create prompts with the following subset of features:
1. Name: we mapped the speaker ids (ex1, ex2, ex3, ex4) to unique speaker names (Jerry, Elisabeth, Thomas, Talia). This encourages the model to learn specific speakers from the training data
2. Emotion: we include the emotion provided in the Expresso dataset
3. Speaking rate: we use the pre-computed text bins from the previous step
4. Audio quality: in addition, we hard-coded the quality of the audio to be "very high-quality", given the studio recording conditions

As an example, if we passed:
1. Speaker: Jerry
2. Emotion: confused
3. Speaking rate: moderate speed

We would expect to generate a sample along the lines of: "Jerry speaks with a confused tone and at a moderate speed with high quality audio."
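
Concretely, the features for each sample could be assembled into an LLM prompt along these lines; this is an illustrative sketch only (the exact template lives in the modified prompt creation script below), reusing the speaker-name mapping described above:

```py
# Illustrative only: the real template is defined in run_prompt_creation_expresso.py.
SPEAKER_NAMES = {"ex1": "Jerry", "ex2": "Elisabeth", "ex3": "Thomas", "ex4": "Talia"}

def build_llm_prompt(speaker_id: str, emotion: str, speaking_rate_bin: str) -> str:
    features = (
        f"name: {SPEAKER_NAMES[speaker_id]}, "
        f"emotion: {emotion}, "
        f"speaking rate: {speaking_rate_bin}, "
        "audio quality: very high-quality"
    )
    return "Write a one-sentence description of a voice recording with these features: " + features

print(build_llm_prompt("ex1", "confused", "moderate speed"))
```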

The modified prompt creation script can be found in this repository. You can download it and place it alongside the DataSpeech scripts with the following Python commands:

```python
from huggingface_hub import hf_hub_download
import shutil

script_path = hf_hub_download(repo_id="parler-tts/parler_tts_mini_expresso_v0.1", filename="run_prompt_creation.py")
shutil.copy(script_path, "./dataspeech/run_prompt_creation_expresso.py")
```

You can then launch prompt creation using the [Mistral Instruct 7B](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)
model with the following command:

```sh
accelerate launch ./dataspeech/run_prompt_creation_expresso.py \
    --dataset_name "reach-vb/expresso-tags" \
    --dataset_config_name "default" \
    --model_name_or_path "mistralai/Mistral-7B-Instruct-v0.2" \
    --per_device_eval_batch_size 32 \
    --attn_implementation "sdpa" \
    --dataloader_num_workers 8 \
    --output_dir "./tmp_expresso" \
    --load_in_4bit \
    --push_to_hub \
    --hub_dataset_id "expresso-tagged-w-speech-mistral" \
    --preprocessing_num_workers 16
```

Note that the Mistral model is gated, so you should ensure you have accepted the terms-of-use from the [model card](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2).
You can find the annotated dataset under TODO [reach-vb/expresso-tagged-w-speech-mistral](https://huggingface.co/datasets/reach-vb/expresso-tagged-w-speech-mistral),
where you'll find sensible descriptions from the features that we passed.

This step generally demands more resources and time, and should be run on one or more GPUs. Scaling to multiple GPUs using [distributed data parallelism (DDP)](https://pytorch.org/tutorials/beginner/ddp_series_theory.html)
is trivial: simply run `accelerate config` and select the multi-GPU option, specifying the IDs of the GPUs you wish to use. The
above script can then be run using DDP with no code changes.

If you are resource constrained and need to use a smaller model, [Gemma 2B](https://huggingface.co/google/gemma-2b-it)
is an excellent choice.

### Step 2: Fine-Tune the Model

Fine-tuning is performed using the Parler-TTS training script [run_parler_tts_training.py](https://github.com/huggingface/parler-tts/blob/main/training/run_parler_tts_training.py).
It is the same script used to pre-train the model, and can be used for fine-tuning without any code changes.

To preserve the model's ability to generate speech with generic voice descriptions, such as in the style of
[Parler-TTS Mini v0.1](https://huggingface.co/parler-tts/parler_tts_mini_v0.1), we fine-tuned the model
on a combination of three datasets, including the test split of LibriTTS-R:
1. [Expresso](https://huggingface.co/datasets/ylacombe/expresso)
2. [Jenny](https://huggingface.co/datasets/reach-vb/jenny_tts_dataset)
3. [LibriTTS-R](https://huggingface.co/datasets/blabble-io/libritts_r)

This was achieved through the following command:

```sh
accelerate launch ./training/run_parler_tts_training.py \
    --model_name_or_path "parler-tts/parler_tts_mini_v0.1" \
    --feature_extractor_name "parler-tts/dac_44khZ_8kbps" \
    --description_tokenizer_name "parler-tts/parler_tts_mini_v0.1" \
    --prompt_tokenizer_name "parler-tts/parler_tts_mini_v0.1" \
    --report_to "wandb" \
    --overwrite_output_dir true \
    --train_dataset_name "ylacombe/expresso+reach-vb/jenny_tts_dataset+blabble-io/libritts_r+blabble-io/libritts_r" \
    --train_metadata_dataset_name "reach-vb/expresso-tagged-w-speech-mistral-v3+ylacombe/jenny-tts-10k-tagged+parler-tts/libritts_r_tags_tagged_10k_generated+parler-tts/libritts_r_tags_tagged_10k_generated" \
    --train_dataset_config_name "read+default+clean+other" \
    --train_split_name "train+train[:20%]+test.clean+test.other" \
    --eval_dataset_name "ylacombe/expresso+reach-vb/jenny_tts_dataset+blabble-io/libritts_r+blabble-io/libritts_r" \
    --eval_metadata_dataset_name "reach-vb/expresso-tagged-w-speech-mistral-v3+ylacombe/jenny-tts-10k-tagged+parler-tts/libritts_r_tags_tagged_10k_generated+parler-tts/libritts_r_tags_tagged_10k_generated" \
    --eval_dataset_config_name "read+default+clean+other" \
    --eval_split_name "train+train[:20%]+test.clean+test.other" \
    --max_eval_samples 8 \
    --per_device_eval_batch_size 16 \
    --target_audio_column_name "audio" \
    --description_column_name "text_description" \
    --prompt_column_name "text" \
    --max_duration_in_seconds 30.0 \
    --min_duration_in_seconds 2.0 \
    --max_text_length 400 \
    --preprocessing_num_workers 2 \
    --do_train true \
    --num_train_epochs 8 \
    --gradient_accumulation_steps 8 \
    --gradient_checkpointing true \
    --per_device_train_batch_size 16 \
    --learning_rate 0.00008 \
    --adam_beta1 0.9 \
    --adam_beta2 0.99 \
    --weight_decay 0.01 \
    --lr_scheduler_type "cosine" \
    --warmup_steps 250 \
    --logging_steps 2 \
    --freeze_text_encoder true \
    --audio_encoder_per_device_batch_size 4 \
    --dtype "bfloat16" \
    --seed 456 \
    --output_dir "./parler-tts-mini-expresso" \
    --temporary_save_to_disk "./audio_code_tmp" \
    --save_to_disk "./tmp_dataset_audio" \
    --dataloader_num_workers 4 \
    --do_eval \
    --predict_with_generate \
    --include_inputs_for_metrics \
    --group_by_length true
```

On a single 80GB A100 GPU, training took approximately 1.5 hours and returned a final evaluation loss of 4.0. Again, the
script can be configured for multiple GPUs by running `accelerate config` from the command line; no further
code changes are required.

Training performance is quite sensitive to the learning rate and number of epochs: you should tune these according to your task
and the size of your dataset. In our experiments, we found the best performance to occur after 8 epochs of training
with a learning rate of 8e-5.
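
Once training has finished, you can point the inference snippet from the [Usage](#usage) section at the local checkpoint instead of the Hub repository; a small sketch, assuming the `--output_dir` from the command above and the original tokenizer:

```py
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer

# Load the fine-tuned weights saved by the training script (see --output_dir above).
model = ParlerTTSForConditionalGeneration.from_pretrained("./parler-tts-mini-expresso")
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler_tts_mini_v0.1")
```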

If you followed these steps to the end: congratulations! You should now have a fine-tuned model you can use for your
downstream applications with the [inference code example](#usage) above. You can try substituting your own dataset, or
run training using a single-speaker dataset, like the [Jenny example](https://colab.research.google.com/github/ylacombe/scripts_and_notebooks/blob/main/Finetuning_Parler_TTS_on_a_single_speaker_dataset.ipynb).

## Motivation

Parler-TTS is a reproduction of work from the paper [Natural language guidance of high-fidelity text-to-speech with synthetic annotations](https://www.text-description-to-speech.com) by Dan Lyth and Simon King, from Stability AI and Edinburgh University respectively.

In contrast to other TTS models, Parler-TTS is a **fully open-source** release. All datasets, pre-processing, training code and weights are released publicly under a permissive license, enabling the community to build on our work and develop their own powerful TTS models.
Parler-TTS was released alongside:
* [The Parler-TTS repository](https://github.com/huggingface/parler-tts) - you can train and fine-tune your own version of the model.
* [The Data-Speech repository](https://github.com/huggingface/dataspeech) - a suite of utility scripts designed to annotate speech datasets.
* [The Parler-TTS organization](https://huggingface.co/parler-tts) - where you can find the annotated datasets as well as future checkpoints.

## Citation

If you found this repository useful, please consider citing this work and also the original Stability AI paper:

```
@misc{lacombe-etal-2024-parler-tts,
  author = {Yoach Lacombe and Vaibhav Srivastav and Sanchit Gandhi},
  title = {Parler-TTS},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/huggingface/parler-tts}}
}
```

```
@misc{lyth2024natural,
  title={Natural language guidance of high-fidelity text-to-speech with synthetic annotations},
  author={Dan Lyth and Simon King},
  year={2024},
  eprint={2402.01912},
  archivePrefix={arXiv},
  primaryClass={cs.SD}
}
```

## License

This model is permissively licensed under the Apache 2.0 license.