One voice for multiple voiceovers
Hi, I'm trying to build a service for voicing documents.
I split the text into sentences and synthesize each one separately, but here's the problem: the voice is different for every sentence.
Is it possible to set some kind of voice-generation SID for more control over the streamed output?
One thing that will help somewhat is to fix the seed.
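For example, here is a minimal sketch of what that could look like, assuming the base parler-tts/parler_tts_mini_v0.1 checkpoint and one shared description for every sentence (resetting the seed before each call only makes the sampling reproducible; it doesn't guarantee an identical voice):

import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer, set_seed
import soundfile as sf

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler_tts_mini_v0.1").to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler_tts_mini_v0.1")

# One shared voice description for the whole document
description = "A female speaker delivers her words at a moderate pace with very clear audio quality."
input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)

sentences = ["First sentence of the document.", "Second sentence of the document."]
for i, sentence in enumerate(sentences):
    prompt_input_ids = tokenizer(sentence, return_tensors="pt").input_ids.to(device)
    set_seed(42)  # reset the RNG before every call so each sentence is sampled from the same state
    generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids, min_length=10)
    sf.write(f"sentence_{i}.wav", generation.cpu().numpy().squeeze(), model.config.sampling_rate)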
I have the same question, without being able to pick the voice it's not a practical TTS model for any serious usage.
Hey @koplenov and @juang3d, thanks for opening the issue!
It's a problem we're aware of, and one we'll be trying to solve for V1.
It's still very preliminary, but I've also experimented with fine-tuning to get consistent voices. I've fine-tuned the model on the 30-hour, single-speaker, high-quality Jenny dataset and got the following checkpoint: ylacombe/parler-tts-mini-jenny-30H.
Usage is more or less the same as Parler-TTS v0.1; just specify the keyword "Jenny" in the voice description:
import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer, set_seed
import soundfile as sf

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Load the Jenny fine-tuned checkpoint and its tokenizer
model = ParlerTTSForConditionalGeneration.from_pretrained("ylacombe/parler-tts-mini-jenny-30H").to(device)
tokenizer = AutoTokenizer.from_pretrained("ylacombe/parler-tts-mini-jenny-30H")

# The prompt is the text to be spoken; the description conditions the voice
prompt = "Hey, how are you doing today? My name is Jenny, and I'm here to help you with any questions you have."
description = "Jenny speaks at an average pace with an animated delivery in a very confined sounding environment with clear audio quality."

input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

set_seed(42)
# specify min_length to avoid 0-length generations
generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids, min_length=10)

audio_arr = generation.cpu().numpy().squeeze()
sf.write("parler_tts_out.wav", audio_arr, model.config.sampling_rate)
Some samples:
Let me know if this helps!
Interesting, but what about other voices besides "Jenny"?
@ylacombe I ran the demo above and hit this error:
You set add_prefix_space. The tokenizer needs to be converted from the slow tokenizers.
Using the model-agnostic default max_length (=2580) to control the generation length. We recommend setting max_new_tokens to control the maximum length of the generation.
Calling sample directly is deprecated and will be removed in v4.41. Use generate or a custom generation loop instead.
--- Logging error ---
Traceback (most recent call last):
File "/home/leyuan/micromamba/envs/ns3/lib/python3.10/logging/init.py", line 1100, in emit
msg = self.format(record)
File "/home/leyuan/micromamba/envs/ns3/lib/python3.10/logging/init.py", line 943, in format
return fmt.format(record)
File "/home/leyuan/micromamba/envs/ns3/lib/python3.10/logging/init.py", line 678, in format
record.message = record.getMessage()
File "/home/leyuan/micromamba/envs/ns3/lib/python3.10/logging/init.py", line 368, in getMessage
msg = msg % self.args
TypeError: not all arguments converted during string formatting
Call stack:
File "/home/leyuan/VivaConversion/research/StyleTTS2/parler.py", line 19, in
generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids, min_length=10)
File "/home/leyuan/micromamba/envs/ns3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/leyuan/micromamba/envs/ns3/lib/python3.10/site-packages/parler_tts/modeling_parler_tts.py", line 2608, in generate
outputs = self.sample(
File "/home/leyuan/micromamba/envs/ns3/lib/python3.10/site-packages/transformers/generation/utils.py", line 2584, in sample
return self._sample(*args, **kwargs)
File "/home/leyuan/micromamba/envs/ns3/lib/python3.10/site-packages/transformers/generation/utils.py", line 2730, in _sample
logger.warning_once(
File "/home/leyuan/micromamba/envs/ns3/lib/python3.10/site-packages/transformers/utils/logging.py", line 329, in warning_once
self.warning(*args, **kwargs)
Message: 'eos_token_id is deprecated in this function and will be removed in v4.41, use stopping_criteria=StoppingCriteriaList([EosTokenCriteria(eos_token_id=eos_token_id)]) instead. Otherwise make sure to set model.generation_config.eos_token_id'
Arguments: (<class 'FutureWarning'>,)
Hey @shacharm ,
- I've created descriptions for the Jenny TTS dataset using the guide here.
- Then I simply use the Parler TTS training script from the repository!
Note that I'll upload a Colab with detailed steps, but the steps above should get you started!
Thanks @ylacombe, much appreciated.
Generated a dataset with kids' shows (fun) voices (single voice).
Onward to training.
Thanks!
EDIT: I can see the bug below was already fixed here.
Verified on my dataset: text_description is generated correctly.
@ylacombe - at the very end of the dataspeech process, I believe there's a small bug in dataspeech / run_prompt_creation_single_speaker.py / prepare_dataset that causes the speaker_name not to appear in the final tagged dataset.
It's sample_prompt.replace(f"[speaker_name]", data_args.speaker_name)
and it should be sample_prompt = sample_prompt.replace(f"[speaker_name]", data_args.speaker_name).
Otherwise the "[speaker_name]" placeholder in "text_description" isn't replaced.
Parler fine-tuning question:
I've set --train_dataset_config_name "default", but I'm unsure what train_dataset_config_name actually is.
Although it's optional, run_parler_tts_training.py crashes without it.
What's its purpose?
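(For readers hitting the same thing, my understanding, not confirmed here by the maintainers: the value is passed through to datasets.load_dataset as the dataset's config name, i.e. the named subset of a Hub dataset; single-config datasets typically expose a config called "default". A minimal sketch of what a config name selects, using a well-known multi-config dataset rather than the Parler-TTS script itself:)

from datasets import load_dataset

# "mrpc" is the config name here; train_dataset_config_name plays the same role
# for the training dataset (illustration only).
ds = load_dataset("glue", "mrpc", split="train")

# A single-config dataset is typically loaded with the config "default":
# ds = load_dataset("your-username/your-tts-dataset", "default", split="train")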
Hi guys,
If this is of any use/interest, I fine-tuned the model on LibriTTS speaker 0 (en_US-libritts-high, p3922).
GrigoriiA/parler-tts-mini-Libretta-v0.1
For the dataset, I took Jenny's texts and had Piper-TTS speak them in the desired voice. I followed the fine-tuning instructions and basically everything just worked.
Except two things:
- There was a problem generating audio for five or so texts because of a missing-phoneme error; I just had to exclude those texts.
- During dataset tagging, the "noise" attribute was always detected as high, noisy, etc. Since it was pure Piper dictation, I manually set all the noise labels to "quite clear" (see the sketch below).
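(A minimal sketch of one way to override the tag in bulk with 🤗 datasets, assuming the column is literally called "noise"; the column name may differ in your dataspeech output:)

from datasets import load_dataset

ds = load_dataset("GrigoriiA/libretta-tts-21k-tagged", split="train")  # the dataset mentioned below

# Overwrite the noise tag for every example before prompt creation.
ds = ds.map(lambda example: {"noise": "quite clear"})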
The first run, with all the quirks, took the whole day, but I guess it would now take me around a couple of hours to do the same thing (dataset generation + training). The training itself took a bit less than an hour on a rented RTX 4090 in RunPod's community cloud, so it would have cost about $0.50 if I hadn't made mistakes along the way (like running out of disk space; the whole process needs up to 40 GB).
One thing I noticed: whatever I tell the model, the voice pretty much stays the same. Is it overtraining, or a dataset problem?
I would appreciate it if someone competent could take a look at GrigoriiA/libretta-tts-21k-tagged. I'm sure others will run into similar problems too.
Piper original
Parler-Libretta "Libretta asks a question in low voice almost whispering"
Parler-Libretta 1 "A female speaker with a slightly low-pitched voice delivers her words quite expressively."
Parler-Libretta "A male speaker with a slightly low-pitched voice delivers his words quite expressively"
Parler-Libretta "A female speaker with a very high-pitched voice speaks very fast."
Parler-Libretta "A happy and cheerful female speaker is speaking extremely slowly."
Could it be the same situation as with LLMs? Perhaps we shouldn't fine-tune solely on our own datasets, but rather mix them with the original Parler training data in some ratio (50/50, 25/75, etc.). That way the model won't forget its original training data and won't lose the ability to produce other voices.
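(A minimal sketch of that kind of mixing with 🤗 datasets; both dataset names are placeholders, and the 50/50 ratio is just the example above:)

from datasets import load_dataset, interleave_datasets

new_voice = load_dataset("your-username/your-new-voice-dataset", split="train")         # placeholder
original = load_dataset("your-username/subset-of-original-parler-data", split="train")  # placeholder

# Sample from the two sources with equal probability; adjust probabilities for 25/75 etc.
mixed = interleave_datasets([new_voice, original], probabilities=[0.5, 0.5], seed=42)
mixed.push_to_hub("your-username/mixed-finetuning-dataset")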
Hey @GrigoriiA, I believe the model likely overfit. How many hours of training data are you using? Feel free to send your training logs as well!
Note though that the model as it stands can't generate whispering or emotions, since they were not labeled in the training dataset, so that won't work anyway.
Regarding your last question: to fully train from scratch, I'd say at least 1k hours (to get somewhat decent results), and more is better. To fine-tune, you'd do fine with 6h, maybe even less. I hope that helps!
BTW, here is a fine-tuning guide to reproduce fine-tuning on a single speaker dataset, using a free colab GPU: https://colab.research.google.com/github/ylacombe/scripts_and_notebooks/blob/main/Finetuning_Parler_TTS_on_a_single_speaker_dataset.ipynb
@ylacombe yeah, I actually followed your script and tried different amounts of training data, from 30 hours down to around 6.
What I had missed was supplying "gender" in my dataset. After I fixed that, and also mixed your original training dataset into mine, the new model started to behave as expected.
My final training dataset was 55% my own data (4,100 records) and 45% old data (1164+332+1000+1000 ≈ 3,500 records from MLS + Libri).
I noticed your new "expresso" release, where you also mixed datasets (old + new), so my intuitive guess was right.
I also noticed that Expresso contains the Jenny dataset, but her voice doesn't work in this set:
prompt = "Hey, how are you doing today? My name is Jenny, and I'm here to help you with any questions you have."
description = "Jenny speaks at an average pace with an animated delivery in a very confined sounding environment with clear audio quality."
I retrained Expresso with your data plus my voice data and saw the same effect with my speaker's name. Perhaps names aren't getting enough attention during training?
I sincerely applaud the results that you and your team are getting with your training, and also how you share all the steps and datasets. It opens a road for a lot of people to build and create.
How can I add Indonesian to Parler?