FineTuning for Single Speaker

by skjdhuhsnjd - opened Dec 20, 2024

Dec 20, 2024

Hi, I'm new to IndicParler TTS. I'm trying to fine-tune it for a single speaker, but I'm encountering this error: 'TypeError: 'NoneType' object is not subscriptable'.

I suspect the issue might be related to using --feature_extractor_name "parler-tts/dac_44khZ_8kbps" because I couldn't find a feature extractor specifically for IndicParler. I'm a beginner and would appreciate some guidance.

AshwinSankar

AI4Bharat org Dec 20, 2024

Hi,

We do not train or finetune DAC on Indic Parler TTS data, but rather use the pretrained one from ylacombe/dac_44khz. You should be able to use that. That being said, AutoProcessor.from_pretrained("ai4bharat/indic-parler-tts", trust_remote_code=True) should also work. Would be able to look into it if you can share a code snippet.

Thank you for showing interest in Indic Parler TTS.

skjdhuhsnjd

Dec 20, 2024

•

edited Dec 22, 2024

First of all, thank you so much for your time. I'm using the following script:

!accelerate launch ./training/run_parler_tts_training.py \ --model_name_or_path "ai4bharat/indic-parler-tts-pretrained" \ --feature_extractor_name "ylacombe/dac_44khz" \ --description_tokenizer_name "ai4bharat/indic-parler-tts-pretrained" \ --prompt_tokenizer_name "ai4bharat/indic-parler-tts-pretrained" \ --report_to "wandb" \ --overwrite_output_dir true \ --train_dataset_name "mavihsrr/Hindi_TTS_M-2k" \ --train_metadata_dataset_name "skjdhuhsnjd/h-t-tagged" \ --train_dataset_config_name "default" \ --train_split_name "train" \ --eval_dataset_name "mavihsrr/Hindi_TTS_M-2k" \ --eval_metadata_dataset_name "skjdhuhsnjd/h-t-tagged" \ --eval_dataset_config_name "default" \ --eval_split_name "train" \ --max_eval_samples 8 \ --per_device_eval_batch_size 8 \ --target_audio_column_name "audio" \ --description_column_name "text_description" \ --prompt_column_name "text" \ --max_duration_in_seconds 20 \ --min_duration_in_seconds 2.0 \ --max_text_length 400 \ --preprocessing_num_workers 2 \ --do_train true \ --num_train_epochs 2 \ --gradient_accumulation_steps 18 \ --gradient_checkpointing true \ --per_device_train_batch_size 2 \ --learning_rate 0.00008 \ --adam_beta1 0.9 \ --adam_beta2 0.99 \ --weight_decay 0.01 \ --lr_scheduler_type "constant_with_warmup" \ --warmup_steps 50 \ --logging_steps 2 \ --freeze_text_encoder true \ --audio_encoder_per_device_batch_size 4 \ --dtype "float16" \ --seed 456 \ --output_dir "./output_dir_training/" \ --temporary_save_to_disk "./audio_code_tmp/" \ --save_to_disk "./tmp_dataset_audio/" \ --dataloader_num_workers 2 \ --do_eval \ --predict_with_generate \ --include_inputs_for_metrics \ --group_by_length true

However, I keep getting this error:
subprocess.CalledProcessError: Command '['/usr/bin/python3', './training/run_parler_tts_training.py', ...]' returned non-zero exit status 1

When I use the tokenizer (ylacombe/parler-tts-mini-v1-Jenny-colab) for both description and prompt, the process completes without errors, but the output audio quality is terrible. You can check the audio samples here: (https://wandb.ai/sjahk-/parler-speech/reports/Speech-samples-24-12-20-19-31-39---VmlldzoxMDY3NzI5Mw?accessToken=lmtsm2zj12qoc0nl8os0dgpdgyorvbufbgrqjnzfb1bqmfxmnak35cnxspoo6pgc)

Could you please guide me on the appropriate description and prompt tokenizer to use for fine-tuning in Hindi? Thanks in advance!

skjdhuhsnjd

Dec 21, 2024

Any help would mean a lot! I believe the issue might be with the prompt or description tokenizer.

AshwinSankar

AI4Bharat org Dec 21, 2024

Hi @skjdhuhsnjd ,

Please use flan-t5-large tokenizer as that is our description encoder as well. This model works pretty well for our use case as the descriptions are still in English, and FlanT5 is instruction tuned which means better representations even without training it.

AshwinSankar

AI4Bharat org Dec 22, 2024

For any clarification on which models where used, please look at the config: https://huggingface.co/ai4bharat/indic-parler-tts/blob/main/config.json

skjdhuhsnjd

Dec 22, 2024

•

edited Dec 22, 2024

Hi @AshwinSankar

First of all, thank you so much for your time. I’m really sorry to bother you, but as a beginner, your help means a lot to me. I was using this notebook:

https://colab.research.google.com/github/ylacombe/scripts_and_notebooks/blob/main/Finetuning_Parler_TTS_on_a_single_speaker_dataset.ipynb

to fine-tune the Indic Parler pretrained model.

I replaced the model path with "ai4bharat/indic-parler-tts-pretrained", the prompt and description tokenizer with "google/flan-t5-large", and the feature extractor with "ylacombe/dac_44khz".

However, I’m still encountering this error:
TypeError: dacmodel.encode() got an unexpected keyword argument 'bandwidth'

I’d be incredibly grateful if you could take some time from your busy schedule to guide me through this issue. Thank you so much in advance!

AshwinSankar

AI4Bharat org Dec 24, 2024

which version of transformers are you using?

skjdhuhsnjd

Dec 25, 2024

I'm using Google Colab with Transformers version 4.46.1.

Shanos76

Jan 3

This comment has been hidden

skjdhuhsnjd

Jan 3

Hi @AshwinSankar

I've tried my best, but I haven't been able to resolve the problem. Could you please take a look at it?

Thank you!

mishra999

Feb 6

Was anyone able to resolve this issue???

AshwinSankar

AI4Bharat org Feb 6

I will post a detailed tutorial notebook for doing this after Feb 20. Thank you for your patience

mishra999

Feb 6

i removed the bandwidth parameter as it was not being accepted by the model and the training started, however i am not sure if this is a correct approach:
def apply_audio_decoder(batch):
len_audio = batch.pop("len_audio")
audio_decoder.to(batch["input_values"].device).eval()
'''if bandwidth is not None:
batch["bandwidth"] = bandwidth'''
if "num_quantizers" in encoder_signature:
batch["num_quantizers"] = num_codebooks
elif "num_codebooks" in encoder_signature:
batch["num_codebooks"] = num_codebooks
elif "n_quantizers" in encoder_signature:
batch["n_quantizers"] = num_codebooks

        with torch.no_grad():
            labels = audio_decoder.encode(**batch)["audio_codes"]
        output = {}
        output["len_audio"] = len_audio
        # (1, bsz, codebooks, seq_len) -> (bsz, seq_len, codebooks)
        output["labels"] = labels.squeeze(0).transpose(1, 2)

        # if `pad_to_max_length`, the maximum corresponding audio length of the current batch is max_duration*sampling_rate
        max_length = len_audio.max() if padding != "max_length" else max_target_length
        output["ratio"] = torch.ones_like(len_audio) * labels.shape[-1] / max_length
        return output..

AshwinSankar

AI4Bharat org Feb 6

This is indeed correct correct. Instead you can check if "bandwidth" in encoder_signature like the rest of the if conditions too.

mishra999

Feb 6

Thanks for your help.

sivakgp

1 day ago

HI @AshwinSankar
I am trying to fine tune the model on ylacombe/jenny-tts-6h dataset. But the final output is very random, It does not have any relation with the input audio.
!accelerate launch ./training/run_parler_tts_training.py
--model_name_or_path "ai4bharat/indic-parler-tts"
--feature_extractor_name "ylacombe/dac_44khz"
--description_tokenizer_name "google/flan-t5-large"
--prompt_tokenizer_name "google/flan-t5-large" \

This is the code I used. Is this the way correct? please help.

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

Your need to confirm your account before you can post a new comment.

· Sign up or log in to comment