parler-tts/parler-tts-mini-expresso · problem running this model and other parler models...

I've spent a lot of time trying to get it to work properly, whether it's the "mini," "large," or "expresso" model, the last of which this comment is specifically about.

For some reason I can't get past this error:

`prompt_attention_mask` is specified but `attention_mask` is not. A full `attention_mask` will be created. Make sure this is the intended behaviour.

Moreover, the script I'm pasting below only plays the first three sentences of the "prompt", and if I change the "description" at all, it sometimes plays only the first two. I've tried adding "max_tokens" to address this specifically, with no success.

@sanchit-gandhi or anyone, can you please help? I'd love to incorporate this into my program, but it's proving so difficult. Here's the current version of my script:

import torch
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer, set_seed
import sounddevice as sd
import time

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model_path = r"D:\Scripts\benchmark_tts\parler-tts-mini-expresso"
attn_implementation = "sdpa" # "eager", "sdpa" or "flash_attention_2"

model = ParlerTTSForConditionalGeneration.from_pretrained(model_path, attn_implementation=attn_implementation).to(device)
# model = ParlerTTSForConditionalGeneration.from_pretrained(model_path, torch_dtype=torch.float16).to(device)

tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = model.config.eos_token_id

prompt = "This script processes a body of text one sentence at a time and plays them consecutively. This enables the audio playback to begin sooner instead of waiting for the entire body of text to be processed. The script uses the threading and queue modules that are part of the standard Python library. It also uses the sound device library, which is fairly reliable across different platforms. I hope you enjoy, and feel free to modify or distribute at your pleasure."
description = "Thomas speaks with a slightly low-pitched voice delivers his words quite expressively, in a very confined sounding environment with clear audio quality."

tokenized_input = tokenizer(description, 
                            return_tensors="pt", 
                            padding=True, 
                            truncation=True, 
                            return_attention_mask=True)

input_ids = tokenized_input.input_ids.to(device)
attention_mask = tokenized_input.attention_mask.to(device)

tokenized_prompt = tokenizer(prompt, 
                             return_tensors="pt", 
                             padding=True, 
                             truncation=True, 
                             return_attention_mask=True)

prompt_input_ids = tokenized_prompt.input_ids.to(device)
prompt_attention_mask = tokenized_prompt.attention_mask.to(device)

set_seed(42)
start_time = time.time()

generation = model.generate(input_ids=input_ids, 
                            attention_mask=attention_mask,
                            prompt_input_ids=prompt_input_ids,
                            prompt_attention_mask=prompt_attention_mask)

audio_arr = generation.cpu().numpy().squeeze()
# Convert audio array to float32 for playback compatibility
audio_arr = audio_arr.astype('float32')
end_time = time.time()
processing_time = end_time - start_time
print(f"\033[92mProcessing time: {processing_time:.2f} seconds\033[0m")
sampling_rate = model.config.sampling_rate
sd.play(audio_arr, samplerate=sampling_rate)
sd.wait()
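
One workaround worth trying for the cut-off, assuming it comes from a limit on how much audio a single `generate` call produces (I haven't confirmed that), is to split the prompt into sentences, synthesize each one separately, and join the audio. Below is a minimal sketch of just the splitting and joining. The names `split_sentences` and `synthesize_long` are my own, and `synthesize` is a hypothetical callable that would wrap the tokenizer and `model.generate` steps from the script above for one sentence and return a 1-D audio array.

```python
import re
from typing import Callable, List

import numpy as np


def split_sentences(text: str) -> List[str]:
    """Split text on whitespace that follows '.', '!' or '?'.

    This is a naive splitter: abbreviations such as "e.g." will be
    treated as sentence endings.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def synthesize_long(
    text: str,
    synthesize: Callable[[str], np.ndarray],  # hypothetical per-sentence TTS call
    sampling_rate: int,
    pause_s: float = 0.2,
) -> np.ndarray:
    """Synthesize each sentence separately and concatenate the audio.

    A short block of silence is appended after every sentence so the
    joins do not sound abrupt.
    """
    silence = np.zeros(int(sampling_rate * pause_s), dtype=np.float32)
    chunks = []
    for sentence in split_sentences(text):
        audio = np.asarray(synthesize(sentence), dtype=np.float32).reshape(-1)
        chunks.append(audio)
        chunks.append(silence)
    if not chunks:
        return np.zeros(0, dtype=np.float32)
    return np.concatenate(chunks)
```

In the script above, `synthesize` would tokenize one sentence as `prompt_input_ids`, call `model.generate` with the same description inputs, and return `generation.cpu().numpy().squeeze()`. The combined array could then be passed to `sd.play` with `model.config.sampling_rate`. Whether the voice stays consistent from sentence to sentence is something I'd have to test.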