
Truncated audio and latency in speech generation

by tushar310

Hi Team,
I'm using the llama-cpp-python version of this model, in the Q4 variant. The results seem very different: the audio is 90%+ truncated, and at the same time it takes a good number of seconds to generate a single sentence. Should we conclude this won't suffice for the real-time use cases that the likes of Azure, ElevenLabs, and Google handle? If I am wrong, please suggest an appropriate implementation strategy.
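A concrete way to frame the latency question is the real-time factor (RTF): generation time divided by the duration of the audio produced. Below is a minimal sketch; `interface.generate` is a placeholder for whatever generation call your llama-cpp-python/OuteTTS setup already uses:

import time
import wave

start = time.perf_counter()
output = interface.generate(text="Hello, this is a latency test.")  # placeholder for your existing generation call
elapsed = time.perf_counter() - start
output.save("output.wav")

# Read the duration of the generated audio back from the saved WAV file.
with wave.open("output.wav", "rb") as f:
    audio_seconds = f.getnframes() / f.getframerate()

print(f"{elapsed:.2f}s to generate {audio_seconds:.2f}s of audio "
      f"(RTF = {elapsed / audio_seconds:.2f}; below 1.0 means faster than real time)")

Real-time services need an RTF well below 1.0, so this gives a direct point of comparison.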

OuteAI org

What do you mean by "audio is 90%+ truncated"? That sounds very unusual. What hardware specs are you running this model on?

Hey @edwko, I had a similar issue with the audio being truncated: it cuts off half a second at the beginning and half a second at the end of the audio clip.
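A quick way to tell whether the clipping is in the file itself or only happens during playback is to check the saved clip's duration directly. A minimal sketch using torchaudio, where "output.wav" stands in for whatever filename you saved:

import torchaudio

# Load the saved clip and report its length. If the duration covers the full
# generated sentence, the file is intact and the cut-off happens at playback.
waveform, sr = torchaudio.load("output.wav")
print(f"{waveform.shape[-1] / sr:.2f} seconds at {sr} Hz")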

OuteAI org

@rakker Are you playing the audio via output.play() or from the saved file? If you're playing it with .play(), it's probably related to this issue: https://github.com/edwko/OuteTTS/issues/45#issuecomment-2525099911. There might be some compatibility issues with the sounddevice library.

@edwko No, I am playing the saved file.

OuteAI org

@rakker Then it seems like a playback issue on your end. Try resampling the audio; maybe your device doesn't like the 24 kHz sample rate:

# generate audio ...

import torchaudio

# Resample from the model's native sample rate (output.sr) to 44.1 kHz,
# which playback devices are far more likely to support natively.
new_sr = 44100
resampler = torchaudio.transforms.Resample(orig_freq=output.sr, new_freq=new_sr).to(output.audio.device)
resampled_audio = resampler(output.audio)

# Store the resampled audio back on the output object and save it.
output.sr = new_sr
output.audio = resampled_audio
output.save("output.wav")
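As a follow-up check, you can also play the saved file through sounddevice explicitly rather than via output.play(). A minimal sketch, assuming the soundfile and sounddevice packages are installed:

import soundfile as sf
import sounddevice as sd

# Read the resampled file and play it at its own sample rate. sd.play() is
# non-blocking, so sd.wait() keeps the script alive until playback finishes;
# exiting early would cut the clip off at the end.
data, sr = sf.read("output.wav")
sd.play(data, sr)
sd.wait()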
