Upgrading Kokoro: natural TTS for short bursts
Kokoro just got an upgrade that substantially improves TTS naturalness for short bursts while maintaining parity for longer utterances.
Before, when you asked Kokoro to say "Hi there!" using the voice Sarah (af_sarah), you'd get this:
The output audio had unnatural breathiness, and that was with default post-processing: (1) both ends trimmed and (2) noise reduction using noisereduce.
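For readers unfamiliar with end trimming, here is a minimal sketch of what step (1) could look like. It is an illustration only, not Kokoro's actual post-processing code: the function name trim_silence, the amplitude threshold, and the sample values are all assumptions made for this example, and the noise-reduction step is not shown.

```python
def trim_silence(samples, threshold=0.01):
    """Drop near-silent samples from both ends of a mono signal.

    Hypothetical helper for illustration; `threshold` is an assumed
    amplitude cutoff, not a value taken from Kokoro.
    """
    start = 0
    end = len(samples)
    # Advance past leading samples at or below the threshold.
    while start < end and abs(samples[start]) <= threshold:
        start += 1
    # Walk back past trailing samples at or below the threshold.
    while end > start and abs(samples[end - 1]) <= threshold:
        end -= 1
    # Quiet samples in the middle of the signal are left untouched.
    return samples[start:end]


# Made-up sample values in the range [-1.0, 1.0].
audio = [0.0, 0.001, 0.2, -0.3, 0.0, 0.25, 0.002, 0.0]
trimmed = trim_silence(audio)
```

Only the edges are affected, so pauses inside the utterance survive. Trimming like this shortens the clip but does nothing about breathiness within the speech itself.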
Now, the same voice sounds like this on the same text:
This is much better. Also, we no longer import noisereduce because the output sounds similar with or without it.
Let's check parity for long utterances. The model was already fairly good at these, so the bar here is simply no regression.
This morning, The Information published an article titled "A Complex New Age of Face Tech". The first sentence reads: "In September, Instagram unveiled a splashy new feature called Teen Accounts, an effort by Meta Platforms, the app’s owner, to show it’s better protecting young people with stricter privacy and safety settings."
Before:
After:
More or less the same. To nitpick, the Before version overemphasizes "Tech", and the After version inflects upward on "September" when it probably should have stayed flatter.
Kokoro wasn't perfect before and still isn't perfect now, but this represents a nontrivial step in the right direction.
You can check out Kokoro at https://hf.co/spaces/hexgrad/Kokoro-TTS