Gibberish readings

#1
by chlowden - opened

Hello,
I have been testing your model and I am getting a very strange result.

Context: A female voice delivers a slightly expressive and animated speech with a moderate speed. The recording features a high-pitch voice, creating a close-sounding audio experience. Very clear audio.

La certitude du hasard.

En 2007, Peter Backus était étudiant en macroéconomie à l'université de Londres, et il n'avait pas de petite amie.

Dans sa solitude, il se demandait comment calculer ses chances de rencontrer une petite amie potentielle un soir normal en ville.

En 1952, l'astrophysicien américain Frank Drake a résumé tous les facteurs nécessaires pour prédire la probabilité de détecter une vie extraterrestre intelligente dans la Voie lactée.

Backus a adapté la formule de Drake, échangeant la Voie lactée pour le Royaume-Uni. Il a supposé que la population nationale du Royaume-Uni était de 60 975 000 habitants.

51 % étaient des femmes.
13 % de ces femmes vivaient à Londres.

20 % d'entre elles avaient entre 24 et 34 ans.
À 26 % de cet âge, le groupe approprié avait fait des études universitaires.

Is the model limited to 30 seconds? Is there a reason why the voice turns to gibberish so quickly? Thank you for all your work and for sharing it.

Hello!

Well first, thank you for your interest in this model.

The 30-second limit comes from the model being trained on audio segments of at most 30 seconds, the same as the original Parler-TTS.
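One common way to work within that training window is to split long inputs into sentence-sized chunks and synthesize each chunk separately. Here is a minimal sketch; the `chunk_text` helper and the ~15 characters-per-second speaking-rate estimate are my own assumptions, not part of the model:

```python
import re

# Assumed speaking rate; tune for your voice and language.
CHARS_PER_SECOND = 15
MAX_SECONDS = 30
MAX_CHARS = CHARS_PER_SECOND * MAX_SECONDS  # ~450 chars per chunk

def chunk_text(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Greedily pack whole sentences into chunks under max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = (current + " " + sentence).strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed to the model on its own and the resulting audio segments concatenated.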

Now, concerning the audio turning to gibberish, here is what I think based on my testing:

  • It depends on the context phrase; some context phrases seem to work better, but they all fail at one point or another.
  • When it encounters numbers like 2007 or percentages like 51%, the model will often go off the rails. I think there are not enough examples with numbers in the training set.
  • The same goes for rare words.
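A common mitigation for the numbers issue is to normalize digits into words before synthesis, so the model only ever sees vocabulary it was trained on. Below is a rough sketch for French; `spell_fr` and `normalize_numbers` are hypothetical helpers I am inventing here, and this sketch only covers 0–99 plus percentages (years like 2007 and grouped numbers like 60 975 000 are left untouched; a real pipeline would use a full normalizer such as num2words with lang="fr"):

```python
import re

# French words for 0-16; 17-19 are built as "dix-sept" etc.
UNITS = ["zéro", "un", "deux", "trois", "quatre", "cinq", "six", "sept",
         "huit", "neuf", "dix", "onze", "douze", "treize", "quatorze",
         "quinze", "seize"]
TENS = {20: "vingt", 30: "trente", 40: "quarante", 50: "cinquante",
        60: "soixante"}

def spell_fr(n: int) -> str:
    """Spell out 0-99 in French, handling the 70/80/90 irregularities."""
    if n < 17:
        return UNITS[n]
    if n < 20:
        return "dix-" + UNITS[n - 10]
    if n < 70:
        tens, unit = divmod(n, 10)
        base = TENS[tens * 10]
        if unit == 0:
            return base
        if unit == 1:
            return base + " et un"  # e.g. "cinquante et un"
        return base + "-" + UNITS[unit]
    if n < 80:  # 70-79 counts from sixty: soixante-dix, soixante et onze...
        if n == 71:
            return "soixante et onze"
        return "soixante-" + spell_fr(n - 60)
    if n == 80:
        return "quatre-vingts"
    return "quatre-vingt-" + spell_fr(n - 80)  # 81-99

def normalize_numbers(text: str) -> str:
    """Replace standalone two-digit numbers and percentages with words."""
    def repl(m):
        words = spell_fr(int(m.group(1)))
        return words + " pour cent" if m.group(2) else words
    # (?!\d) keeps longer numbers (years, populations) out of this sketch.
    return re.sub(r"\b(\d{1,2})(?!\d)(\s*%)?", repl, text)
```

Running the user's problem sentence through this, "51 % étaient des femmes." becomes "cinquante et un pour cent étaient des femmes.", which stays inside the model's trained vocabulary.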

To sum up, I think the model is undertrained. I thought the 1,300 hours of the Emilia dataset would be enough given that it is just a fine-tune of the English one, but unfortunately I was wrong.
So right now, as it is, it's unusable.

I do plan to train one from scratch using the 50k hours of the MOSEL dataset, which should overcome the shortcomings of this first experiment, but compute might be an issue.

Have a great day!
