hi!
hello! i wanted to contact you to ask some questions about how you created this model, but couldn't find a way to reach you directly. i hope using the community tab is an alright way to do so.
what data did you use when training this model? where/how did you train it? how much data did you use? did you directly take audio from videos, and if so, how did you clean up the audio?
any answer would be helpful, this model is so amazing, it's insane.
thanks so much!
Hi!
So sorry I didn't notice your message earlier, summer's over and I've got classes now lol.
The data was, I think, a couple of Hermitcraft episodes. I chose the ones that featured the fewest other people. I ran the audio through UVR (Voc FT), cut all the sections where he talks to other people, and also cut the parts where UVR couldn't remove in-game sounds like block breaking or animals. Silence was truncated to bring the dataset down to around 30 minutes. Finally, I trained on Google Colab (back when it was still easy to train there), using crepe for pitch extraction, for just 150 epochs.
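For anyone trying to reproduce the silence-truncation step, here's a rough sketch in plain Python. It is not the tool that was actually used (the post doesn't say which one that was); the function name `truncate_silence`, the RMS threshold, and the frame and gap lengths are all illustrative assumptions, and it expects mono samples as floats in the range -1.0 to 1.0.

```python
import math


def truncate_silence(samples, sample_rate, threshold=0.01,
                     frame_ms=20, max_silence_ms=200):
    """Shorten long silent stretches in a mono signal.

    Hypothetical helper: every parameter default here is an assumption,
    not a value taken from the original training setup.
    """
    # Number of samples per analysis frame (at least one sample).
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    # How many consecutive silent frames are kept before dropping the rest.
    max_silent_frames = int(max_silence_ms // frame_ms)

    out = []
    silent_run = 0
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        # Root-mean-square level of this frame.
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        if rms < threshold:
            silent_run += 1
            if silent_run > max_silent_frames:
                # Already kept enough silence for this gap; skip the frame.
                continue
        else:
            silent_run = 0
        out.extend(frame)
    return out
```

Pauses shorter than `max_silence_ms` pass through untouched, so speech keeps its natural rhythm while long gaps stop inflating the dataset. A real pipeline would also need to read and write the audio files, which this sketch leaves out.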
In general, expressive voices turn out best, and he is pretty expressive in his episodes.
And that's about it!
thank you so much for getting back to me!