---
license: mit
pipeline_tag: feature-extraction
tags:
- bark
- tts
- hubert
- text-to-speech
datasets:
- GitMylo/bark-semantic-training
---
# Bark-voice-cloning
Bark-voice-cloning is a model that processes the output features of a HuBERT model and turns them into semantic tokens compatible with Bark text-to-speech.
This enables, among other things, speech transfer and voice cloning.

* [Code repo](https://github.com/gitmylo/bark-voice-cloning-HuBERT-quantizer)
* [Audio webui](https://github.com/gitmylo/audio-webui)
# The models in this repo
* [quantifier_hubert_base_ls960.pth](https://huggingface.co/GitMylo/bark-voice-cloning/blob/main/quantifier_hubert_base_ls960.pth) (the model trained on literature for 4 epochs)
* [quantifier_hubert_base_ls960_14.pth](https://huggingface.co/GitMylo/bark-voice-cloning/blob/main/quantifier_hubert_base_ls960_14.pth) (the model trained on literature for 10 more epochs, based on the previous)
* [quantifier_V1_hubert_base_ls960_23.pth](https://huggingface.co/GitMylo/bark-voice-cloning/blob/main/quantifier_V1_hubert_base_ls960_23.pth) (a larger model, trained for more epochs on the same dataset)
(For easy downloading, please use the model manager from the [code repo](https://github.com/gitmylo/bark-voice-cloning-HuBERT-quantizer).)
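Alternatively, the checkpoints can be fetched directly with `huggingface_hub` (a minimal sketch; pick any of the filenames listed above):

```python
from huggingface_hub import hf_hub_download

# Download one of the quantizer checkpoints from this repo.
model_path = hf_hub_download(
    repo_id='GitMylo/bark-voice-cloning',
    filename='quantifier_hubert_base_ls960_14.pth',
)
print(model_path)  # local path to the cached checkpoint
```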
# Voice cloning
Voice cloning creates a speaker file (a new voice) for Bark text-to-speech from an existing audio clip.
Process (a minimal end-to-end sketch follows the list):
1. Load your wav audio file into your pytorch application.
2. For the fine prompt, extract [discrete representations](https://github.com/facebookresearch/encodec#extracting-discrete-representations) with EnCodec. (Bark uses these to reproduce the voice.) **Make sure to `.squeeze()` the resulting codes.**
3. For the coarse prompt, take the first two codebooks of the fine prompt: `fine_prompt[:2, :]`.
4. For the semantics, run the audio through a HuBERT model without Kmeans (I personally use the [audiolm-pytorch](https://github.com/lucidrains/audiolm-pytorch) implementation's HubertWithKmeans, edited to skip the Kmeans step).
5. To get the actual semantic tokens, run the HuBERT features through this model. The output is compatible with Bark.
6. Save the three arrays in an npz with `numpy.savez(filepath, semantic_prompt=semantics, fine_prompt=fine, coarse_prompt=coarse)`. This is your speaker file containing the cloned voice.
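The sketch below walks through the whole pipeline. The EnCodec calls follow the official `encodec` package; the `CustomHubert` and `CustomTokenizer` classes and the checkpoint paths are assumptions based on the [code repo](https://github.com/gitmylo/bark-voice-cloning-HuBERT-quantizer), so check that repo for the exact loading code.

```python
import numpy as np
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

# Assumed imports from the code repo (names may differ; see that repo):
from hubert.pre_kmeans_hubert import CustomHubert
from hubert.customtokenizer import CustomTokenizer

wav, sr = torchaudio.load('speaker.wav')  # 1. load the audio clip

# 2./3. Fine and coarse prompts via EnCodec (24 kHz, 6 kbps -> 8 codebooks).
encodec = EncodecModel.encodec_model_24khz()
encodec.set_target_bandwidth(6.0)
enc_wav = convert_audio(wav, sr, encodec.sample_rate, encodec.channels).unsqueeze(0)
with torch.no_grad():
    frames = encodec.encode(enc_wav)
codes = torch.cat([frame[0] for frame in frames], dim=-1)
fine_prompt = codes.squeeze().cpu().numpy()   # don't forget the squeeze
coarse_prompt = fine_prompt[:2, :]            # first two codebooks

# 4./5. Semantic tokens: HuBERT features (no Kmeans) through this quantizer.
hubert = CustomHubert(checkpoint_path='hubert.pt')  # assumed path
tokenizer = CustomTokenizer.load_from_checkpoint('quantifier_hubert_base_ls960_14.pth')
semantic_vectors = hubert.forward(wav, input_sample_hz=sr)
semantic_prompt = tokenizer.get_token(semantic_vectors).cpu().numpy()

# 6. Save the speaker file.
np.savez('speaker.npz', semantic_prompt=semantic_prompt,
         fine_prompt=fine_prompt, coarse_prompt=coarse_prompt)
```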
# Voice masking
Voice masking is replacing a voice in an audio clip for speech-to-speech.
## Random
Replacing a voice in an audio clip with a voice generated by bark.
Process (see the sketch after this list):
1. Extract semantics from the audio clip using HuBERT and this model.
2. Run `semantic_to_waveform` from `bark.api` with the extracted semantics.
3. The previous step returns the generated audio.
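A minimal sketch, assuming `semantic_tokens` was produced as in the voice cloning section (the variable names are placeholders):

```python
from bark.api import semantic_to_waveform

# No history prompt: Bark generates a random voice for these semantics.
audio_array = semantic_to_waveform(semantic_tokens)
```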
## Transfer
Replacing a voice with a voice from another audio clip.
Process (see the sketch after this list):
1. Create a speaker file using the steps under the voice cloning section.
2. Extract the semantics from the clip containing the speech you want spoken.
3. Run `semantic_to_waveform` from `bark.api` with the extracted semantics and the speaker file you created in step 1 as the history prompt.
4. The previous step returns the generated audio.
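A minimal sketch, assuming `semantic_tokens` comes from the source clip and `speaker.npz` is the speaker file from step 1 (whether `history_prompt` accepts an `.npz` path directly may depend on your Bark version; loading the file and passing the arrays also works):

```python
from bark.api import semantic_to_waveform

# Generate the semantics with the cloned voice as the history prompt.
audio_array = semantic_to_waveform(semantic_tokens, history_prompt='speaker.npz')
```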
# Disclaimer
I am not responsible for any misuse of this model. I do not condone cloning people's voices without permission. Please make sure it is appropriate to clone someone's voice before doing so.