Speaker ids - genders and accents

by jordimas - opened Sep 27, 2024

Sep 27, 2024

Hello,

Is there any documentation available for the supported speaker IDs?

I guess that every speaker ID corresponds to a combination of gender and language variant, but I have not been able to find a list.

Thanks

Jordi

albertcanig

Projecte Aina org Sep 27, 2024

Hi Jordi,

As a starting point, can you check whether the info here helps? https://github.com/langtech-bsc/Matcha-TTS/commit/9aebd8be5f4c674a06ec3210a8d8840d96b21d35

In any case, that should be added to the documentation. Thanks for pointing it out.

Albert

AlexK-PL

Projecte Aina org Sep 27, 2024

Hello Jordi,

we have a couple of matxa versions: one trained only on the central Catalan accent (@albertcanig's info refers to the speaker IDs of this model), and the multiaccent version, which was indeed trained with a total of 8 speakers (1 female and 1 male per accent). Here is the info:

{
  "balear": {
    "quim": 0,
    "olga": 1
  },
  "central": {
    "grau": 2,
    "elia": 3
  },
  "nord-occidental": {
    "pere": 4,
    "emma": 5
  },
  "valencia": {
    "lluc": 6,
    "gina": 7
  }
}
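For scripting, the mapping above can be wrapped in a small lookup helper. This is a minimal sketch: the names `SPEAKERS`, `speaker_id` and `describe` are hypothetical and not part of the matxa code, and the table only covers the multiaccent model.

```python
# Speaker IDs of the multiaccent matxa model, copied from the mapping above.
SPEAKERS = {
    "balear": {"quim": 0, "olga": 1},
    "central": {"grau": 2, "elia": 3},
    "nord-occidental": {"pere": 4, "emma": 5},
    "valencia": {"lluc": 6, "gina": 7},
}


def speaker_id(accent: str, name: str) -> int:
    """Return the numeric speaker ID for an accent and voice name."""
    try:
        return SPEAKERS[accent][name]
    except KeyError:
        raise ValueError(f"unknown accent/speaker: {accent}/{name}") from None


def describe(spk_id: int) -> tuple[str, str]:
    """Reverse lookup: numeric speaker ID -> (accent, name)."""
    for accent, voices in SPEAKERS.items():
        for name, sid in voices.items():
            if sid == spk_id:
                return accent, name
    raise ValueError(f"speaker ID {spk_id} is not in the multiaccent mapping")


print(speaker_id("central", "grau"))  # 2
print(describe(7))  # ('valencia', 'gina')
```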

We will definitely add this info to the model card. Thanks!

Alex

Baybars

Projecte Aina org Sep 27, 2024

Hi Jordi,

As Alex says, we need to add this to the documentation clearly. But in case you run into more issues, we use the code of our HF Spaces as the reference for these things. In this case, for example, the code points to a JSON file with the speaker mapping, which is basically the info Alex just gave.

Sorry for the inconvenience, hope the example code helps until we introduce the relevant information in the model card.

Best

jordimas

Sep 30, 2024

edited Sep 30, 2024

Thanks for the documentation!

Here are a few things I have observed while following the README.md instructions in this repo:

1) Running "matcha_vocos_inference.py --speaker_id 2" produces a woman's voice for me, but if I understand correctly, speaker ID 2 should be grau (central).

2) matcha_vocos_inference.py uses speaker ID 20 by default:

parser.add_argument('--speaker_id', type=int, default=20, help='Speaker ID')

which is not in the mapping.

3) The inference code for the TTS method in https://huggingface.co/spaces/projecte-aina/matxa-alvocat-tts-ca/blob/main/infer_onnx.py seems to be different from the one in matcha_vocos_inference.py.
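One way to catch point 2) early would be to let argparse reject IDs outside the mapping. The sketch below is hypothetical, not the script's actual code: it assumes the script loads the multiaccent model, whose IDs are 0 to 7, so a different model would need a different range. It also makes `--speaker_id` mandatory rather than guessing a new default.

```python
import argparse

# Multiaccent model: 8 speakers, IDs 0-7 (from the mapping posted in this thread).
VALID_SPEAKER_IDS = range(8)


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="matxa inference (sketch)")
    # Hypothetical change: reject IDs outside the mapping instead of
    # silently falling back to a default such as 20.
    parser.add_argument(
        "--speaker_id",
        type=int,
        choices=VALID_SPEAKER_IDS,
        required=True,
        help="Speaker ID (0-7 for the multiaccent model)",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"Using speaker {args.speaker_id}")
```

With this parser, `--speaker_id 20` makes argparse exit with an "invalid choice" error instead of running inference with an unexpected voice.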

Really, 1) is the one I want to fix; I'm mentioning 2) and 3) in case it helps.

Thanks

Jordi

wetdog

Projecte Aina org Sep 30, 2024

edited Sep 30, 2024

Hi Jordi, thanks for the feedback.

By default, the script downloads the multispeaker version of the model from HF (https://github.com/langtech-bsc/Matcha-TTS/blob/dev-cat/matcha_vocos_inference.py#L128); you can change that to point to the multiaccent model handle "projecte-aina/matxa-tts-cat-multiaccent".
The inference code in the Space is different because it uses an ONNX version of the model. It also has an additional denoising step, because inference uses fewer steps (10) to generate the mel spectrograms.

It's important to match each speaker_id with its corresponding text cleaner. Currently, the script uses a central Catalan phonemizer, but we plan to update the code with specific cleaners for each accent. We'll provide more information once these changes are implemented.
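Matching a speaker ID to its cleaner boils down to going from ID to accent and from accent to cleaner. This is a minimal sketch under stated assumptions: the accent table comes from the mapping in this thread, while the cleaner names are hypothetical placeholders, not the identifiers used in the repository.

```python
# Accent of each multiaccent speaker ID (from the mapping in this thread).
ACCENT_BY_SPEAKER = {
    0: "balear", 1: "balear",
    2: "central", 3: "central",
    4: "nord-occidental", 5: "nord-occidental",
    6: "valencia", 7: "valencia",
}


def cleaner_for(speaker_id: int, cleaners_by_accent: dict[str, str]) -> str:
    """Return the text cleaner that matches the accent of `speaker_id`.

    `cleaners_by_accent` maps accent name -> cleaner name; the real names
    have to be taken from the repository.
    """
    if speaker_id not in ACCENT_BY_SPEAKER:
        raise ValueError(f"speaker ID {speaker_id} is not a multiaccent speaker")
    return cleaners_by_accent[ACCENT_BY_SPEAKER[speaker_id]]


# Hypothetical placeholder names, NOT the repository's real cleaner identifiers.
EXAMPLE_CLEANERS = {
    "balear": "balear_cleaner_placeholder",
    "central": "central_cleaner_placeholder",
    "nord-occidental": "nord_occidental_cleaner_placeholder",
    "valencia": "valencia_cleaner_placeholder",
}

print(cleaner_for(6, EXAMPLE_CLEANERS))  # valencia_cleaner_placeholder
```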

jordimas

Sep 30, 2024

Thanks, I will wait until you update matcha_vocos_inference.py.

If you plan for people to use this as a CLI tool, it would also be good to include it here:
https://github.com/langtech-bsc/Matcha-TTS/blob/dev-cat/setup.py#L40

Then, when you run "pip install", it gets installed and you can call it as a command, like "matcha_vocos_inference".
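A console_scripts entry point is the usual setuptools mechanism for this. The excerpt below is a hypothetical sketch: it assumes matcha_vocos_inference.py is shipped as a top-level module and exposes a main() function, which may require refactoring the script; the existing arguments of setup.py are left out.

```python
# Hypothetical setup.py excerpt: only the extra console script is shown.
# Assumes matcha_vocos_inference.py is a top-level module with a main() function.
CONSOLE_SCRIPTS = [
    # ... the existing matcha console scripts stay here ...
    "matcha_vocos_inference=matcha_vocos_inference:main",
]

if __name__ == "__main__":
    from setuptools import setup

    setup(
        # ... existing arguments (name, packages, install_requires, ...) ...
        py_modules=["matcha_vocos_inference"],
        entry_points={"console_scripts": CONSOLE_SCRIPTS},
    )
```

After "pip install .", pip generates a "matcha_vocos_inference" executable that imports the module and calls main().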

Jarbas

Sep 30, 2024

I have this packaged as an OVOS plugin here: https://github.com/OpenVoiceOS/ovos-tts-plugin-matxa-multispeaker-cat

In case it is useful as a reference.

wetdog

Projecte Aina org Oct 1, 2024

Thanks Casimiro, we've added the OVOS plugin instructions to the ONNX section of the model card.

Jordi, we've also updated the repository with the text cleaners. The only differences between the OVOS plugin and the PyTorch inference scripts are:

- in the OVOS plugin, n_timesteps is fixed at 10 due to the ONNX export;
- the OVOS plugin doesn't require PyTorch, which is an advantage for deployments.

jordimas

Oct 3, 2024

edited Oct 3, 2024

Thanks so much for the fixes and the additional documentation.

I'm focusing on the inference scripts as described here:

https://huggingface.co/projecte-aina/matxa-tts-cat-multiaccent

When I do:

python3 matcha_vocos_inference.py --output_path=output/ --text_input="Bon dia Manel, avui anem a la muntanya." --speaker_id 2

I expected grau's male voice, but instead I get a woman's voice.

If I do:

python3 matcha_vocos_inference.py --output_path=output/ --text_input="Bon dia Manel, avui anem a la muntanya." --speaker_id 3

I expected elia's female voice, but instead I get a man's voice.

Do you know what the problem may be?

Thanks

wetdog

Projecte Aina org Oct 4, 2024

Hi Jordi,

This happens because the script loads the multispeaker model by default. I've changed it to the multiaccent model, but left the reference to the multispeaker model handle in HF.
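To check the fix, one can render the same sentence with every speaker ID and listen to each result. This is a minimal sketch that assumes it is run from the repository root, that the script now loads the multiaccent model, and that it accepts the flags used earlier in this thread; the per-speaker output folders are an arbitrary choice to keep the files apart.

```python
import subprocess
import sys
from pathlib import Path

TEXT = "Bon dia Manel, avui anem a la muntanya."


def inference_command(speaker_id: int, output_dir: Path, text: str = TEXT) -> list[str]:
    """Build the CLI call used in this thread for one speaker ID."""
    return [
        sys.executable,
        "matcha_vocos_inference.py",
        f"--output_path={output_dir.as_posix()}/",
        f"--text_input={text}",
        "--speaker_id",
        str(speaker_id),
    ]


def render_all(base_dir: Path = Path("output")) -> None:
    # One subdirectory per speaker so the outputs do not overwrite each other.
    for speaker_id in range(8):
        out = base_dir / f"speaker_{speaker_id}"
        out.mkdir(parents=True, exist_ok=True)
        subprocess.run(inference_command(speaker_id, out), check=True)


if __name__ == "__main__":
    render_all()
```

With the multiaccent mapping, speaker_2 should sound like grau (male) and speaker_3 like elia (female).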

Best,
