What is the minimum sample length for voice cloning?
I took a voice sample of about 45 seconds.
I extract the transcription with a Whisper model, lowercase it, remove all non-Latin characters, convert the audio to 16 kHz, and send it to inference together with the transcription.
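Here is roughly what that preprocessing looks like (a sketch; the Whisper model size, file names, and the exact character filter are placeholders from my script):

```python
import re

import librosa
import soundfile as sf
import whisper

# Transcribe the ~45 s reference clip (model size is a placeholder).
model = whisper.load_model("base")
text = model.transcribe("sample_45s.wav")["text"]

# Lowercase and keep only basic Latin characters, digits, and punctuation.
text = re.sub(r"[^a-z0-9 .,?!']", "", text.lower())

# Resample the reference audio to 16 kHz mono and write it back out.
audio, sr = librosa.load("sample_45s.wav", sr=16000, mono=True)
sf.write("sample_45s_16k.wav", audio, 16000)
```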
And I'm getting: RuntimeError: Calculated padded input size per channel: (6). Kernel size: (7). Kernel size can't be greater than actual input size
Is it my sample size, or am I doing something wrong?
I'm getting the same error when trying to make it speak, while the speaker creation itself seems to work. This looks like a bug in the speaker handling.
Could you share the full error message? It would also be helpful if you could share the audio sample along with the transcription.
As for the preprocessing:
"lowercase it, remove all non-Latin characters, convert the audio to 16 kHz"
You don't have to do this manually; the library takes care of it.
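Something along these lines should be enough (a sketch from memory; check the README for the exact argument names):

```python
from outetts.v0_1.interface import InterfaceHF

interface = InterfaceHF("OuteAI/OuteTTS-0.1-350M")

# Pass the raw reference audio and its transcription directly; normalization
# and resampling are handled inside the interface.
speaker = interface.create_speaker("sample_45s.wav", "transcription of the sample")

output = interface.generate(text="Hello, this is a voice cloning test.", speaker=speaker)
output.save("output.wav")
```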
/Users/samliu/Library/Python/3.11/lib/python/site-packages/torch/nn/utils/weight_norm.py:143: FutureWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
  WeightNorm.apply(module, name, dim)
making attention of type 'vanilla' with 768 in_channels
/Users/samliu/Library/Python/3.11/lib/python/site-packages/outetts/v0_1/decoder/pretrained.py:101: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  state_dict_raw = torch.load(model_path, map_location="cpu")['state_dict']
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
  - Avoid using tokenizers before the fork if possible
  - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Creating speaker model...
Trying generation with speaker...
Generation text length: 254 characters
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Setting pad_token_id to eos_token_id:None for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Traceback (most recent call last):
File "/Users/samliu/code/playground/voice_cloning/test_outetts.py", line 29, in
test_speaker_creation()
File "/Users/samliu/code/playground/voice_cloning/test_outetts.py", line 17, in test_speaker_creation
output = interface.generate(
^^^^^^^^^^^^^^^^^^^
File "/Users/samliu/Library/Python/3.11/lib/python/site-packages/outetts/v0_1/interface.py", line 102, in generate
audio = self.get_audio(output[input_ids.size()[-1]:])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/samliu/Library/Python/3.11/lib/python/site-packages/outetts/v0_1/interface.py", line 48, in get_audio
return self.audio_codec.decode(
^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/samliu/Library/Python/3.11/lib/python/site-packages/outetts/v0_1/audio_codec.py", line 68, in decode
audio_out = self.wavtokenizer.decode(features, bandwidth_id=torch.tensor([0]).to(self.device))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/samliu/Library/Python/3.11/lib/python/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/Users/samliu/Library/Python/3.11/lib/python/site-packages/outetts/v0_1/decoder/pretrained.py", line 205, in decode
x = self.backbone(features_input, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/samliu/Library/Python/3.11/lib/python/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/samliu/Library/Python/3.11/lib/python/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/samliu/Library/Python/3.11/lib/python/site-packages/outetts/v0_1/decoder/models.py", line 224, in forward
x = self.embed(x)
^^^^^^^^^^^^^
File "/Users/samliu/Library/Python/3.11/lib/python/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/samliu/Library/Python/3.11/lib/python/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/samliu/Library/Python/3.11/lib/python/site-packages/torch/nn/modules/conv.py", line 375, in forward
return self._conv_forward(input, self.weight, self.bias)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/samliu/Library/Python/3.11/lib/python/site-packages/torch/nn/modules/conv.py", line 370, in _conv_forward
return F.conv1d(
^^^^^^^^^
RuntimeError: Calculated padded input size per channel: (6). Kernel size: (7). Kernel size can't be greater than actual input size
This seems to be the same issue as GitHub Issue #16. I've posted an explanation here: Issue Comment.
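For anyone else hitting this, the RuntimeError itself just means the decoder backbone's Conv1d received a sequence shorter than its kernel, which typically points to the model having generated too few audio tokens before decoding. A minimal plain-PyTorch illustration (not OuteTTS code; the channel count and layer setup are chosen only to match the numbers in the message):

```python
import torch
import torch.nn as nn

# A Conv1d with kernel_size=7 cannot slide over a (padded) sequence of only
# 6 time steps, which raises exactly the error from the traceback above.
conv = nn.Conv1d(in_channels=768, out_channels=768, kernel_size=7)

conv(torch.randn(1, 768, 16))  # 16 frames: works
conv(torch.randn(1, 768, 6))   # 6 frames: RuntimeError: Calculated padded input
                               # size per channel: (6). Kernel size: (7). Kernel
                               # size can't be greater than actual input size
```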