Tensor size error when generating audio incrementally

#16 by severos

I am generating audio incrementally in 2-second steps:

  1. Generate 2 seconds of audio from the text condition "80s pop synth guitars and heavy drums"
  2. Write the result to a file
  3. Read the previous file back
  4. Generate 2 more seconds using the text condition "80s pop synth guitars and heavy drums" and the previous audio file
  5. Repeat from step 2

So far, so good: everything works.
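
For reference, this is roughly the loop, shown here with the facebook/musicgen-small checkpoint as an example (the checkpoint name, file name, and iteration count are placeholders):

import scipy.io.wavfile
from transformers import AutoProcessor, MusicgenForConditionalGeneration

processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")
sampling_rate = model.config.audio_encoder.sampling_rate

text = "80s pop synth guitars and heavy drums"

# Step 1: generate the first ~2 seconds from the text condition alone.
inputs = processor(text=[text], padding=True, return_tensors="pt")
audio_values = model.generate(**inputs, max_new_tokens=128)

for _ in range(5):
    # Step 2: write the result to a file.
    scipy.io.wavfile.write("out.wav", rate=sampling_rate, data=audio_values[0, 0].numpy())

    # Step 3: read the previous file back.
    rate, audio = scipy.io.wavfile.read("out.wav")

    # Step 4: generate again from the same text condition plus the previous audio.
    inputs = processor(text=[text], audio=audio, sampling_rate=rate, padding=True, return_tensors="pt")
    audio_values = model.generate(**inputs, max_new_tokens=128)
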
However, when the text condition is changed in step 4 (e.g. to "80s pop track with bassy drums and synth" or "90s rock song with loud guitars and heavy drums"), a runtime error is thrown:

Traceback (most recent call last):
  File "main.py", line 32, in <module>
    audio_values = model.generate(**inputs, max_new_tokens=128)
  File "C:\Python38\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "C:\Python38\lib\site-packages\transformers\models\musicgen\modeling_musicgen.py", line 2279, in generate
    input_ids, model_kwargs = self._prepare_decoder_input_ids_for_generation(
  File "C:\Python38\lib\site-packages\transformers\models\musicgen\modeling_musicgen.py", line 1993, in _prepare_decoder_input_ids_for_generation
    decoder_input_ids = torch.cat([decoder_input_ids_start, decoder_input_ids], dim=-1)
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 8 but got size 4 for tensor number 1 in the list.
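
The only change from the working loop is the text string passed in step 4 (same variable names as in the sketch above), e.g.:

# Step 4 with a different text condition; this is the call that raises the RuntimeError above.
inputs = processor(
    text=["90s rock song with loud guitars and heavy drums"],
    audio=audio,
    sampling_rate=rate,
    padding=True,
    return_tensors="pt",
)
audio_values = model.generate(**inputs, max_new_tokens=128)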

I expected this to work, since the model should treat each iteration starting from step 3 as a fresh, separate prompt. Am I misunderstanding something here?
