Audio prompt instead of text prompt

#3
by jameshuntercarter - opened

How trivial (or difficult?) would it be to use an audio file as an input instead of text? I would love to be able to use an existing sound effect, like a lightsaber, as the prompt and then "perform it" with a second prompt (as in the current workflow), then generate an output that retains the "likeness" of the first prompt, but with the cadence/performance of the second.

I imagine this must be possible. If you can point me in the right direction I may be able to contribute it. Otherwise, I'd love to see this added!

Hi bro,

Thank you for your interest and idea!

This could be achieved with strategies used in image generation: https://arxiv.org/pdf/2305.14720.

However, it requires special paired data of this form:
Condition: Audio prompt + text prompt
Target: Audio
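
As a rough illustration, one paired example in that format could be represented as in the sketch below. The class name, field names, the `make_example` helper, and the use of plain float lists for waveforms are all assumptions made for illustration; none of them come from this repo's code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PairedExample:
    """One condition/target pair (hypothetical structure, for illustration only)."""
    audio_prompt: List[float]  # condition: reference sound, as raw samples
    text_prompt: str           # condition: text description
    target_audio: List[float]  # target: the audio the model should produce


def make_example(audio_prompt: List[float], text_prompt: str,
                 target_audio: List[float]) -> PairedExample:
    """Build a pair, rejecting empty conditions or targets."""
    if not audio_prompt or not target_audio:
        raise ValueError("audio prompt and target audio must be non-empty")
    if not text_prompt.strip():
        raise ValueError("text prompt must be non-empty")
    return PairedExample(list(audio_prompt), text_prompt, list(target_audio))
```

A dataset for this idea would then be a collection of such pairs, where each target is expected to keep the character of its audio prompt.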

@OpenSound Can this be done with new inference code, using the existing models, or would it require training new models?
