OpenSound/EzAudio-ControlNet · Audio prompt instead of text prompt

Feb 21

How trivial (or difficult?) would it be to use an audio file as an input instead of text? I would love to be able to use an existing sound effect, like a lightsaber, as the prompt and then "perform it" with a second prompt (as in the current workflow), then generate an output that retains the "likeness" of the first prompt, but with the cadence/performance of the second.

I imagine this must be possible. If you can point me in the right direction I may be able to contribute it. Otherwise, I'd love to see this added!

OpenSound

Owner Mar 7

edited Mar 7

Hi bro,

Thank you for your interest and idea!

This could be achieved with strategies used in image generation (https://arxiv.org/pdf/2305.14720).

However, it requires special paired data:
Condition: audio prompt + text prompt
Target: audio
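
To make the data requirement above concrete, here is a minimal sketch of what one such paired example might look like. All names here (`PairedExample`, `split_condition_target`, `pad_batch`) are hypothetical and are not part of the EzAudio codebase; audio is represented as plain lists of samples purely for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple, Union


@dataclass
class PairedExample:
    """One hypothetical paired example (illustrative names, not EzAudio's API)."""
    audio_prompt: List[float]   # reference sound whose "likeness" should be kept
    text_prompt: str            # text description that accompanies the reference
    target_audio: List[float]   # audio the model would be asked to produce


def split_condition_target(
    example: PairedExample,
) -> Tuple[Dict[str, Union[List[float], str]], List[float]]:
    """Separate the conditioning inputs from the target audio."""
    condition = {
        "audio_prompt": example.audio_prompt,
        "text_prompt": example.text_prompt,
    }
    return condition, example.target_audio


def pad_batch(clips: Sequence[List[float]]) -> List[List[float]]:
    """Zero-pad mono clips to the length of the longest clip in the batch."""
    max_len = max(len(clip) for clip in clips)
    return [list(clip) + [0.0] * (max_len - len(clip)) for clip in clips]


example = PairedExample(
    audio_prompt=[0.1, -0.2, 0.3],
    text_prompt="a lightsaber swing",
    target_audio=[0.0, 0.5, -0.5, 0.25],
)
condition, target = split_condition_target(example)
batch = pad_batch([example.audio_prompt, example.target_audio])
```

The point of the sketch is only that every example needs all three pieces at once: a reference audio clip, a text prompt, and the audio that should result from combining them.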

jameshuntercarter

Mar 7

@OpenSound Can this be done with new inference code, using the existing models, or would it require training new models?