Runtime is about 2x slower than with Meta's own audiocraft code

#8
by lemonflourorange - opened

The local Gradio demo provided by Meta here, which uses the non-HF weights, is about 2x faster than inference with the HF weights. Worse, bitsandbytes quantization results in 3-4x slower inference when it should be faster. Looks like the Transformers implementation still needs some work.

Hi @lemonflourorange - Sorry to hear that. Can you please share the inference code you are using?

Hey @lemonflourorange! Thanks for opening this issue. Note that Meta's implementation uses fp16 by default, whereas transformers uses fp32. You can put the model in fp16 precision by calling:

```python
model.half()
```

This should give you a nice speed-up versus full fp32 precision.
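Something along these lines should work end-to-end (a minimal sketch, assuming the facebook/musicgen-small checkpoint and a CUDA GPU; you can also load directly in fp16 instead of calling `model.half()` afterwards):

```python
import torch
from transformers import AutoProcessor, MusicgenForConditionalGeneration

# Load the processor and the model directly in fp16
# (equivalent to loading in fp32 and then calling model.half())
processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained(
    "facebook/musicgen-small", torch_dtype=torch.float16
).to("cuda")

# Generate roughly 10 seconds of audio from a text prompt
inputs = processor(text=["lo-fi beat with mellow piano"], return_tensors="pt").to("cuda")
audio_values = model.generate(**inputs, max_new_tokens=512)
```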

Note that bitsandbytes quantisation is expected to be slower than fp16; 3-4x slower is about what we'd expect for dynamic 8-bit quantisation (the slowdown should be smaller with dynamic 4-bit quantisation). See the results for Whisper ASR, which benchmark models of a similar size on the speech recognition task: https://github.com/huggingface/peft/discussions/477
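If you want to see the memory-vs-speed trade-off for yourself, a rough sketch of loading the 8-bit model (assuming the facebook/musicgen-small checkpoint, with bitsandbytes and accelerate installed):

```python
from transformers import BitsAndBytesConfig, MusicgenForConditionalGeneration

# 8-bit weights cut memory roughly in half versus fp16, but each matmul pays a
# dequantisation overhead, which is why generation can be slower than plain fp16.
model_8bit = MusicgenForConditionalGeneration.from_pretrained(
    "facebook/musicgen-small",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
print(f"8-bit footprint: {model_8bit.get_memory_footprint() / 1e9:.2f} GB")
```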

Thanks. Half precision fixes this for me. Still not sure why quantization ends up being slower than fp16. I guess quantization only improves inference speed in LLMs?

Indeed - we have (relatively) small matmuls and large inputs, which makes the 8-bit bnb algorithm quite slow for MusicGen. You can try the newer 4-bit algorithm, which should be faster: https://huggingface.co/blog/4bit-transformers-bitsandbytes
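A rough sketch of what 4-bit loading could look like (the NF4 quant type and fp16 compute dtype here are example settings from the blog post above, not a tested configuration for MusicGen):

```python
import torch
from transformers import BitsAndBytesConfig, MusicgenForConditionalGeneration

# NF4 4-bit weights with fp16 compute, as described in the blog post linked above
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model_4bit = MusicgenForConditionalGeneration.from_pretrained(
    "facebook/musicgen-small",
    quantization_config=quant_config,
    device_map="auto",
)
```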
