Inference taking so long

#14
by J812 - opened

Hi, thank you for this great model.
I am trying to run predictions on a picture locally on my PC (CPU only, I don't have a GPU).
It's been running for about an hour and a half with no prediction yet. Is this expected?

Llava Hugging Face org

Hey! It's recommended to run inference with these models on a GPU. On CPU this is expected, since it's a 34B model and llava-1.6 also processes more sub-images per input than other VLMs. From my experience, running a 7B model on CPU once took around 30-40 minutes.

If you cannot fit a 34B model on a small GPU, I recommend taking advantage of the different optimization methods we have. For example, load the model in 4-bit with bitsandbytes (docs here) or use Flash Attention for long-context sequences; a minimal sketch is shown below. Hope this helps!
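For reference, here is a rough sketch of 4-bit loading with bitsandbytes (and, optionally, Flash Attention 2) through transformers. The checkpoint name `llava-hf/llava-v1.6-34b-hf`, the prompt template, and the image path are assumptions, so adjust them to your setup; it requires a CUDA GPU plus recent `transformers` and `bitsandbytes` installs.

```python
import torch
from PIL import Image
from transformers import (
    BitsAndBytesConfig,
    LlavaNextForConditionalGeneration,
    LlavaNextProcessor,
)

# 4-bit quantization config (needs the bitsandbytes package and a CUDA GPU)
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model_id = "llava-hf/llava-v1.6-34b-hf"  # assumed checkpoint, swap in the one you use
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
    # Optional: uncomment if flash-attn is installed and your GPU supports it
    # attn_implementation="flash_attention_2",
)

image = Image.open("picture.jpg")  # assumed local image path
# Prompt template is an assumption; check the model card for the exact format
prompt = (
    "<|im_start|>user\n<image>\nWhat is shown in this image?<|im_end|>"
    "<|im_start|>assistant\n"
)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```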

Llava Hugging Face org

It will take forever to infer if you use 4-bit inference with the 34B model on a T4, because the weights are quantized/dequantized on the fly, so I don't recommend that. To be honest, there's no free lunch when it comes to fitting 34B models on smaller hardware.
