Inference generation extremely slow

#57

by aledane - opened Dec 20, 2023

Discussion

aledane

Dec 20, 2023

•

edited Dec 20, 2023

Hi,
I am using the quantized version of the model with these settings:

"params": {
    "trust_remote_code": True,
    "torch_dtype": torch.bfloat16,
    "return_full_text": True,
    "device_map": "auto",
    "max_new_tokens": 16,
    "do_sample": True,
    "temperature": 0.01,
    "renormalize_logits": True
},

However, inference is extremely slow (it has been running for 1 hour on a simple question).
I am running the model on a g5.4xlarge SageMaker instance (16 vCPUs, 64 GB RAM, NVIDIA A10G GPU).
Any idea how to speed this up? Thanks

ybelkada

Dec 20, 2023

Hi @aledane

I am using the model in the quantization version

Can you elaborate more? Which quantization method are you using?

aledane

Dec 20, 2023

I was trying to use this: https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ
However, I realized I was not really using it due to a coding mistake; instead, I was deploying the original version mistralai/Mixtral-8x7B-Instruct-v0.1.

Other than quantization, is there any way to speed up generation with the original model?
Is it just a matter of resources (so I have to use a larger SageMaker instance), or is there another way?

seabasshn

Dec 20, 2023

Can you share your code and the versions of the SageMaker SDK and TGI you used? I've been trying to deploy both models on SageMaker but haven't been able to.

ybelkada

Dec 20, 2023

Hi @aledane
I suspect your model is being silently loaded with CPU offloading because you don't have enough GPU memory. Make sure to load it in half precision by passing torch_dtype=torch.float16 to from_pretrained, or load the model in 4-bit precision through the bitsandbytes package so that it fits on your GPU.
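
One way to check for silent offloading is to inspect the device map that accelerate assigns when device_map="auto" is used. The following is only a rough sketch: the helper names offloaded_modules and load_model are made up for illustration, the default model id is taken from earlier in this thread, and whether the 4-bit weights actually fit depends on how much memory your GPU has. Loading requires transformers, accelerate, bitsandbytes and a CUDA GPU.

```python
from typing import Dict, List, Union


def offloaded_modules(device_map: Dict[str, Union[int, str]]) -> List[str]:
    """Return the module names that a device map places on CPU or disk."""
    return [
        name for name, device in device_map.items()
        if str(device) in ("cpu", "disk")
    ]


def load_model(model_id: str = "mistralai/Mixtral-8x7B-Instruct-v0.1",
               four_bit: bool = False):
    """Load a causal LM and report whether any module was offloaded.

    Hypothetical helper for illustration, not part of transformers.
    """
    # Heavy imports are deferred so offloaded_modules() works without them.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    kwargs = {"device_map": "auto", "torch_dtype": torch.float16}
    if four_bit:
        # 4-bit weights through bitsandbytes, with fp16 compute.
        kwargs["quantization_config"] = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.float16,
        )
    model = AutoModelForCausalLM.from_pretrained(model_id, **kwargs)

    # hf_device_map is populated when the model is loaded with a device_map.
    spilled = offloaded_modules(model.hf_device_map)
    if spilled:
        print(f"{len(spilled)} modules are on CPU/disk; generation will be slow")
    return model
```

If offloaded_modules returns a non-empty list for the half-precision load, the slowdown is most likely caused by weights being moved between CPU and GPU on every forward pass.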

aledane

Dec 20, 2023

Hi @ybelkada, thank you for your reply. I am already using a 16-bit dtype (torch.bfloat16) in from_pretrained, as shown in the set of parameters above.
I can try 4-bit precision though, even if I honestly do not think it will change much.

MrDragonFox

Dec 21, 2023

You will need more than one A10 for that. FP16 takes about 90 GB of VRAM, so 2x A100/H100 or 2x A6000 are fine either way.
6 bpw on ExLlamaV2 takes about 38 GB, so you can cram that into a single A6000.
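
The figures above can be sanity-checked with a back-of-the-envelope estimate of weight memory: parameter count times bits per weight, divided by 8. This is a minimal sketch that assumes roughly 46.7 billion total parameters for Mixtral 8x7B (an assumption, not stated in this thread) and counts weights only, so real usage will be somewhat higher once the KV cache and activations are included.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumption: approximate total parameter count of Mixtral 8x7B.
MIXTRAL_PARAMS = 46.7e9

for label, bits in [("fp16/bf16", 16), ("6 bpw", 6), ("4-bit", 4)]:
    size = weight_memory_gb(MIXTRAL_PARAMS, bits)
    print(f"{label}: about {size:.0f} GB of weights")
```

Under that assumption, the 16-bit weights alone are several times larger than the memory of a single 24 GB GPU, which is consistent with the offloading explanation given earlier in the thread.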
