Anybody got a quantized version to run with vLLM?

#24
by alecauduro - opened

I'm not having any luck getting the quantized versions (Unsloth or AWQ) to work with vLLM.

I completed W8A8 quantization of the abliterated version and ran inference with vLLM; everything worked fine on dual 2080 Ti 22G cards.
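Roughly, loading that kind of W8A8 checkpoint across the two cards looks like this in vLLM (the model path is a placeholder and the settings are just a sketch, not my exact command):

```python
# Sketch: loading a W8A8 compressed checkpoint in vLLM, split across two GPUs.
# The model path below is a placeholder, not a real repo.
from vllm import LLM, SamplingParams

llm = LLM(
    model="your-org/Mistral-Small-24B-abliterated-W8A8",  # hypothetical path
    tensor_parallel_size=2,   # shard the model across the two 2080 Ti cards
    max_model_len=8192,       # keep the KV cache within 22 GB per card
)

params = SamplingParams(temperature=0.15, max_tokens=256)
out = llm.generate(["Give me three uses for a spare GPU."], params)
print(out[0].outputs[0].text)
```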

stelterlab/Mistral-Small-24B-Instruct-2501-AWQ worked for me with a 4090

I was able to get it running; I was missing the --enforce-eager parameter.
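For anyone else stuck there, this is roughly the offline equivalent; enforce_eager=True is the Python-API counterpart of the --enforce-eager flag (it disables CUDA graph capture, trading some speed for lower memory overhead). The checkpoint is the AWQ one mentioned above and the other settings are guesses for a 24 GB card, not my exact launch command:

```python
# Sketch only: AWQ checkpoint on a single 24 GB GPU, eager mode enabled.
from vllm import LLM, SamplingParams

llm = LLM(
    model="stelterlab/Mistral-Small-24B-Instruct-2501-AWQ",
    quantization="awq",
    enforce_eager=True,          # same effect as the --enforce-eager CLI flag
    gpu_memory_utilization=0.90,
)

out = llm.generate(["Hello!"], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```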
Now I'm trying to figure out why function calling doesn't work.

mistral_common.exceptions.InvalidMessageStructureException: Unexpected role 'system' after role 'tool'

OK, it was just a matter of changing the message order so the system prompt comes first. Exceptional local model!
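For reference, the fix is only the message ordering; something like this, where the tool definition and model name are placeholder examples rather than my actual setup:

```python
# Illustrative sketch of the fix: the system message goes first, before any
# user/assistant/tool turns, when calling vLLM's OpenAI-compatible server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical example tool
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [
    {"role": "system", "content": "You are a helpful assistant."},  # system first
    {"role": "user", "content": "What's the weather in Lisbon?"},
]

resp = client.chat.completions.create(
    model="Mistral-Small-24B-Instruct-2501-AWQ",  # whatever name vLLM serves
    messages=messages,
    tools=tools,
)
print(resp.choices[0].message.tool_calls)
```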

@alecauduro, how'd you get it working with vLLM? What were the flags that you passed and which GPU did you get it running on?
