Anybody got a quantized version to run with vLLM?

#24
by alecauduro - opened

I'm not having any luck getting the quantized versions (Unsloth or AWQ) to work with vLLM.

I completed W8A8 quantization of the abliterated version and ran inference with vLLM; everything worked fine on dual 2080 Ti 22G cards.
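Roughly, loading that kind of W8A8 checkpoint across the two cards looks like this in vLLM (the model path is a placeholder and the settings are just a sketch, not my exact command):

```python
# Sketch: loading a W8A8 compressed checkpoint in vLLM, split across two GPUs.
# The model path below is a placeholder, not a real repo.
from vllm import LLM, SamplingParams

llm = LLM(
    model="your-org/Mistral-Small-24B-abliterated-W8A8",  # hypothetical path
    tensor_parallel_size=2,   # shard the model across the two 2080 Ti cards
    max_model_len=8192,       # keep the KV cache within 22 GB per card
)

params = SamplingParams(temperature=0.15, max_tokens=256)
out = llm.generate(["Give me three uses for a spare GPU."], params)
print(out[0].outputs[0].text)
```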

stelterlab/Mistral-Small-24B-Instruct-2501-AWQ worked for me with a 4090

I was able to get it running; I was missing the --enforce-eager parameter.
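For anyone else stuck there, this is roughly the offline equivalent; enforce_eager=True is the Python-API counterpart of the --enforce-eager flag (it disables CUDA graph capture, trading some speed for lower memory overhead). The checkpoint is the AWQ one mentioned above and the other settings are guesses for a 24 GB card, not my exact launch command:

```python
# Sketch only: AWQ checkpoint on a single 24 GB GPU, eager mode enabled.
from vllm import LLM, SamplingParams

llm = LLM(
    model="stelterlab/Mistral-Small-24B-Instruct-2501-AWQ",
    quantization="awq",
    enforce_eager=True,          # same effect as the --enforce-eager CLI flag
    gpu_memory_utilization=0.90,
)

out = llm.generate(["Hello!"], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```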
Now I'm trying to figure out why function calling doesn't work.

mistral_common.exceptions.InvalidMessageStructureException: Unexpected role 'system' after role 'tool'

OK, it was just a matter of changing the message order so the system prompt comes first. Exceptional local model!
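For reference, the fix is only the message ordering; something like this, where the tool definition and model name are placeholder examples rather than my actual setup:

```python
# Illustrative sketch of the fix: the system message goes first, before any
# user/assistant/tool turns, when calling vLLM's OpenAI-compatible server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical example tool
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [
    {"role": "system", "content": "You are a helpful assistant."},  # system first
    {"role": "user", "content": "What's the weather in Lisbon?"},
]

resp = client.chat.completions.create(
    model="Mistral-Small-24B-Instruct-2501-AWQ",  # whatever name vLLM serves
    messages=messages,
    tools=tools,
)
print(resp.choices[0].message.tool_calls)
```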

@alecauduro, how'd you get it working with vLLM? What were the flags that you passed and which GPU did you get it running on?
