Has anybody gotten a quantized version running with vLLM?
I'm not having any luck getting the quantized versions (Unsloth or AWQ) to work with vLLM.
I did a W8A8 quantization of the abliterated version and ran inference with vLLM; everything worked fine on dual 2080 Ti 22G cards.
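In case it helps anyone reproduce a setup like this, here is a minimal sketch of the vLLM side: loading a local W8A8 checkpoint across two GPUs with the offline Python API. The checkpoint path and sampling settings are placeholders, not my exact configuration.

```python
# Minimal sketch: offline vLLM inference with a local W8A8 checkpoint
# split across two GPUs. The path and settings below are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="path/to/Mistral-Small-24B-Instruct-2501-abliterated-W8A8",  # placeholder path
    tensor_parallel_size=2,   # dual 2080 Ti 22G
    dtype="half",             # Turing GPUs have no bfloat16 support
    max_model_len=8192,       # keep the KV cache within 2x22 GB
)

params = SamplingParams(temperature=0.15, max_tokens=256)
outputs = llm.generate(["Give me a one-sentence summary of vLLM."], params)
print(outputs[0].outputs[0].text)
```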
stelterlab/Mistral-Small-24B-Instruct-2501-AWQ worked for me with a 4090
I was able to get it running; I was missing the --enforce-eager parameter.
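For reference, the CLI's --enforce-eager corresponds to enforce_eager=True in the Python API. Here is a minimal sketch of loading the AWQ checkpoint that way on a single 24 GB card; apart from the model name and enforce_eager, the settings are just my guess at a reasonable configuration.

```python
# Minimal sketch: the AWQ checkpoint from this thread with eager mode on.
# Only the model name and enforce_eager come from the thread; the rest is guesswork.
from vllm import LLM

llm = LLM(
    model="stelterlab/Mistral-Small-24B-Instruct-2501-AWQ",
    quantization="awq",
    enforce_eager=True,           # skip CUDA graph capture (the missing flag)
    max_model_len=16384,          # trim context to fit a single 24 GB card
    gpu_memory_utilization=0.90,
)

print(llm.generate(["Hello!"])[0].outputs[0].text)
```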
Now I'm trying to figure out why function calling doesn't work.
mistral_common.exceptions.InvalidMessageStructureException: Unexpected role 'system' after role 'tool'
OK, it was just a matter of changing the message order so the system prompt comes first. Exceptional local model!
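For anyone hitting the same exception: with the Mistral tokenizer, the system prompt has to be the first message instead of appearing after a tool result. A minimal sketch against a local vLLM OpenAI-compatible server; the server URL, tool schema, and prompts are placeholders, and the server needs tool calling enabled (e.g. --enable-auto-tool-choice and --tool-call-parser mistral).

```python
# Minimal sketch: tool calling against a local vLLM OpenAI-compatible server.
# URL, API key, tool schema, and prompts are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# The system prompt must come first; placing it after a 'tool' message is what
# triggers mistral_common's InvalidMessageStructureException.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What's the weather in Lisbon?"},
]

response = client.chat.completions.create(
    model="stelterlab/Mistral-Small-24B-Instruct-2501-AWQ",
    messages=messages,
    tools=tools,
    tool_choice="auto",
)
print(response.choices[0].message)
```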
@alecauduro, how'd you get it working with vLLM? What flags did you pass, and which GPU did you get it running on?