Quant of https://huggingface.co/junelee/wizard-vicuna-13b tested working with Occam's KoboldAI/GPTQ.

Someone made a Triton quant already here, but it will not work with Occam's KoboldAI/GPTQ fork: https://huggingface.co/fbjr/wizard-vicuna-13b-4bit-128g

Note that this model is fairly heavily censored (in my opinion) and delivers AI-moralizing responses to prompts that Vicuna 1.1 does not complain about.

python llama.py ./wizard-vicuna-13b c4 --wbits 4 --true-sequential --groupsize 128 --save_safetensors 4bit-128g.safetensors

Downloads last month
12
Inference Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.

Space using tsumeone/wizard-vicuna-13b-4bit-128g-cuda 1