Quantized Versions?
This model is just out of reach for most consumers with 24 GB of VRAM. Will you be providing any quantized versions?
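For a rough sense of why: at 16-bit precision the weights alone already come to roughly 24 GB, before the vision encoder and KV cache. A back-of-envelope sketch (assuming ~12B parameters at 2 bytes each; the real footprint is higher):

```python
# Rough VRAM estimate for the weights alone.
# Assumptions: ~12B parameters stored in fp16/bf16 (2 bytes each);
# ignores activations, the vision encoder's working memory, and the KV cache.
params = 12e9
bytes_per_param = 2
print(f"~{params * bytes_per_param / 1e9:.0f} GB just for the weights")  # -> ~24 GB
```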
Same
Normally others in the community do that if you are unable to do it yourself. It takes a little bit of time, I have noticed, but it does happen.
(OK, so I'm wrong on timing, it's already been done. A quick search found a couple of people who have done it already.)
Really? All I'm seeing is a few GGUFs that can't be used for inference at all.
I didn't try them, I just noticed it was already happening.
I tried running it on an A100 (40 GB). Still out of memory! :(
I was testing and it took around 60 GB of VRAM running the example code.
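For reference, a minimal sketch along the lines of the vLLM example (the image URL is just a placeholder, and exact parameters may differ from the official snippet). On top of the ~24 GB of bf16 weights, vLLM preallocates KV-cache memory, which is a large part of why usage climbs toward that figure:

```python
# Minimal sketch of running Pixtral through vLLM.
# Assumptions: a vLLM build with Pixtral support; the image URL is just a placeholder.
from vllm import LLM
from vllm.sampling_params import SamplingParams

llm = LLM(model="mistralai/Pixtral-12B-2409", tokenizer_mode="mistral")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            {"type": "image_url", "image_url": {"url": "https://picsum.photos/id/237/200/300"}},
        ],
    },
]

outputs = llm.chat(messages, sampling_params=SamplingParams(max_tokens=256))
print(outputs[0].outputs[0].text)
```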
60 GB is crazy. Does it have any noticeably better results than Flux?
Hi @sonam-shrish,
Correct me if I'm wrong, but Pixtral is a vision-language model while FLUX is a text-to-image model, meaning the comparison would be like comparing apples and pears.
Best,
M
It appears all quantization methods in vLLM rely on transformers.
GGUF version?
Hi @mikehemberger,
You're right, I didn't know that Pixtral is only a VLM and doesn't generate images.
I didn't read the model card properly.
Thanks :)
Best,
Sonam
You're welcome @sonam-shrish,
I hope at some point Mistral AI will also tackle text-to-image, though ;-)
Yeah, that would be cool.
Hey guys,
idk if I am just stupid rn, but where did you find quantized versions?
I just did a simple search at the top of the HF page. Though as others have mentioned, they may not be ready for prime time yet.
Please stop forcing people to buy higher-end NVIDIA cards!
We all hate being on the GPU treadmill, but if you want to move forward, bigger is just going to be the way. Hopefully in the near future shared-RAM NPUs will be able to take the strain of many of these larger models and we can skip the GPU.
Hopefully, not
You hope we don't get cheaper hardware that can take the place of GPUs? You work for NVIDIA or something?
Oh, sorry, I misread it! I mean, hopefully we'll get better GPUs/NPUs, or whatever, but not SHARED GPUs (like the global miners I saw in some interview).
There is a transformers-compatible version of this model: mistral-community/pixtral-12b
I've been able to load this in 4-bit quantization, using about 10 GB of VRAM:
```python
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, LlavaForConditionalGeneration

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model_id = "mistral-community/pixtral-12b"
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="cuda",
    quantization_config=quantization_config,
)
processor = AutoProcessor.from_pretrained(model_id)
```
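For completeness, a quick inference sketch on top of that load (the [INST]/[IMG] prompt format and the image URL are assumptions on my part; check the mistral-community/pixtral-12b model card for the exact chat template):

```python
# Quick inference sketch.
# Assumptions: Llava-style "<s>[INST] ... [IMG][/INST]" prompt and a placeholder image URL;
# verify the exact template on the model card.
import requests
from PIL import Image

image = Image.open(requests.get("https://picsum.photos/id/237/400/300", stream=True).raw)
prompt = "<s>[INST]Describe this image in one sentence.\n[IMG][/INST]"

inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device, torch.float16)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```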
There's also a vLLM PR: https://github.com/vllm-project/vllm/pull/9036