RuntimeError: expected scalar type Float but found Half
The model seems to load fine, but trying to generate text with it just throws "RuntimeError: expected scalar type Float but found Half".
Any idea what this could be? I loaded the model as bfloat16.
Edit: my bad. I loaded the model as OPT instead of GPTJ in oobabooga. >_<
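For anyone who lands here later: this error is a plain dtype mismatch, usually float32 ("Float") activations hitting float16 ("Half") weights somewhere in the forward pass. A minimal sketch of how it reproduces in plain PyTorch, assuming that's the cause here too (the exact call site inside the webui may differ):

```python
import torch

# Half-precision weights, like those in a fp16/4-bit checkpoint.
weight = torch.randn(8, 8, dtype=torch.float16)

# Float32 activations, e.g. from a module that was never cast down.
x = torch.randn(1, 8, dtype=torch.float32)

# Mixing dtypes in the matmul raises a RuntimeError along the lines of
# "expected scalar type Float but found Half" (exact wording varies by
# PyTorch version and backend).
try:
    y = x @ weight.T
except RuntimeError as e:
    print(e)

# Casting both sides to the same dtype fixes it.
y = x.to(weight.dtype) @ weight.T
print(y.dtype)  # torch.float16
```

So the fix is always to make the loader cast everything to one dtype, which is what picking the correct model type does for you.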
Has anybody gotten this model to load in oobabooga? I'm also getting "RuntimeError: expected scalar type Float but found Half".
Using --wbits 4 --groupsize 128 (no model_type given)
I've seen other people set model_type to gptj, but then it runs slow as hell.
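For what it's worth, GPT4-X-Alpaca is a LLaMA fine-tune, so if a model_type is needed at all it should be llama, not gptj or opt. A sketch of the full launch command, assuming the spring-2023 text-generation-webui flag names (they may have changed on newer builds):

```
python server.py --model GPT4-X-Alpaca-30B-Int4 --wbits 4 --groupsize 128 --model_type llama
```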
I'm having no problem loading MetaIX/GPT4-X-Alpaca-30B-Int4/gpt4-x-alpaca-30b-128g-4bit.safetensors with just --wbits 4 --groupsize 128.
I think that's just what you have to deal with. 30 billion parameters is a LOT of data. I'm running this on a 4090 and get around 0.4 tokens/s, while a 13B model gives me more like 9 tokens/s.
Jakxx, the only problem is that my other 30-billion-parameter model performs OK on a 3090:
Found the following quantized model: models/GPT4-X-Alpaca-30B-Int4/gpt4-x-alpaca-30b-128g-4bit.safetensors
Loading model ...
Done.
Loaded the model in 18.39 seconds.
Output generated in 9.74 seconds (6.26 tokens/s, 61 tokens, context 218, seed 3259645)
Try MetaIX/GPT4-X-Alpaca-30B-Int4 on your 4090.
I believe this model was quantized with a nonstandard fork of GPTQ-for-LLaMa, while most people are using the standard GPTQ-for-LLaMa with text-generation-webui. Hopefully we'll see another quantization of this model.
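For reference, the standard qwopqwop200/GPTQ-for-LLaMa repo quantizes a LLaMA checkpoint roughly like this (a sketch based on its early-2023 README; the model path and output filename are illustrative, and flags may have changed since):

```
python llama.py /path/to/gpt4-x-alpaca-30b c4 --wbits 4 --groupsize 128 --save_safetensors gpt4-x-alpaca-30b-128g-4bit.safetensors
```

A checkpoint produced this way should load with the standard GPTQ-for-LLaMa code that text-generation-webui ships with, which is presumably why the MetaIX quant works out of the box.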
Aah well, here's hoping.