---
tags:
- gptq
- 4bit
- gptqmodel
- modelcloud
- gemma2
---

**This model has been quantized using [GPTQModel](https://github.com/ModelCloud/GPTQModel).**

- bits: 4
- group_size: 128
- desc_act: true
- static_groups: false
- sym: true
- lm_head: false
- damp_percent: 0.01
- true_sequential: true
- model_name_or_path: ""
- model_file_base_name: "model"
- quant_method: "gptq"
- checkpoint_format: "gptq"
- meta:
  - quantizer: "gptqmodel:0.9.9-dev0"

**Currently, only vLLM can load the quantized gemma-2-27b for proper inference. Here is an example:**

```python
import os

# Gemma-2 requires the FlashInfer attention backend because the model uses logits_soft_cap;
# with other backends the output may be wrong.
os.environ['VLLM_ATTENTION_BACKEND'] = 'FLASHINFER'

from transformers import AutoTokenizer
from gptqmodel import BACKEND, GPTQModel

model_name = "ModelCloud/gemma-2-27b-it-gptq-4bit"

prompt = [{"role": "user", "content": "I am in Shanghai, preparing to visit the natural history museum. Can you tell me the best way to"}]

# Load the quantized checkpoint through the vLLM backend
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = GPTQModel.from_quantized(
    model_name,
    backend=BACKEND.VLLM,
)

# Render the chat template to a string prompt and generate
inputs = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)
outputs = model.generate(prompts=inputs, temperature=0.95, max_length=128)
print(outputs[0].outputs[0].text)
```
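
For reference, a quantization recipe along these lines can reproduce the settings listed above. This is a minimal sketch, assuming GPTQModel's `QuantizeConfig`, `from_pretrained`, `quantize`, and `save_quantized` APIs around the version noted in `meta`; the base model id, calibration texts, and output directory are placeholders, not the exact inputs used for this checkpoint.

```python
from transformers import AutoTokenizer
from gptqmodel import GPTQModel, QuantizeConfig

base_model = "google/gemma-2-27b-it"   # assumed base checkpoint
save_dir = "gemma-2-27b-it-gptq-4bit"  # placeholder output path

# Mirror the main settings from the configuration listed above
quantize_config = QuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=True,
    sym=True,
    damp_percent=0.01,
)

tokenizer = AutoTokenizer.from_pretrained(base_model)

# Calibration data: placeholder texts; a real run would use several hundred
# representative samples from the target domain
calibration_dataset = [
    tokenizer("GPTQ calibrates per-layer quantization against sample activations."),
    tokenizer("Replace these strings with text drawn from your target domain."),
]

model = GPTQModel.from_pretrained(base_model, quantize_config)
model.quantize(calibration_dataset)
model.save_quantized(save_dir)
```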