This model has been quantized using [GPTQModel](https://github.com/ModelCloud/GPTQModel). - **bits**: 4 - **group_size**: 128 - **desc_act**: true - **static_groups**: false - **sym**: true - **lm_head**: false - **damp_percent**: 0.01 - **true_sequential**: true - **model_name_or_path**: "" - **model_file_base_name**: "model" - **quant_method**: "gptq" - **checkpoint_format**: "gptq" - **meta**: - **quantizer**: "gptqmodel:0.9.9-dev0" You can use [GPTQModel](https://github.com/ModelCloud/GPTQModel) for model inference. ```python import torch from transformers import AutoTokenizer, GenerationConfig from gptqmodel import GPTQModel model_name = "ModelCloud/DeepSeek-V2-Chat-0628-gptq-4bit" tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True) # `max_memory` should be set based on your devices max_memory = {i: "75GB" for i in range(2)} # `device_map` cannot be set to `auto` model = GPTQModel.from_quantized(model_name, trust_remote_code=True, device_map="sequential", max_memory=max_memory, torch_dtype=torch.float16, attn_implementation="eager") model.generation_config = GenerationConfig.from_pretrained(model_name) model.generation_config.pad_token_id = model.generation_config.eos_token_id messages = [ {"role": "user", "content": "Write a piece of quicksort code in C++"} ] input_tensor = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt") outputs = model.generate(input_ids=input_tensor.to(model.device), max_new_tokens=100) result = tokenizer.decode(outputs[0][input_tensor.shape[1]:], skip_special_tokens=True) print(result) ```