# Running the Quantized Model

This repository contains a quantized version of the model, optimized for efficient inference while maintaining performance.

## Requirements

```bash
pip install auto-gptq
pip install transformers
```

## Usage

You can run the quantized model with the provided script, which handles all the necessary setup and the inference pipeline.

Script:

```python
import argparse

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM


def run_inference(model_repo_id):
    # Load the tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_repo_id, trust_remote_code=True)

    # Load the quantized model onto the first GPU
    model = AutoGPTQForCausalLM.from_quantized(model_repo_id, device="cuda:0")

    # Using the same prompt format as in load_data
    prompt = "Tell me a story of 100 words."

    # Apply the chat template if the tokenizer provides one
    if hasattr(tokenizer, "apply_chat_template"):
        messages = [{"role": "user", "content": prompt}]
        prompt = tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )

    # Check that the prompt fits within the model's context window
    if len(tokenizer(prompt)["input_ids"]) >= tokenizer.model_max_length:
        raise ValueError("Prompt is too long for the model's maximum length")

    # Tokenize and generate
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
        max_new_tokens=512,
    )

    # Decode and print the result
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(generated_text)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Run inference with a quantized model")
    parser.add_argument("model_repo_id", type=str, help="The model repository ID or path")
    args = parser.parse_args()
    run_inference(args.model_repo_id)
```

### Basic Usage

```bash
python run_quantized_model.py MODEL_REPO_ID
```

Replace `MODEL_REPO_ID` with either:

- The Hugging Face model repository ID (e.g., "TheBloke/Llama-2-7B-GPTQ")
- A local path to the model

### Example

```bash
python run_quantized_model.py TheBloke/Llama-2-7B-GPTQ
```
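
### Checking GPU Availability

The script loads the quantized model onto `cuda:0`, so a CUDA-capable GPU and a CUDA build of PyTorch (installed as a dependency of `auto-gptq`) are required. As a quick sanity check before running the script, you can confirm that PyTorch can see your GPU; this is a minimal standalone check, not part of the script above:

```python
import torch

# from_quantized(..., device="cuda:0") needs a visible CUDA device.
assert torch.cuda.is_available(), "No CUDA device found; the script expects a GPU"
print(torch.cuda.get_device_name(0))
```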
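
### Adjusting Generation Parameters

The script relies on the model's default decoding settings. If you want more varied output, `model.generate` accepts the standard `transformers` sampling arguments. Below is a minimal sketch, assuming `model`, `tokenizer`, and `inputs` are set up as in the script above; the specific temperature and top-p values are illustrative, not tuned recommendations:

```python
# Sampled generation instead of the defaults used in run_inference.
# The temperature/top_p values are illustrative assumptions, not tuned settings.
outputs = model.generate(
    **inputs,
    do_sample=True,    # enable sampling rather than greedy decoding
    temperature=0.7,   # soften the token distribution
    top_p=0.95,        # nucleus sampling cutoff
    max_new_tokens=512,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```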