Running the Quantized Model
This repository contains a GPTQ-quantized version of the model, intended for efficient GPU inference with limited loss in output quality.
Requirements
pip install auto-gptq
pip install transformers
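Before running the script, it can help to confirm that the packages are importable. The following is a minimal sketch, not part of the repository: the helper name `missing_modules` is hypothetical, and `torch` is included on the assumption that the PyTorch backend is used (the script requests PyTorch tensors), even though it is not listed above.

```python
from importlib.util import find_spec

# Import names, not pip names: the auto-gptq package is imported as auto_gptq.
# torch is an assumption; it is not listed in the requirements above.
REQUIRED_MODULES = ["torch", "transformers", "auto_gptq"]


def missing_modules(names):
    """Return the top-level module names that cannot be found (hypothetical helper)."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are installed.")
```

The check only locates the modules without importing them, so it runs quickly and does not need a GPU.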
Usage
Run the quantized model with the script below. It loads the tokenizer and the quantized model, applies the chat template when one is available, generates up to 512 new tokens, and prints the result.
Script (run_quantized_model.py):
import argparse

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM


def run_inference(model_repo_id):
    # Load the tokenizer (it runs on the CPU, so it takes no device argument)
    tokenizer = AutoTokenizer.from_pretrained(model_repo_id, trust_remote_code=True)
    # Load the quantized model onto the first GPU
    model = AutoGPTQForCausalLM.from_quantized(model_repo_id, device="cuda:0")

    # Example prompt
    prompt = "Tell me a story of 100 words."
    # Apply the chat template only if the tokenizer actually defines one
    if getattr(tokenizer, "chat_template", None):
        messages = [{"role": "user", "content": prompt}]
        prompt = tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )

    # Tokenize once, then check that the prompt is within the length limit
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    if inputs["input_ids"].shape[1] >= tokenizer.model_max_length:
        raise ValueError("Prompt is too long for the model's maximum length")

    # Fall back to the EOS token for padding when no pad token is defined
    pad_token_id = tokenizer.pad_token_id
    if pad_token_id is None:
        pad_token_id = tokenizer.eos_token_id

    outputs = model.generate(
        **inputs,
        pad_token_id=pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
        max_new_tokens=512,
    )

    # Decode and print the result (prompt included, special tokens kept)
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=False)
    print(generated_text)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Run inference with a quantized model")
    parser.add_argument("model_repo_id", type=str, help="The model repository ID or path")
    args = parser.parse_args()
    run_inference(args.model_repo_id)
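The length check in the script looks only at the prompt, while generation may add up to 512 more tokens. One possible refinement is to cap the number of new tokens by the space left in the context window. This is a sketch under assumptions: `clamp_new_tokens` is a hypothetical helper, and `context_length` must be the model's real context size, since some tokenizers report a very large placeholder for `model_max_length` when the limit is not configured.

```python
def clamp_new_tokens(prompt_tokens, max_new_tokens, context_length):
    """Cap max_new_tokens so prompt + generation fits in the context window.

    Hypothetical helper: context_length is assumed to be the model's true
    context size, supplied by the caller.
    """
    remaining = context_length - prompt_tokens
    if remaining <= 0:
        raise ValueError("Prompt is too long for the model's maximum length")
    return min(max_new_tokens, remaining)
```

The returned value could then be passed as `max_new_tokens` to `model.generate` in place of the fixed 512.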
Basic Usage
python run_quantized_model.py MODEL_REPO_ID
Replace MODEL_REPO_ID with either:
- The Hugging Face model repository ID (e.g., "TheBloke/Llama-2-7B-GPTQ")
- A local path to the model
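The two kinds of argument can usually be told apart by checking whether the string names an existing directory, which is how `from_pretrained`-style loaders typically decide between local files and a Hub download. A minimal sketch of that check, with a hypothetical helper name:

```python
import os


def model_source(model_repo_id):
    """Classify the argument as a local directory or a Hub repository ID.

    Hypothetical helper for illustration; the loaders perform their own check.
    """
    return "local" if os.path.isdir(model_repo_id) else "hub"
```

Note that a relative directory whose name matches a repository ID takes precedence under this rule, so a folder named like the repository in the current working directory would be treated as a local model.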
Example
python run_quantized_model.py TheBloke/Llama-2-7B-GPTQ