PerryCheng614 committed · Commit 92d2fec · verified · 1 Parent(s): 5838923

Upload README.md with huggingface_hub

Files changed (1): README.md (+77 -0)
README.md ADDED
@@ -0,0 +1,77 @@
# Running the Quantized Model

This repository contains a quantized version of the model, optimized for efficient inference while maintaining performance.

## Requirements

```bash
pip install auto-gptq
pip install transformers
```
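
auto-gptq ships CUDA kernels and the script below expects a GPU, so it can be worth confirming the environment before loading a model. A minimal sanity check (a sketch added here, not part of the original repository):

```python
# Optional environment check: confirms a CUDA device is visible and that
# the required packages are installed before loading the quantized model.
from importlib.metadata import version

import torch

assert torch.cuda.is_available(), "GPTQ inference here expects a CUDA-capable GPU"
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"auto-gptq: {version('auto-gptq')}, transformers: {version('transformers')}")
```
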
## Usage

You can run the quantized model with the script below, which handles the full setup and inference pipeline.

Script (`run_quantized_model.py`):

```python
import argparse

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM


def run_inference(model_repo_id):
    # Load the tokenizer (tokenizers are CPU-side; no device argument is needed)
    tokenizer = AutoTokenizer.from_pretrained(model_repo_id, trust_remote_code=True)

    # Load the quantized model onto the first GPU
    model = AutoGPTQForCausalLM.from_quantized(model_repo_id, device="cuda:0")

    # Example prompt
    prompt = "Tell me a story of 100 words."

    # Apply the chat template if the tokenizer defines one
    if getattr(tokenizer, "chat_template", None):
        messages = [{"role": "user", "content": prompt}]
        prompt = tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )

    # Check that the prompt fits within the model's maximum length
    if len(tokenizer(prompt)["input_ids"]) >= tokenizer.model_max_length:
        raise ValueError("Prompt is too long for the model's maximum length")

    # Tokenize and generate
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        # Fall back to the EOS token for padding (e.g., Llama has no pad token)
        pad_token_id=tokenizer.pad_token_id
        if tokenizer.pad_token_id is not None
        else tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id,
        max_new_tokens=512,
    )

    # Decode and print the result
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=False)
    print(generated_text)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Run inference with a quantized model")
    parser.add_argument("model_repo_id", type=str, help="The model repository ID or path")
    args = parser.parse_args()

    run_inference(args.model_repo_id)
```
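
As an alternative to the script above, recent transformers releases (roughly 4.32 and later) can also load GPTQ checkpoints directly through `AutoModelForCausalLM` when `optimum` and `auto-gptq` are installed. A minimal sketch of that path, reusing the example repository mentioned later in this README:

```python
# Alternative loading path using transformers' built-in GPTQ support.
# Assumes transformers >= 4.32 plus the optimum and auto-gptq packages.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_repo_id = "TheBloke/Llama-2-7B-GPTQ"  # example repo; substitute your own

tokenizer = AutoTokenizer.from_pretrained(model_repo_id)
model = AutoModelForCausalLM.from_pretrained(model_repo_id, device_map="auto")

inputs = tokenizer("Tell me a story of 100 words.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
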
### Basic Usage

```bash
python run_quantized_model.py MODEL_REPO_ID
```

Replace `MODEL_REPO_ID` with either:
- The Hugging Face model repository ID (e.g., "TheBloke/Llama-2-7B-GPTQ")
- A local path to the model

### Example

```bash
python run_quantized_model.py TheBloke/Llama-2-7B-GPTQ
```