---
license: cc-by-4.0
---

Experimental quantization. 

Working inference code. Regular inference with AutoGPTQ does not work unless `return_token_type_ids=False` is passed to the tokenizer, and I could not get the model to work with text-generation-webui:

from auto_gptq import AutoGPTQForCausalLM 

from transformers import AutoTokenizer 

quantized_model_dir = "path/to/quantized-model"  # placeholder: set this to the directory (or repo id) holding this model 

tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, use_fast=True) 

model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0", use_triton=False) 

input_ids = tokenizer("Question: What is the purpose of life?\n\nAnswer:", return_tensors="pt").input_ids.to("cuda:0") 

out = model.generate(input_ids=input_ids, max_length=300) 

print(tokenizer.decode(out[0]))

Or, as a one-liner that passes the full tokenizer output to `generate`:

print(tokenizer.decode(model.generate(**tokenizer("test is", return_tensors="pt", return_token_type_ids=False).to("cuda:0"))[0]))
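If the tokenizer call cannot easily be changed, the same effect can presumably be had by dropping the `token_type_ids` entry before calling `generate`. The snippet below is a minimal sketch, not part of this repository: `drop_token_type_ids` is a hypothetical helper name, and it assumes the tokenizer output behaves like a mapping from field names to values.

```python
from typing import Any, Dict, Mapping


def drop_token_type_ids(encoding: Mapping[str, Any]) -> Dict[str, Any]:
    """Return a copy of a tokenizer output without the `token_type_ids` field.

    Hypothetical helper: assumes `encoding` is dict-like and leaves the
    original object unmodified.
    """
    return {key: value for key, value in encoding.items() if key != "token_type_ids"}


# Stand-in for a tokenizer output, using plain lists instead of tensors.
example = {
    "input_ids": [[1, 2, 3]],
    "attention_mask": [[1, 1, 1]],
    "token_type_ids": [[0, 0, 0]],
}
cleaned = drop_token_type_ids(example)
print(sorted(cleaned))  # ['attention_mask', 'input_ids']
```

With a real tokenizer the result is a plain dict, so each tensor would have to be moved to the GPU individually before being unpacked into `model.generate(**inputs)`.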