alecocc committed
Commit d7a5687
1 Parent(s): 3df32f3

Update README.md

Files changed (1):
  1. README.md +30 -0

README.md CHANGED
@@ -14,3 +14,33 @@ This repo contains AWQ model files for [PMC_LLaMA_13B](https://huggingface.co/axiong/PMC_LLaMA_13B)
 ### About AWQ
 
 AWQ is an efficient, accurate and blazing-fast low-bit weight quantization method, currently supporting 4-bit quantization. Compared to GPTQ, it offers faster Transformers-based inference with equivalent or better quality than the most commonly used GPTQ settings.
+
+- When using vLLM from Python code, again set `quantization="awq"`.
+
+For example:
+
+```python
+from vllm import LLM, SamplingParams
+
+prompts = [
+    "What is the mechanism of action of antibiotics?",
+    "How do statins work to lower cholesterol levels?",
+    "Tell me about Paracetamol",
+]
+
+sampling_params = SamplingParams(temperature=0.8)
+
+# Create the LLM with AWQ quantization; AWQ kernels run in float16.
+llm = LLM(model="axiong/PMC_LLaMA_13B", quantization="awq", dtype="half")
+
+outputs = llm.generate(prompts, sampling_params)
+
+# Print the outputs.
+for output in outputs:
+    prompt = output.prompt
+    generated_text = output.outputs[0].text
+    print(f"Prompt: {prompt}")
+    print(f"Response: {generated_text}")
+```
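
The "About AWQ" note also mentions faster Transformers-based inference. As a complement to the vLLM example above, here is a minimal sketch of loading the same checkpoint through Hugging Face Transformers. It rests on assumptions not shown in the diff: that the `autoawq` package is installed, and that the repo's `config.json` carries an AWQ `quantization_config`.

```python
# Minimal sketch (assumptions: `autoawq` is installed, and the checkpoint's
# config.json includes an AWQ quantization_config; neither is verified here).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "axiong/PMC_LLaMA_13B"  # same model id as in the vLLM example

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Transformers reads the quantization_config from the checkpoint and
# dispatches to the AWQ kernels; AWQ weights run in float16.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

prompt = "Tell me about Paracetamol"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(
    **inputs, max_new_tokens=128, do_sample=True, temperature=0.8
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Because the quantization method is recorded in the checkpoint itself, no explicit `quantization` argument is needed on the Transformers side, unlike the vLLM call above.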