---
license: openrail
model_creator: axiong
model_name: PMC_LLaMA_13B
---
# PMC_LLaMA_13B - AWQ
- Model creator: [axiong](https://huggingface.co/axiong)
- Original model: [PMC_LLaMA_13B](https://huggingface.co/axiong/PMC_LLaMA_13B)

## Description

This repo contains AWQ model files for [PMC_LLaMA_13B](https://huggingface.co/axiong/PMC_LLaMA_13B).

### About AWQ

AWQ is an efficient, accurate, and fast low-bit weight quantization method that currently supports 4-bit quantization. Compared to GPTQ, it offers faster Transformers-based inference, with quality equivalent to or better than the most commonly used GPTQ settings.
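
To make "low-bit weight quantization" concrete, here is a minimal sketch of plain round-to-nearest 4-bit quantization over a single group of weights. It is not the AWQ algorithm itself, which additionally rescales weights based on activation statistics; the function names and the sample weights are hypothetical and chosen only for illustration.

```python
def quantize_group(weights, bits=4):
    """Round-to-nearest asymmetric quantization of one weight group (illustrative only)."""
    levels = (1 << bits) - 1                      # 15 integer steps for 4-bit codes
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / levels or 1.0       # fall back to 1.0 for a constant group
    # Map each weight to an integer code in [0, levels].
    codes = [min(levels, max(0, round((w - w_min) / scale))) for w in weights]
    return codes, scale, w_min


def dequantize_group(codes, scale, w_min):
    """Recover approximate floating-point weights from the integer codes."""
    return [c * scale + w_min for c in codes]


# Hypothetical weights standing in for one group of a weight matrix.
weights = [0.12, -0.53, 0.98, -0.07, 0.41, -0.88, 0.30, 0.05]

codes, scale, w_min = quantize_group(weights)
restored = dequantize_group(codes, scale, w_min)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("codes:", codes)
print(f"max abs error: {max_error:.4f}")
```

Each weight is stored as a 4-bit code plus a shared scale and offset per group, which is where the memory saving comes from; the reconstruction error per weight is bounded by half the scale.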

- When using vLLM from Python code, set `quantization="awq"`.

For example:

```python
from vllm import LLM, SamplingParams

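prompts = [
    "What is the mechanism of action of antibiotics?",
    "How do statins work to lower cholesterol levels?",
    "Tell me about Paracetamol",
]

# Sampling settings applied to every prompt.
sampling_params = SamplingParams(temperature=0.8)

# `model` should point at the AWQ weights (this repo's ID or a local path);
# the ID below is kept as written in the source.
llm = LLM(model="axiong/PMC_LLaMA_13B", quantization="awq", dtype="half")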

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt}")
    print(f"Response: {generated_text}")
```