---
license: openrail
model_creator: axiong
model_name: PMC_LLaMA_13B
---
# PMC_LLaMA_13B - AWQ
- Model creator: [axiong](https://huggingface.co/axiong)
- Original model: [PMC_LLaMA_13B](https://huggingface.co/axiong/PMC_LLaMA_13B)
## Description
This repo contains AWQ model files for [PMC_LLaMA_13B](https://huggingface.co/axiong/PMC_LLaMA_13B).
### About AWQ
AWQ is an efficient, accurate and blazing-fast low-bit weight quantization method, currently supporting 4-bit quantization. It offers faster Transformers-based inference with quality equivalent to or better than the most commonly used GPTQ settings.
- When using vLLM from Python code, pass `quantization="awq"` when constructing the `LLM`.
For example:
```python
from vllm import LLM, SamplingParams

# Example biomedical prompts.
prompts = [
    "What is the mechanism of action of antibiotics?",
    "How do statins work to lower cholesterol levels?",
    "Tell me about Paracetamol",
]

# Sampling settings for generation.
sampling_params = SamplingParams(temperature=0.8)

# Load the AWQ-quantized weights from this repo with vLLM.
llm = LLM(model="alecocc/pmc-llama-13b-awq", quantization="awq", dtype="half")

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt}")
    print(f"Response: {generated_text}")
```
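AWQ checkpoints can also be loaded directly through 🤗 Transformers (with the `autoawq` package installed). Below is a minimal sketch, not taken from the original card, assuming the quantized weights are published in this repo under the id `alecocc/pmc-llama-13b-awq`:

```python
# Minimal sketch (assumption): loading the AWQ checkpoint with Transformers.
# Requires `autoawq` to be installed; the repo id below is assumed.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "alecocc/pmc-llama-13b-awq"  # assumed repo id for these AWQ files
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Generate a short answer for a single biomedical prompt.
inputs = tokenizer("Tell me about Paracetamol", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```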