---
license: openrail
model_creator: axiong
model_name: PMC_LLaMA_13B
---

# PMC_LLaMA_13B - AWQ

- Model creator: [axiong](https://huggingface.co/axiong)
- Original model: [PMC_LLaMA_13B](https://huggingface.co/axiong/PMC_LLaMA_13B)

## Description

This repo contains AWQ model files for [PMC_LLaMA_13B](https://huggingface.co/axiong/PMC_LLaMA_13B).

### About AWQ

AWQ is an efficient and accurate low-bit weight quantization method that currently supports 4-bit quantization. Compared to GPTQ, it offers faster Transformers-based inference, with quality equivalent to or better than the most commonly used GPTQ settings.

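To give a rough sense of what 4-bit weights mean for memory, here is a back-of-the-envelope sketch. It assumes a nominal 13 billion parameters (taken from the "13B" in the model name) and counts weight storage only. The helper `weight_memory_gb` is hypothetical, written for this illustration. The estimate ignores the per-group scales and zero points that a quantized checkpoint also stores, any layers kept in higher precision, and runtime memory such as activations and the KV cache, so real figures will be somewhat higher.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumption: a nominal 13e9 parameters, inferred from "13B" in the model name.
NUM_PARAMS = 13e9

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)  # unquantized half-precision weights
int4_gb = weight_memory_gb(NUM_PARAMS, 4)   # 4-bit quantized weights

print(f"16-bit weights: ~{fp16_gb:.1f} GB")
print(f"4-bit weights:  ~{int4_gb:.1f} GB")
```

Under these assumptions the weights shrink from roughly 26 GB to roughly 6.5 GB, a 4x reduction before quantization metadata is counted.
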
- When using vLLM from Python code, set `quantization="awq"`.

For example:

```python
from vllm import LLM, SamplingParams

# Example medical prompts to run as a single batch.
prompts = [
    "What is the mechanism of action of antibiotics?",
    "How do statins work to lower cholesterol levels?",
    "Tell me about Paracetamol",
]

# Sampling settings for generation.
sampling_params = SamplingParams(temperature=0.8)

# Load the model with AWQ quantization in half precision.
# Note: `model` must point to a repository that contains AWQ-quantized weights.
llm = LLM(model="axiong/PMC_LLaMA_13B", quantization="awq", dtype="half")

# Generate one completion per prompt.
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt}")
    print(f"Response: {generated_text}")
```