
PMC_LLaMA_13B - AWQ

Description

This repository contains AWQ model files for PMC_LLaMA_13B.

About AWQ

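Activation-aware Weight Quantization (AWQ) is a weight-only quantization method for LLMs. It uses activation statistics to identify the small fraction of weight channels that matter most for output quality and protects them with per-channel scaling before the weights are quantized, instead of treating all weights equally. This targeted approach keeps quantization error low, allowing models to run with 4-bit weights with little loss in accuracy.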
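
The idea can be illustrated with a small NumPy sketch. This is a toy simulation under stated assumptions, not the AWQ implementation or the procedure used to produce these model files: the helper names `quantize_4bit` and `awq_style_quantize`, the per-row quantization grid, and the fixed exponent `alpha=0.5` are all chosen for illustration only (AWQ itself searches for the scaling per layer and quantizes in groups).

```python
import numpy as np


def quantize_4bit(w: np.ndarray) -> np.ndarray:
    """Simulate asymmetric 4-bit quantization with one grid per output row.

    Returns the dequantized weights, so each row holds at most 16 distinct values.
    """
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    step = np.maximum(w_max - w_min, 1e-8) / 15.0  # 16 levels -> 15 steps
    codes = np.clip(np.round((w - w_min) / step), 0, 15)
    return codes * step + w_min


def awq_style_quantize(w: np.ndarray, x: np.ndarray, alpha: float = 0.5):
    """Scale input channels by activation magnitude, then quantize.

    w: (out_features, in_features) weights; x: (n_samples, in_features) activations.
    alpha is an illustrative fixed exponent (an assumption of this sketch).
    """
    act_mag = np.abs(x).mean(axis=0)             # per-input-channel activation size
    s = np.maximum(act_mag, 1e-8) ** alpha       # larger scale for salient channels
    w_q = quantize_4bit(w * s)                   # quantize the scaled weights
    return w_q, s


rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16))
x = rng.normal(size=(32, 16))
x[:, 3] *= 20.0  # make one input channel salient

w_q, s = awq_style_quantize(w, x)

y_ref = x @ w.T                      # full-precision output
y_awq = (x / s) @ w_q.T              # inverse scale is applied to the activations
y_plain = x @ quantize_4bit(w).T     # quantization without activation awareness

print("mean abs error, plain 4-bit:    ", np.abs(y_ref - y_plain).mean())
print("mean abs error, activation-aware:", np.abs(y_ref - y_awq).mean())
```

Before rounding, multiplying a weight channel by `s` and dividing the matching activation channel by `s` leaves the layer output unchanged. The scaling only changes how rounding error is distributed, which is why it can protect the channels that see large activations.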

Example usage with the vLLM library:

from vllm import LLM, SamplingParams

prompt_input = (
    '### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:'
)
 
examples = [
    {
      "instruction": "You're a doctor, kindly address the medical queries according to the patient's account. Answer the question.",
      "input": "What is the mechanism of action of antibiotics?"
    },
    {
      "instruction": "You're a doctor, kindly address the medical queries according to the patient's account. Answer the question.",
      "input": "How do statins work to lower cholesterol levels?"
    },
    {
      "instruction": "You're a doctor, kindly address the medical queries according to the patient's account. Answer the question.",
      "input": "Tell me about Paracetamol"
    }
]
 
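# Fill the prompt template once per example.
prompt_batch = [prompt_input.format_map(example) for example in examples]

# Sample with temperature 0.8 and cap each completion at 512 generated tokens.
sampling_params = SamplingParams(temperature=0.8, max_tokens=512)

# Load the AWQ-quantized checkpoint and run it in half precision (FP16).
llm = LLM(model="disi-unibo-nlp/pmc-llama-13b-awq", quantization="awq", dtype="half")

# Generate completions for the whole batch in a single call.
outputs = llm.generate(prompt_batch, sampling_params)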

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt}")
    print(generated_text)

Safetensors model size: 2.03B params. Tensor types: I32, FP16.