---
license: openrail
model_creator: axiong
model_name: PMC_LLaMA_13B
---

# PMC_LLaMA_13B - AWQ

- Model creator: [axiong](https://huggingface.co/axiong)
- Original model: [PMC_LLaMA_13B](https://huggingface.co/axiong/PMC_LLaMA_13B)

## Description

This repository contains AWQ model files for [PMC_LLaMA_13B](https://huggingface.co/axiong/PMC_LLaMA_13B).

### About AWQ

[Activation-aware Weight Quantization (AWQ)](https://arxiv.org/abs/2306.00978) identifies the small fraction of weights that matter most for LLM performance and protects them during quantization, rather than treating all weights uniformly. This targeted approach minimizes quantization loss, allowing models to run in 4-bit precision with negligible performance degradation.

Example usage with the vLLM library:

```python
from vllm import LLM, SamplingParams

# Alpaca-style prompt template expected by PMC_LLaMA_13B.
prompt_input = (
    '### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:'
)

examples = [
    {
        "instruction": "You're a doctor, kindly address the medical queries according to the patient's account. Answer the question.",
        "input": "What is the mechanism of action of antibiotics?"
    },
    {
        "instruction": "You're a doctor, kindly address the medical queries according to the patient's account. Answer the question.",
        "input": "How do statins work to lower cholesterol levels?"
    },
    {
        "instruction": "You're a doctor, kindly address the medical queries according to the patient's account. Answer the question.",
        "input": "Tell me about Paracetamol"
    }
]

# Fill the template for each example to build the prompt batch.
prompt_batch = [prompt_input.format_map(example) for example in examples]

sampling_params = SamplingParams(temperature=0.8, max_tokens=512)

# Load the AWQ-quantized model; dtype="half" keeps activations in fp16.
llm = LLM(model="disi-unibo-nlp/pmc-llama-13b-awq", quantization="awq", dtype="half")

outputs = llm.generate(prompt_batch, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt}")
    print(generated_text)
```
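
If vLLM is not an option, the same files can also be loaded through 🤗 Transformers. The following is a minimal, untested sketch assuming a recent `transformers` release with AWQ support and the `autoawq` package installed:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "disi-unibo-nlp/pmc-llama-13b-awq"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Transformers detects the AWQ quantization config stored in the checkpoint
# and dispatches to the AWQ kernels (requires the autoawq package).
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Same Alpaca-style template as in the vLLM example above.
prompt = (
    '### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:'
).format_map({
    "instruction": "You're a doctor, kindly address the medical queries according to the patient's account. Answer the question.",
    "input": "Tell me about Paracetamol",
})

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.8)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```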