disi-unibo-nlp
/

pmc-llama-13b-awq

	---
	license: openrail
	model_creator: axiong
	model_name: PMC_LLaMA_13B
	---
	# PMC_LLaMA_13B - AWQ
	- Model creator: [axiong](https://huggingface.co/axiong)
	- Original model: [PMC_LLaMA_13B](https://huggingface.co/axiong/PMC_LLaMA_13B)

	## Description

	This repo contains AWQ model files for [PMC_LLaMA_13B](https://huggingface.co/axiong/PMC_LLaMA_13B).

	### About AWQ

	AWQ is an efficient, accurate, and fast low-bit weight quantization method that currently supports 4-bit quantization. It offers faster Transformers-based inference than GPTQ, with quality equivalent to or better than the most commonly used GPTQ settings.
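
As a rough illustration of what 4-bit weight quantization involves, the sketch below applies plain round-to-nearest quantization to one small group of weights. This is not AWQ itself: AWQ additionally rescales salient weight channels using activation statistics before quantizing. The function names and sample weights are hypothetical and chosen only for this example.

```python
def quantize_group(weights, bits=4):
    """Asymmetric round-to-nearest quantization of one group of floats.

    Illustrative only: plain RTN, without AWQ's activation-aware scaling.
    """
    qmax = (1 << bits) - 1               # 15 distinct steps above zero for 4-bit
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax or 1.0      # fall back to 1.0 for a constant group
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, zero):
    """Map integer codes back to approximate float weights."""
    return [c * scale + zero for c in codes]


# Hypothetical weights standing in for one quantization group.
weights = [0.12, -0.40, 0.33, 0.05, -0.21, 0.48, -0.07, 0.19]

codes, scale, zero = quantize_group(weights)
restored = dequantize_group(codes, scale, zero)
max_err = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:", codes)
print(f"max abs error: {max_err:.4f}")
```

Each weight is stored as an integer code in the range 0 to 15, plus one shared scale and offset per group, so the rounding error per weight is bounded by half the scale.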

	- When using vLLM from Python code, set `quantization="awq"`.

	For example:

	```python
	from vllm import LLM, SamplingParams

	prompts = [
	    "What is the mechanism of action of antibiotics?",
	    "How do statins work to lower cholesterol levels?",
	    "Tell me about Paracetamol",
	]


	sampling_params = SamplingParams(temperature=0.8)

	# Point vLLM at the AWQ files in this repo rather than the original checkpoint.
	llm = LLM(model="disi-unibo-nlp/pmc-llama-13b-awq", quantization="awq", dtype="half")

	outputs = llm.generate(prompts, sampling_params)

	# Print the outputs.
	for output in outputs:
	    prompt = output.prompt
	    generated_text = output.outputs[0].text
	    print(f"Prompt: {prompt}")
	    print(f"Response: {generated_text}")
	```