This model was quantized with the AutoAWQ package, using tctsung/chat_restaurant_recommendation as the calibration dataset.

Reference model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
Key results:
- AWQ quantization yielded a 1.62x inference speedup, generating 140.47 new tokens per second.
- Model size shrank from 4.4 GB to 0.78 GB, reducing the memory footprint to only 17.57% of the original model.
- I evaluated 6 different LLM tasks to show that the quantized model maintains comparable accuracy, with a maximum degradation of only ~1%.
For more details, see the GitHub repo tctsung/LLM_quantize.
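For reference, a quantization run along these lines can be reproduced with AutoAWQ roughly as follows. This is a minimal sketch, not the exact script from the repo: the `quant_config` values and the way calibration texts are pulled from tctsung/chat_restaurant_recommendation (the `"text"` field) are assumptions.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
from datasets import load_dataset

base_model = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
quant_path = "TinyLlama-1.1B-chat-v1.0-awq"

# typical 4-bit AWQ settings (assumed, not taken from the repo)
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# load the FP16 reference model and its tokenizer
model = AutoAWQForCausalLM.from_pretrained(base_model)
tokenizer = AutoTokenizer.from_pretrained(base_model)

# build calibration texts from the chat dataset (the "text" field name is an assumption)
calib_data = load_dataset("tctsung/chat_restaurant_recommendation", split="train")
calib_texts = [row["text"] for row in calib_data]

# run AWQ calibration + quantization, then save the quantized weights
model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_texts)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```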
Inference tutorial:
```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

# load model & tokenizer:
model_id = "tctsung/TinyLlama-1.1B-chat-v1.0-awq"
model = LLM(model=model_id, dtype='half',
            quantization='awq', gpu_memory_utilization=0.9)
sampling_params = SamplingParams(temperature=1.0,
                                 max_tokens=1024,
                                 min_p=0.5,
                                 top_p=0.85)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# define your own sys & user msg:
sys_msg = "..."
user_msg = "..."
chat_msg = [
    {"role": "system", "content": sys_msg},
    {"role": "user", "content": user_msg}
]

# apply the chat template; add_generation_prompt=True appends the assistant turn
# so the model starts generating a response
input_text = tokenizer.apply_chat_template(chat_msg, tokenize=False, add_generation_prompt=True)
output = model.generate(input_text, sampling_params)
output_text = output[0].outputs[0].text
print(output_text)  # show the model output
```
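A rough way to reproduce the tokens-per-second figure above, continuing from the code in the tutorial. This is a minimal sketch; the exact number depends on your GPU, prompt length, and vLLM version.

```python
import time

# time a single generation and count the new tokens vLLM produced
start = time.perf_counter()
output = model.generate(input_text, sampling_params)
elapsed = time.perf_counter() - start

n_new_tokens = len(output[0].outputs[0].token_ids)
print(f"{n_new_tokens / elapsed:.2f} new tokens/sec")
```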