Report
Hi, I'm trying to test this model but ran into a few issues:
1) The example code to load and run the model doesn't work as written.
2) After making a few changes I got it working, but inference is very slow, only a few tokens/sec on my RTX 3090. What's your experience? These are the changes I made:
# Original example from the README (did not work for me):
# model = AutoAWQForCausalLM.from_quantized("PrunaAI/gradientai-Llama-3-8B-Instruct-262k-AWQ-4bit-smashed", trust_remote_code=True, device_map='auto')
# Loading with from_pretrained works instead:
model = AutoAWQForCausalLM.from_pretrained("PrunaAI/gradientai-Llama-3-8B-Instruct-262k-AWQ-4bit-smashed", trust_remote_code=True, device_map='auto')

# Original example from the README:
# input_ids = tokenizer("What is the color of prunes?,", return_tensors='pt').to(model.device)["input_ids"]
# Moving the tokenized inputs to 'cuda' explicitly works:
input_ids = tokenizer("What is the color of prunes?,", return_tensors='pt').to('cuda')["input_ids"]
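For reference, here is roughly the full script I ended up with. It's just a sketch of my setup: the base-model tokenizer id is my guess at a compatible tokenizer, since this repo doesn't ship one.

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Quantized model from this repo; from_pretrained works for me where from_quantized did not.
model = AutoAWQForCausalLM.from_pretrained(
    "PrunaAI/gradientai-Llama-3-8B-Instruct-262k-AWQ-4bit-smashed",
    trust_remote_code=True,
    device_map="auto",
)
# Tokenizer borrowed from what I assume is the base model.
tokenizer = AutoTokenizer.from_pretrained("gradientai/Llama-3-8B-Instruct-262k")

# Tokenize on CUDA, generate, and decode.
input_ids = tokenizer("What is the color of prunes?,", return_tensors="pt").to("cuda")["input_ids"]
outputs = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))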
3) Why doesn't the model repo include the tokenizer directly? And what is the process for running this model with Hugging Face's TGI? I tried several ways to borrow a tokenizer from other models and integrate it with TGI, but they all failed.
Hi Richard,
Thanks for notifying us! Indeed, both the model and the tokenized inputs should be on CUDA; we have made that clear in the README. You can use the same tokenizer as the base model. For convenience, we have also added an embedded tokenizer (copied from the base repo) to this repo. We will propagate this change to the other relevant LLM repos.
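For example, the embedded tokenizer can now be loaded straight from this repo (a minimal sketch; the base model's tokenizer loads the same way):

from transformers import AutoTokenizer

# Tokenizer files embedded in this repo (copied from the base model's repo).
tokenizer = AutoTokenizer.from_pretrained("PrunaAI/gradientai-Llama-3-8B-Instruct-262k-AWQ-4bit-smashed")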
Thank you very much! I can now easily integrate the model with TGI.
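For anyone else who lands here, this is roughly my setup, not an official recipe: I launch the TGI Docker image with this repo as the model id and AWQ quantization enabled, then query it from Python like this (the endpoint URL reflects my local port mapping):

from huggingface_hub import InferenceClient

# Query a locally running TGI server; adjust the URL to your own port mapping.
client = InferenceClient("http://localhost:8080")
print(client.text_generation("What is the color of prunes?", max_new_tokens=64))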
That is amazing :)