Not able to generate answer from astronomer/Llama-3-8B-Instruct-GPTQ-4-Bit

#1
by ceefour - opened

Hi!

Thank you for uploading this quantized model. It lets me use Llama 3 from Google Colab, which is otherwise not possible because the original model is too big to fit on an Nvidia T4.

I use the following code to load the model and generate text:

from transformers import AutoTokenizer, pipeline
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import torch

model_id = "astronomer/Llama-3-8B-Instruct-GPTQ-4-Bit"

print('Creating QuantizeConfig...')

quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=False,
)

print('Loading Quantized model...')

model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    use_safetensors=True,
    device="cuda:0",
    quantize_config=quantize_config,
)

print('Loading Tokenizer model...')

tokenizer = AutoTokenizer.from_pretrained(model_id)

print('Creating Pipeline...')

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    device_map="auto",
)

prompt = "What is the capital of Indonesia?"

terminators = [
    pipe.tokenizer.eos_token_id,
    pipe.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": prompt},
]

outputs = pipe(messages,
               max_new_tokens=256,
               eos_token_id=terminators,
               do_sample=True,
               temperature=0.6,
               top_p=0.9)
print(outputs[0]["generated_text"][-1])

However, the output is garbled:

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
{'role': 'assistant', 'content': '】,【\x00\x0c】,【\\",\\"gers\x00】,\x00`,{"gorithms`\n\n`.\n\n»\n\n»\n\n`,`},»\n\n`\n\n}\n\n».\n\n`]("><}[],](`\n\n],[`,]]["],"),\n\n」\n\n},{"]\n\n`,}\n\n}\n\n%),»\n\n],}\n\n`\n\n)</」\n\n```\n\n`,`,\n`\n\n`\n\n`.\n\n},{{"`\n\n`.\n\n`,>`{"{"},`,}\n\n</],\n\n{"],"}`>\n\n».\n\n».\n\n)`]\n\n%`,»{"}\n\n"`].\n\n`\n\n}\n\n])\n\n}`,`\n\n`\n\n«],{"`,`\n\n`\n\nAE)`]\n\n}\n\n``,»\n\n`,.\n\n`.\n\n`,`\n\n}\n\n>,"""\n\n]>`,{"`,)``,``»`,)\n\n\n`,{"`,{"}\n\n))\n\n]({"`,]]`\n\n\n\n\n\n\n\n\n\n%,``.\n\n\n\n\n`,\n\n\n`\n\n`\n\n.\n\n\n``\n\n`\n\n]\n\n`,]( "\n\n\n.\n\n\n#](}}`\n\n<<{"))\n\n%)`%\n\n\n.\n\n\n\n\n\n`]`,]]}\n\n.\n\n\n`\n\nD<<``][B»!\n\n>\n\n`\n\n`,]]¢]\n\n`\n\n\n\n``»\n\n\n`\n\n\n\n]]\n\n}\n\n%]][<``]\n\n.\n\n\n\n\n<<=`]`,{}`\n\n\n\n[/`\n\n¢}=)\n\n.\n\n}\n\n}\n\n\n\n](%}<AE}`\n\n``»>\n\n%G``\n\n\n\n\nAE\n\n\n]\n\n"""\n\n)<\n\n\n\n\n`\n\n]][](\n\n#%}>\n\n``>\n\n\n\n»`\n\n`\n\n'}

I also tried a different system message:

    {"role": "system", "content": "You are a helpful assistant."},

with similarly garbled output:

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
{'role': 'assistant', 'content': '\x00】,【\x00`,\x00\x00\\"><},},{\x00`\n\n\\"],」\n\n`.\n\n»\n\n\\",\\"】,【},`](«】,【],[】,【»\n\n},`\n\n](}\n\n»\n\n}[`,``,},"></«"""\n\n]][».\n\n`.\n\n»,%),"`},`]\n\n`,`,»`,`,»`,"],"},]\n\n},}`>``,]{.\n\n`.\n\n},"\n\n].\n\n`\n\n],"},],\n\n],%,},{"%,]]`,`\n\n{}`\n\n%```,.\n\n»]\n\n`,,"»%`,},]\n\n»},.\n\n\n`,}\n\n`\n\n`.\n\n]\n\n},}%``\n\n\n.\n\n\n``.\n\nologists},>`\n\n\n``\n\n<<AE`,`\n\n`,.\n\n`,```,>,""``,»}](`\n\n€`,>\n\n{"F«{"`,}]\n\n`\n\n\n\n\n]\n\n)``\n\n>`»,``}}\n\n\n\n<<\n\n\n]]`.\n\n}\n\n`\n\nG\n\n\n`\n\n},{"]]},`\n\n`,]>`\n\n\n`AE`\n\n\n\n\nAE<<))\n\n\n\n`,<<\n\n<B "\n\n),»\n\n\n\n`}\n\n#F]AE`\n\n}\n\n`,\n\n\n</]][\n\n\n]][],<<],¢\n\n\n=`\n\n,{{\n\n\n}\n\n\n\n\n\n.\n\n<<\x00\n\n\n%]]\n\n\n\n\n\n\n»`A]][.\n\n\n\n\n]]\n\n)\n\n`\n\n\n\n"""\n\n\n\n]]\n\n</{"```\n\n\n@C\n\n\n`\n\n</]]`>\n\n\n\n\n\n\n\n\n'}

I also tried a simple string prompt:

# prompt = "What is a large language model?"
prompt = "What is the capital of Indonesia?"

terminators = [
    pipe.tokenizer.eos_token_id,
    pipe.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = pipe(prompt,
               max_new_tokens=256,
               eos_token_id=terminators,
               do_sample=True,
               temperature=0.6,
               top_p=0.9)
print(outputs[0]["generated_text"][-1])

And the output is a single character:

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
s
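
One possible reading of the single-character output, separate from the garbled generations: for a plain string prompt, the text-generation pipeline returns `generated_text` as a string rather than a list of messages, so `[-1]` selects its last character. A minimal sketch with mocked return values (the strings below are made up for illustration and are not real model output):

```python
# Mocked return shapes of a text-generation pipeline; no model is loaded.
# The contents are hypothetical placeholders, not real generations.

# String prompt: "generated_text" is a plain str (prompt plus continuation).
string_style = [{"generated_text": "What is the capital of Indonesia? Jakarta is"}]

# Chat prompt: "generated_text" is a list of message dicts.
chat_style = [{"generated_text": [
    {"role": "user", "content": "What is the capital of Indonesia?"},
    {"role": "assistant", "content": "Jakarta"},
]}]

# Indexing a str with [-1] yields its last character.
last_char = string_style[0]["generated_text"][-1]

# Indexing the message list with [-1] yields the assistant message.
last_message = chat_style[0]["generated_text"][-1]

print(repr(last_char))
print(last_message)
```

If that is what happened here, printing `outputs[0]["generated_text"]` without the `[-1]` would show the full text for the string prompt.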

Can you suggest what I did wrong?

Let me try to see if I can reproduce this on my end. However, I think this may be a known bug between the AutoGPTQ library and Hugging Face transformers. I think their integration broke at some point around the 4.40 release of transformers.

Can you leave a comment in this GitHub issue describing what happened? https://github.com/AutoGPTQ/AutoGPTQ/issues/657 Someone familiar with both transformers and AutoGPTQ would need to do a deep dive to resolve this. The more people comment on the GitHub issue, the more likely the main maintainers are to help resolve it.

In the meantime, I would suggest loading the model in vLLM. Any serving engine that doesn't directly use Hugging Face transformers under the hood for generation should work.
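
For reference, here is a rough sketch of what the vLLM route could look like. It has not been verified against this model or on a T4. It assumes a recent vLLM release that provides `LLM.chat` (older releases would need the prompt formatted manually and passed to `LLM.generate`), and `build_messages`, `SAMPLING_KWARGS`, and `generate_with_vllm` are hypothetical helper names introduced only for this example. The sampling values mirror the ones in the question.

```python
from typing import Dict, List

MODEL_ID = "astronomer/Llama-3-8B-Instruct-GPTQ-4-Bit"

# Same sampling settings as the pipeline call in the question.
SAMPLING_KWARGS = {"temperature": 0.6, "top_p": 0.9, "max_tokens": 256}


def build_messages(prompt: str) -> List[Dict[str, str]]:
    """Build the chat messages (hypothetical helper for this sketch)."""
    return [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt},
    ]


def generate_with_vllm(prompt: str) -> str:
    """Generate a reply with vLLM instead of the transformers pipeline."""
    # Imported lazily so the helpers above work without vLLM installed.
    from vllm import LLM, SamplingParams

    # Assumption: float16 is requested because the T4 lacks bfloat16 support.
    llm = LLM(model=MODEL_ID, quantization="gptq", dtype="float16")

    # Stop on both the EOS token and Llama 3's end-of-turn token.
    tokenizer = llm.get_tokenizer()
    stop_ids = [
        tokenizer.eos_token_id,
        tokenizer.convert_tokens_to_ids("<|eot_id|>"),
    ]
    params = SamplingParams(stop_token_ids=stop_ids, **SAMPLING_KWARGS)

    # LLM.chat applies the model's chat template before generating.
    outputs = llm.chat(build_messages(prompt), params)
    return outputs[0].outputs[0].text


if __name__ == "__main__":
    print(generate_with_vllm("What is the capital of Indonesia?"))
```

Running the script requires a GPU and `pip install vllm`; the first call downloads the model weights.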
