How long does it take to run inference on one sample?

#48
by andreaKIM

Hello, I am testing this backbone model before fine-tuning on my custom dataset. Before I start, I just wanted to check how long it takes to generate one prediction with the GPTQ model. Since this is my first time using a GPTQ model, I am stuck on a runtime error related to a dependency. Below is my test script.

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import time

model_name_or_path = "TheBloke/Llama-2-13B-chat-GPTQ"
# To use a different branch, change revision
# For example: revision="main"
model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
                                             device_map="auto",
                                             trust_remote_code=False,
                                             revision="main")

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

prompt = "Tell me about AI"
prompt_template=f'''[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>
{prompt}[/INST]

'''

print("\n\n*** Generate:")

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
start_time = time.time()
output = model.generate(inputs=input_ids, temperature=0.7, do_sample=True, top_p=0.95, top_k=40, max_new_tokens=512)
print(tokenizer.decode(output[0]))
print("total time:",time.time()-start_time)
print("number of token generated:",len(tokenizer.decode(output[0]).split(" ")))
# Inference can also be done using transformers' pipeline

print("*** Pipeline:")
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    top_k=40,
    repetition_penalty=1.1
)
# Start the timer after the pipeline is built, and run generation only once
start_time_p = time.time()
result = pipe(prompt_template)[0]['generated_text']
print(result)
print("total time:", time.time() - start_time_p)
# generated_text includes the prompt, so re-tokenize and subtract the prompt length to (roughly) count new tokens
print("number of tokens generated:", len(tokenizer(result).input_ids) - input_ids.shape[-1])

Unfortunately, whenever I run this code I get an ImportError like this.

CUDA extension not installed.
CUDA extension not installed.
exllama_kernels not installed.
Traceback (most recent call last):
...(skip)
    from exllama_kernels import make_q4, q4_matmul
ImportError: libcudart.so.12: cannot open shared object file: No such file or directory
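
My guess (only an assumption) is a CUDA runtime mismatch: libcudart.so.12 is the CUDA 12 runtime library, so the prebuilt exllama/auto-gptq kernels appear to have been compiled against CUDA 12 while my environment does not provide it (for example, a PyTorch build for CUDA 11.x). A minimal sketch to compare the versions would be:

import torch

# CUDA runtime PyTorch was built against (e.g. '11.8' vs '12.1'); if this does not
# match the CUDA version the auto-gptq/exllama wheels were built for, their
# compiled kernels fail to import like in the traceback above
print("torch:", torch.__version__, "built with CUDA", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())

import auto_gptq
print("auto-gptq:", auto_gptq.__version__)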

I guess I am missing a simple point, but I have no idea how to fix it.
Can anybody help with this problem? Or, I would just like to know how fast GPTQ models are. Thanks :)
