Slow inference

#33
by BigArt - opened

The 40B and 7B model cards say this model is optimized for inference, but it is one of the slowest 7B models I've tried. Maybe I am doing something wrong, or am missing required libraries, but the usage example code runs very slowly on an A100 or RTX 8000. Is this a common problem, or am I doing something wrong?

Slow for me also, on an RTX 3090. Orders of magnitude slower than other 7B models I've tried.
After warming up, other models summarize an article in 2 to 10 seconds. Falcon takes about 2 minutes for the same article.
I double checked that it's using the GPU and tried running a quantized version, but still slow.

@patonw can you please let me know which prompt/parameters you are using for the summarization task? I'm struggling to get a more or less stable and factually correct summary out of 7B models. Thank you

@patonw aren't quantized models always slower than models in float16?

It's really slow for me also

@Sven00 I didn't find any official examples for summarization prompts either, but through trial and error I found this works fairly well:

INSTRUCTIONS:
You are a political analyst for a national newspaper.
Only refer to the provided text and no other sources.
Summarize 5 key facts from the following text as a numbered list.

TEXT:
###
{text}
###

SUMMARY:

However, the model neither numbers the items nor counts them correctly.
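For anyone who wants to reuse the template above programmatically, here is a minimal sketch. The helper name and the parameterized fact count are my own additions, not anything official; the wording is just the prompt shared earlier in the thread.

```python
# Sketch of a helper that fills in the summarization prompt template
# posted above. build_summary_prompt and n_facts are hypothetical
# conveniences, not part of any official Falcon example.

PROMPT_TEMPLATE = """INSTRUCTIONS:
You are a political analyst for a national newspaper.
Only refer to the provided text and no other sources.
Summarize {n_facts} key facts from the following text as a numbered list.

TEXT:
###
{text}
###

SUMMARY:
"""

def build_summary_prompt(text: str, n_facts: int = 5) -> str:
    """Return the filled-in prompt for one summarization request."""
    return PROMPT_TEMPLATE.format(text=text.strip(), n_facts=n_facts)
```

The ### delimiters help the model treat the article as data rather than instructions, though as noted, it still may not number or count the facts reliably.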

@HAvietisov Running quantized is slightly faster for this model on my hardware at least, but not by much.

@patonw what hardware do you use, and what quantization method?
I run int8 quantization via bitsandbytes, with dequantization to float16, on a single RTX 3090.
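For reference, here is roughly how I load the model in int8 via bitsandbytes through the transformers integration. This is a sketch, not the model card's official snippet; the function name is mine, and the imports are kept inside the function so the sketch can be defined without the libraries installed.

```python
def load_falcon_int8(model_id: str = "tiiuae/falcon-7b-instruct"):
    """Sketch: load Falcon with int8 weights via bitsandbytes.

    Requires `transformers`, `accelerate`, and `bitsandbytes` plus a
    CUDA GPU; imports are deferred so merely defining this function
    does not pull in the heavy dependencies.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        load_in_8bit=True,       # int8 weights via bitsandbytes
        device_map="auto",       # let accelerate place layers on the GPU
        trust_remote_code=True,  # Falcon shipped custom modeling code
    )
    return model, tokenizer
```

Note that int8 trades memory for compute: weights are dequantized on the fly during the matmuls, which is why it is not necessarily faster than plain float16.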

Changing torch_dtype=torch.bfloat16 to torch_dtype=torch.float16 in the Getting Started code snippet (removing the "b" before "float") led to a significant speedup on a 16GB vRAM NC4as-v3 machine in databricks running the falcon-7b-instruct model. Hope this helps others, too.
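Concretely, the change amounts to something like the sketch below. This mirrors the Getting Started pipeline with only the dtype swapped; the wrapper function is my own framing, and imports are deferred so the sketch can be defined without torch/transformers installed.

```python
def build_falcon_pipeline(model_id: str = "tiiuae/falcon-7b-instruct"):
    """Sketch: text-generation pipeline with float16 instead of bfloat16.

    Older GPUs (and some cloud SKUs like the V100-backed NC-series)
    lack fast bfloat16 support, so float16 can be much quicker there.
    """
    import torch
    from transformers import AutoTokenizer, pipeline

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    return pipeline(
        "text-generation",
        model=model_id,
        tokenizer=tokenizer,
        torch_dtype=torch.float16,  # was torch.bfloat16 in the snippet
        device_map="auto",
        trust_remote_code=True,
    )
```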

@rustamg thanks for sharing! Any idea how much of an accuracy drop this could cause?

How long do the warmup steps usually take to finish for a full fine-tune?
