Inference time is much longer than reported

#25
by jeff-gao - opened

In the paper, it says the inference speed is < 3 ms per token on a single A100-80G.
However, when I test with the sample code on a single A100-80G, the inference speed is around 28 ms per token. My torch version is 2.0.1.
May I know how to get the inference speed down to around 3 ms per token?
Thank you very much!

jeff-gao changed discussion title from Inference speed is much longer than reported to Inference time is much longer than reported

Did you use DeepSpeed to run inference? I am not sure, but it seems that they used it, as it is mentioned on the model card.

Microsoft org

Hello @jeff-gao !

This mismatch was caused by the absence of Flash-Attention in the model files. We opted not to add it at first to keep the implementation simple, but we plan to add an option that uses it to take advantage of faster inference.
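
Not the official recipe, but once Flash-Attention support is wired in, loading the model with it should look roughly like the sketch below (assuming a `transformers` release that exposes the `attn_implementation` argument for Phi and a working `flash-attn` install; the argument name differs in older versions):

```python
# Rough sketch only, not the official sample code: assumes a transformers
# version that accepts `attn_implementation` for Phi and that the flash-attn
# package is installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-1_5"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,                # fp16 to reduce memory and latency
    attn_implementation="flash_attention_2",  # request the Flash-Attention kernel
).to("cuda")

inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0]))
```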

Hello @gugarosa , thank you very much! Looking forward to your implementations !!!

Hey @jeff-gao - since at fp16 it takes only 3.16 GB of VRAM, can we run roughly 24 copies of Phi-1.5 on an A100-80GB GPU?

If that is possible, and 3 ms per token is also achievable with Flash-Attention, can we generate 7200 tokens (24 copies × 300 tokens per second) per second on an A100-80GB GPU?

I'm a non-technical guy. Just asking out of curiosity. Thanks. 🙏🏼
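
For what it's worth, a back-of-envelope version of that arithmetic, treating the 3.16 GB and 300 tokens/s figures above as assumptions and ignoring KV-cache and activation memory (which reduce how many copies actually fit):

```python
# Back-of-envelope only: per-copy memory and speed are taken from the
# question above; KV cache and activations are ignored.
gpu_memory_gb = 80.0
per_copy_gb = 3.16             # assumed fp16 footprint of one Phi-1.5 copy
tokens_per_sec_per_copy = 300  # i.e. ~3 ms per token

copies = int(gpu_memory_gb // per_copy_gb)        # ~25 by weights alone
aggregate_tps = copies * tokens_per_sec_per_copy  # ~7500 tokens/s in theory
print(copies, aggregate_tps)
```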

gugarosa changed discussion status to closed
