about 310 tokens/s
Hi, I reproduced LyraChatGLM on my A100 and got a speed of 71 tokens/s.
I wonder under what settings you achieved the 310 tokens/s?
I have tested the speed of LyraChatGLM and the original ChatGLM with batch_size=1 (71 tokens/s and 31 tokens/s respectively).
@lchwhut Short answer: try batch_size = 8
To get full speed, we need to improve computational parallelism so that the available GPU resources are fully used. We modified the original ChatGLM batch-preparation method to make it work correctly under the KV-cache optimization, so in batch mode we run much faster than the original version. (Actually, the original ChatGLM doesn't support batch inference at all; it cannot produce correct results in batched mode.)
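For anyone who wants to try it, here is a minimal sketch of batched generation with lyraChatGLM. The import path, the `LyraChatGLM6B` class, the constructor arguments, and the `generate` keyword arguments below are assumptions for illustration only; please follow the demo script in this repo for the exact API.

```python
# Minimal sketch of batched (batch_size = 8) generation with lyraChatGLM.
# NOTE: the class name, constructor arguments and generate() kwargs are assumed
# for illustration only -- see this repo's demo script for the real signatures.
from lyraChatGLM import LyraChatGLM6B  # assumed import path

model = LyraChatGLM6B(
    model_path="./models/1-gpu-fp16.bin",  # assumed path to the converted weights
    tokenizer_path="./models",             # assumed tokenizer directory
    dtype="fp16",
)

# Run 8 prompts in one forward pass so the GPU is kept busy.
prompts = ["Please introduce yourself briefly."] * 8
outputs = model.generate(prompts, output_length=256, do_sample=False)  # assumed kwargs

for text in outputs:
    print(text)
```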
Hi, I have achieved correct batched (bs=8) inference with the original ChatGLM and got a speed of 137 tokens/s on my A100.
You can see this issue: https://github.com/THUDM/ChatGLM-6B/issues/745
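For reference, the batched run looked roughly like the sketch below, using the standard transformers API. The padding / position-id handling needed for correct batched results is the fix discussed in THUDM/ChatGLM-6B#745, so treat this as an approximation rather than the exact script.

```python
# Rough sketch of batched (bs = 8) generation with the original ChatGLM-6B via
# Hugging Face transformers; only an outline, not the full fix from issue #745.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().cuda().eval()

prompts = ["Please introduce yourself briefly."] * 8          # batch_size = 8
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)

for seq in output_ids:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```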
@lchwhut Good job! We started this project based on an older version and didn't notice this update.
I'll update the README to make this clearer.
@bigmoyan @lchustc I got 70 tokens/s on my A100 with batch_size = 8.
Can you share your demo code with batch_size=8?
Everything is updated. Please try the new version.