about 310 tokens/s
Hi, I reproduced LyraChatGLM on my A100 and got a speed of 71 tokens/s.
I wonder under what settings you achieved the 310 tokens/s?
I have tested the speed of LyraChatGLM and the original ChatGLM with batch_size=1 (71 tokens/s and 31 tokens/s respectively).
@lchwhut Short answer: try batch_size = 8
To get full speed, we need to improve computational parallelism so that the available GPU resources are fully used. We modified the original ChatGLM batch-preparation method to make it work correctly under the KV-cache optimization, so in batch mode we run much faster than the original version. (Actually, the original ChatGLM doesn't support batch inference at all; it cannot produce correct results in batched mode.)
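For anyone who wants to try it, here is a minimal sketch of batched generation with lyraChatGLM. The import path, the `LyraChatGLM6B` class, the constructor arguments, and the `generate` keyword arguments below are assumptions for illustration only; please follow the demo script in this repo for the exact API.

```python
# Minimal sketch of batched (batch_size = 8) generation with lyraChatGLM.
# NOTE: the class name, constructor arguments and generate() kwargs are assumed
# for illustration only -- see this repo's demo script for the real signatures.
from lyraChatGLM import LyraChatGLM6B  # assumed import path

model = LyraChatGLM6B(
    model_path="./models/1-gpu-fp16.bin",  # assumed path to the converted weights
    tokenizer_path="./models",             # assumed tokenizer directory
    dtype="fp16",
)

# Run 8 prompts in one forward pass so the GPU is kept busy.
prompts = ["Please introduce yourself briefly."] * 8
outputs = model.generate(prompts, output_length=256, do_sample=False)  # assumed kwargs

for text in outputs:
    print(text)
```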
Hi, I have achieved correct batched (bs=8) inference with the original ChatGLM and got a speed of 137 tokens/s on my A100.
You can see this issue: https://github.com/THUDM/ChatGLM-6B/issues/745
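For reference, the batched run looked roughly like the sketch below, using the standard transformers API. The padding / position-id handling needed for correct batched results is the fix discussed in THUDM/ChatGLM-6B#745, so treat this as an approximation rather than the exact script.

```python
# Rough sketch of batched (bs = 8) generation with the original ChatGLM-6B via
# Hugging Face transformers; only an outline, not the full fix from issue #745.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().cuda().eval()

prompts = ["Please introduce yourself briefly."] * 8          # batch_size = 8
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)

for seq in output_ids:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```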
@lchwhut Good job! We started this project based on an older version and didn't notice this update.
I'll update the README to make this clearer.
@bigmoyan @lchustc I got 70 tokens/s on my A100 with batch_size = 8.
Can you share your demo code with batch_size=8?
Everything is updated. Please try the new version.