Speed Delta Between MoE and Non-MoE Models
Hello Qwen team and community.
Congrats on a great release!
I've been running some long-context tests lately, with prompts over 20k tokens.
There are some big performance differences between MoE models (this one, Mixtral, and DeepSeek) and non-MoE models (Qwen 7B all the way up to 110B, plus various other architectures).
I'm on a Mac M2 Ultra, using the llama-cpp-python library to run inference.
All the models I test are Q8, with flash attention; they are offloaded to the GPU in their entirety, as are [K, Q, V].
My measurements do not include model loading time, and I use mlock.
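For reference, this is roughly how I load each model. The model path and context size below are placeholders, but the offload / flash-attention / mlock settings match what I described above:

```python
from llama_cpp import Llama

# Rough sketch of my loading config (model path and n_ctx are placeholders).
llm = Llama(
    model_path="models/qwen2-57b-a14b-instruct-q8_0.gguf",  # swapped per model, always a Q8 GGUF
    n_gpu_layers=-1,   # offload every layer to the GPU (Metal on the M2 Ultra)
    flash_attn=True,   # flash attention enabled
    use_mlock=True,    # keep the weights locked in memory
    n_ctx=32768,       # large enough for the >20k-token prompts
    verbose=False,
)
```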
Using the same long prompt, the 57B-A14B model took 621 seconds from start to finish.
On the other hand, the almost double-sized Qwen1.5 110B model took 585 seconds on that same prompt.
Another comparison (though not as apples-to-apples): Qwen2 7B took 42 seconds, Mixtral 8x7B took 139 seconds, and DeepSeek's Coder Lite, a 16B MoE model with only 2.7B active parameters, took 216 seconds.
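In case the methodology matters, here is essentially how each run is timed (the generation settings are placeholders). The model is fully loaded before the timer starts, so loading time is excluded:

```python
import time

# Same >20k-token prompt for every model; loading happens before this point.
prompt = open("long_prompt.txt").read()

start = time.perf_counter()
out = llm.create_completion(prompt, max_tokens=512, temperature=0.0)
elapsed = time.perf_counter() - start

usage = out["usage"]
print(f"{elapsed:.1f}s total "
      f"({usage['prompt_tokens']} prompt tokens, {usage['completion_tokens']} generated)")
```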
Any ideas on what's going on and how it can be addressed?
Thanks, sorry for the long post!