mixtral format?

by KnutJaegersberg - opened

Is it possible to reformat this MoE into Mixtral or Llama format?

I would think that would benefit inference speed.


Not really!
Please view the paper:

When I run inference locally, Mixtral 8x7B is faster, even though it is the bigger model.

KnutJaegersberg changed discussion status to closed

Can you check what your CPU core usage looks like when running inference with DeepSeek V2 Lite?

When I run inference on 24 GB of VRAM in ooba (transformers loader on Windows) with load_in_4bit=True or load_in_8bit=True (the 16-bit model would OOM), I notice that a single core sits at 100%. If this can be replicated, it suggests that for some reason only a single thread is used and that it's the bottleneck.
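If anyone wants to check this on their side, here is a rough sketch for spotting the "one core pinned, the rest idle" pattern while a generation is running. It assumes the third-party psutil package is installed; the helper names and the 90% / 30% thresholds are my own arbitrary choices, not anything from ooba or transformers.

```python
from typing import List, Sequence


def looks_single_core_bound(per_core: Sequence[float],
                            busy: float = 90.0,
                            idle: float = 30.0) -> bool:
    """Return True when exactly one core is saturated and all others are mostly idle.

    The busy/idle thresholds (in percent) are illustrative assumptions.
    """
    if len(per_core) < 2:
        return False  # cannot tell anything on a single-core machine
    hot = [p for p in per_core if p >= busy]
    # Drop the single highest reading and look at the remaining cores.
    rest = sorted(per_core, reverse=True)[1:]
    return len(hot) == 1 and max(rest) <= idle


def sample_per_core(seconds: float = 1.0) -> List[float]:
    """Average per-core utilisation over `seconds`, using psutil (third-party)."""
    import psutil  # assumption: installed via `pip install psutil`
    return psutil.cpu_percent(interval=seconds, percpu=True)


if __name__ == "__main__":
    # Start a generation in the loader first, then run this script alongside it.
    try:
        usage = sample_per_core(1.0)
    except ImportError:
        print("psutil is not installed; skipping the live sample")
    else:
        print("per-core %:", usage)
        print("single-core bound?", looks_single_core_bound(usage))
```

A single one-second sample can be noisy, so it is worth running it a few times during a longer generation before drawing conclusions.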

For what it's worth, this issue doesn't occur in llama.cpp: I get a sensible 67 t/s with a q8_0 quant and no single-core bottleneck. Flash Attention doesn't work there yet with DeepSeek V2, though.
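To compare loaders on equal terms, a t/s figure can be measured the same way everywhere. A minimal sketch, assuming you wrap your own generation call in a function that returns the number of newly generated tokens; `tokens_per_second` and its `generate` argument are hypothetical names, not part of llama.cpp or transformers.

```python
import time
from typing import Callable


def tokens_per_second(generate: Callable[[], int],
                      clock: Callable[[], float] = time.perf_counter) -> float:
    """Time one call to `generate` and return generated tokens per second.

    `generate` is a caller-supplied function (hypothetical here) that runs a
    generation and returns how many new tokens it produced.
    """
    start = clock()
    n_tokens = generate()
    elapsed = clock() - start
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return n_tokens / elapsed
```

Note that this times the whole call, so prompt processing is included; use a short prompt and a long generation if you only care about decode speed.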
