Mixtral format?

#1
by KnutJaegersberg - opened

Is it possible to reformat this MoE into Mixtral or Llama format?

I would think that would benefit inference speed.

Not really! DeepSeek V2 uses a different architecture (Multi-head Latent Attention plus fine-grained and shared experts), so it can't simply be rewritten as a Mixtral- or Llama-format model.
Please see the paper:
https://arxiv.org/abs/2405.04434

When I try inference locally, Mixtral 8x7B is faster, even though it is bigger.

KnutJaegersberg changed discussion status to closed

@KnutJaegersberg
Can you check what your CPU core usage looks like when inferencing DeepSeek V2 Lite?

When I'm inferencing it on 24 GB VRAM in ooba (transformers loader on Windows) with load_in_4bit=True or load_in_8bit=True (the 16-bit model would OOM), I notice that a single core is at 100%. If this can be replicated, we can assume that for some reason only a single thread is used and it's a bottleneck.
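If anyone wants to try to replicate this outside ooba, here's a minimal sketch of loading the model in 4-bit with plain transformers and timing a short generation while watching per-core usage in Task Manager/htop. The repo id and quantization settings are my assumptions, not the exact ooba configuration:

```python
# Minimal sketch (assumed model id and settings, not the exact ooba setup):
# load DeepSeek V2 Lite in 4-bit and measure rough generation throughput.
import time
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "deepseek-ai/DeepSeek-V2-Lite-Chat"  # assumed repo id

bnb_config = BitsAndBytesConfig(load_in_4bit=True)  # same idea as load_in_4bit=True in ooba

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
start = time.time()
out = model.generate(**inputs, max_new_tokens=64)
elapsed = time.time() - start

n_new = out.shape[-1] - inputs["input_ids"].shape[-1]
print(tokenizer.decode(out[0], skip_special_tokens=True))
print(f"{n_new / elapsed:.1f} tokens/s")  # compare this while watching per-core CPU load
```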

For what it's worth, this issue doesn't occur in llama.cpp and I get a reasonable 67 t/s with the q8_0 quant and no single-core bottleneck. Flash Attention doesn't work there with DeepSeek V2 yet, though.
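For comparison, a rough llama-cpp-python sketch for measuring tokens/s on a q8_0 GGUF; the file path and settings below are placeholders, not my exact setup:

```python
# Rough sketch (assumed path and settings): load a q8_0 GGUF of DeepSeek V2 Lite
# with llama-cpp-python and measure generation throughput.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="DeepSeek-V2-Lite-Chat-Q8_0.gguf",  # placeholder path to the q8_0 quant
    n_gpu_layers=-1,  # offload all layers to the GPU
    n_ctx=4096,
)

start = time.time()
out = llm("Hello, my name is", max_tokens=128)
elapsed = time.time() - start

n_tokens = out["usage"]["completion_tokens"]
print(out["choices"][0]["text"])
print(f"{n_tokens / elapsed:.1f} tokens/s")
```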
