The f16 with 32k ctx fits nicely in 24GB VRAM
Just what the title says. In early, limited testing, the f16 model at the full 32k ctx fully offloads on my 3090 Ti with 24GB VRAM, giving ~50 tok/s inference. Most impressive 8B I've personally tried so far!
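Something like the following should reproduce the full offload (a minimal sketch only, assuming llama-cpp-python as the runtime; the GGUF filename is just a placeholder):

```python
from llama_cpp import Llama

# Full GPU offload of the f16 GGUF with a 32k context window.
llm = Llama(
    model_path="model-f16.gguf",  # placeholder filename
    n_gpu_layers=-1,              # offload every layer to the GPU
    n_ctx=32768,                  # full 32k context
)

out = llm("Write a haiku about VRAM.", max_tokens=64)
print(out["choices"][0]["text"])
```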
This is great! I wouldn't have thought the 32k context in f16 could fit in 24GB VRAM! Thanks for sharing, it helps others calculate how much they can offload to the GPU.
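For reference, here's a rough back-of-the-envelope estimate of why it fits (a sketch; the architecture numbers are my assumption based on Llama-3-8B, and real usage adds compute/scratch buffers on top):

```python
# Assumed Llama-3-8B architecture: ~8.03B params, 32 layers,
# 8 KV heads (GQA), head dim 128, f16 weights and f16 KV cache.
params          = 8.03e9
bytes_per_param = 2        # f16 weights
n_layers        = 32
n_kv_heads      = 8
head_dim        = 128
n_ctx           = 32768
bytes_per_elem  = 2        # f16 KV cache

weights_gb = params * bytes_per_param / 1e9
# K and V each store n_kv_heads * head_dim values per layer per token.
kv_cache_gb = n_ctx * n_layers * n_kv_heads * head_dim * 2 * bytes_per_elem / 1e9

print(f"weights:  ~{weights_gb:.1f} GB")   # ~16.1 GB
print(f"KV cache: ~{kv_cache_gb:.1f} GB")  # ~4.3 GB
print(f"total:    ~{weights_gb + kv_cache_gb:.1f} GB + compute buffers")
```

That lands around 20.4 GB before overhead, which leaves some headroom on a 24GB card.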
If the model is unaligned then it is probably the best one, because I tried dolphin-2.9 and that is horribly slow. Normally my machine outputs 2.3 tokens per second (on CPU), but dolphin was doing 0.3 tokens per second with 4-bit quants, and the context size wasn't very long either (just 512, or perhaps even shorter), yet it was still outputting tokens at the slowest rate. I hope this one is normal or even better for speed.
@supercharge19
out of curiosity, since you seem more knowledgeable on the matter, isn't dolphin-2.9 also a fine-tune of Llama-3-8B? Is it possible to get different speeds from fine-tunes of the same base model (chat template, fine-tuning technique, etc.)?
Don't be humble, please. Dolphin-2.9 is indeed fine-tuned on the same base model (llama-3-8b), but I've heard that (for Mistral) some fine-tunes ended up slower than the original, or at least the usable context window got shorter, i.e. quality suffered at the same context length. I don't think other models (like Llama-based ones) would be any different.
Downloaded this one; will test it and return with results later.
Speed is good at 2.23 tokens per second, but generation quality sucks (actually there is a minor fault; opening a separate discussion for that now).