Draft Model Performance Metrics and Recommendations
#3 opened by ernestr
I wanted to share some metrics in case folks are curious.
I'm running CohereForAI_c4ai-command-a-03-2025-Q4_K_L.gguf as my main model with c4ai-command-r7b-12-2024-Q4_K_L.gguf as the draft, on six RTX A4000s in a server with PCIe v3. It averages about 9.5 t/s on long completions, and I've seen it jump to 20 t/s on short ones.
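For anyone wanting to reproduce a setup like this, here's a rough sketch of how it could be launched with llama.cpp's server. The flag names match recent llama.cpp builds, but the layer counts, draft limits, and tensor split are placeholder guesses for this six-GPU box, not values from my actual command line:

```shell
# Sketch: Command-A as the target model with Command-R7B as the
# speculative draft, spread across six GPUs.
#   -md / --model-draft : the small draft model
#   -ngl / -ngld        : GPU layers for target / draft model
#   --draft-max         : max tokens the draft proposes per step
#   --tensor-split      : how to shard the target across GPUs
./llama-server \
  -m CohereForAI_c4ai-command-a-03-2025-Q4_K_L.gguf \
  -md c4ai-command-r7b-12-2024-Q4_K_L.gguf \
  -ngl 99 -ngld 99 \
  --draft-max 8 --draft-min 1 \
  --tensor-split 1,1,1,1,1,1
```

Smaller `--draft-max` values are often worth trying on PCIe v3, since every rejected draft token is wasted transfer and compute.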
Is there another model that would be appropriate to run as a draft here? Command-R 7B is the only one I could find with the same tokenizer. I'd love to throw in a 1B or 3B and compare.
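To see why a smaller draft might help, here's a back-of-envelope model based on the standard expected-acceptance analysis of speculative decoding. All the numbers (acceptance rate `p`, relative draft cost `c`) are illustrative assumptions, not measurements from this setup:

```python
def expected_accepted(k: int, p: float) -> float:
    """Expected tokens emitted per target-model pass when the draft
    proposes k tokens and each is accepted independently with
    probability p: (1 - p**(k+1)) / (1 - p)."""
    if p >= 1.0:
        return k + 1.0
    return (1.0 - p ** (k + 1)) / (1.0 - p)

def speedup(k: int, p: float, c: float) -> float:
    """Throughput multiplier vs. plain decoding.  c is the draft's
    per-token cost relative to the target model: one iteration costs
    k*c (drafting) + 1 (target pass) and emits expected_accepted
    tokens on average."""
    return expected_accepted(k, p) / (k * c + 1.0)

# Illustrative: a 7B draft under a ~111B target might cost c ~ 0.06
# per token; a 1B draft more like c ~ 0.01 (assumed, not measured).
for c in (0.06, 0.01):
    print(f"c={c}: estimated speedup at k=5, p=0.7 -> "
          f"{speedup(5, 0.7, c):.2f}x")
```

The takeaway from this toy model: shrinking the draft only buys you much if its acceptance rate stays high, since the target-model pass dominates the cost either way.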
@bartowski thanks again for such great quants.
oh snap that's handy :O didn't know they'd be compatible, nice find!!