Draft Model Performance Metrics and Recommendations

#3 opened by ernestr

I wanted to share some metrics in case folks are curious.

I'm running CohereForAI_c4ai-command-a-03-2025-Q4_K_L.gguf as my main model with c4ai-command-r7b-12-2024-Q4_K_L.gguf as the draft on 6 RTX A4000s on a server with PCIe v3. It's averaging about 9.5 t/s for long completions. I've seen it jump to 20 t/s for short ones.
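For reference, the launch looks something along these lines (a minimal sketch wrapped in Python; the paths, context size, and draft parameters are placeholders, and the exact speculative-decoding flag names vary between llama.cpp builds, so check `llama-server --help` on your version):

```python
import subprocess

# Sketch of a llama-server invocation with a separate draft model.
# Flag names are based on recent llama.cpp builds and may differ on older ones
# (e.g. a single --draft N instead of --draft-max/--draft-min).
cmd = [
    "./llama-server",
    "-m",    "CohereForAI_c4ai-command-a-03-2025-Q4_K_L.gguf",  # target model
    "-md",   "c4ai-command-r7b-12-2024-Q4_K_L.gguf",            # draft model
    "-ngl",  "99",          # offload all target layers to the GPUs
    "-ngld", "99",          # offload all draft layers too
    "-c",    "8192",        # context size (placeholder)
    "--draft-max", "16",    # speculate up to 16 tokens per step
    "--draft-min", "4",     # give up early if few draft tokens look useful
    "--port", "8080",
]
subprocess.run(cmd, check=True)
```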

Is there another model that would be appropriate to run as a draft here? Command R7B is the only one I could find with the same tokenizer. I'd love to throw in a 1B or 3B and compare.
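To get a feel for what a smaller, faster draft might buy, here's a back-of-the-envelope estimate using the standard speculative-decoding expectation (per Leviathan et al. 2023, assuming an independent per-token acceptance rate). The latencies and acceptance rates below are illustrative guesses, not measurements from my box:

```python
# Rough speculative-decoding throughput estimate.
# All numbers are illustrative, not measured on this setup.

def expected_accepted(alpha: float, gamma: int) -> float:
    """Expected tokens produced per verification step, given per-token
    acceptance rate alpha and draft length gamma (includes the target's
    own token at the end of the step)."""
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

def throughput(t_target: float, t_draft: float, alpha: float, gamma: int) -> float:
    """Tokens per second: expected accepted tokens divided by the time to
    draft gamma tokens plus one target verification pass."""
    step_time = gamma * t_draft + t_target
    return expected_accepted(alpha, gamma) / step_time

# Guessed per-token latencies: ~0.2 s for the big target (~5 t/s baseline),
# ~0.02 s for a small draft. Higher alpha = draft agrees with target more.
for alpha in (0.5, 0.7, 0.9):
    print(f"alpha={alpha:.1f}: {throughput(0.2, 0.02, alpha, gamma=8):.1f} t/s")
```

With those guessed numbers the estimate spans roughly 5 to 17 t/s depending on how often the draft's tokens get accepted, which lines up with seeing ~9.5 t/s on long completions and ~20 t/s on short, more predictable ones. A 1B or 3B draft would shrink `t_draft` but probably lower `alpha`, so the comparison is worth actually measuring.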

@bartowski thanks again for such great quants.

oh snap that's handy :O didn't know they'd be compatible, nice find!!
