Reasoning Models
At a reasonable price, I can rent 3 A40s and test a Q6_K 9b model in ~1.2 minutes and a 72b model in ~9.2 minutes. To keep eval times down, many of the test prompts have the LLM answer multiple questions in a single response and instruct it not to say anything except its answers. Letting models explain each answer in a couple of paragraphs, like they normally would, could multiply the testing time by 4-5x. This would be an especially big issue with the political test, which has 288 questions.
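For illustration, here's a minimal sketch of what that kind of multi-question, answer-only prompt could look like (the question text and answer format are made up, not the actual eval prompts):

```python
# Hypothetical sketch of a multi-question, answer-only prompt.
# The questions and the exact format are placeholders for illustration.
questions = [
    "1. Is the sky blue? (yes/no)",
    "2. Is 7 a prime number? (yes/no)",
    "3. Does water boil at 50 C at sea level? (yes/no)",
]

prompt = (
    "Answer each question below. Respond ONLY with the question number "
    "followed by your answer, one per line. Do not explain your answers.\n\n"
    + "\n".join(questions)
)
```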
I don't see how I could feasibly benchmark reasoning models. Even allowing them to think for just 2 paragraphs would likely be too time-consuming given the total number of questions across all of the benchmarks, and most reasoning models think for a lot longer than that, sometimes as much as 5,000 words (around 30-40 paragraphs). Testing a single reasoning model on all of the questions would either take many hours or require paying a lot more for better GPUs. There's also the question of how long to let a model think before deciding it's never going to stop.
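One crude option, if it ever becomes worth supporting, would be a hard token cap on the thinking phase. A minimal sketch with llama-cpp-python (the model path is a placeholder, and the `</think>` stop string is an assumption that varies by model):

```python
from llama_cpp import Llama

# Placeholder model path; many reasoning models wrap their chain of thought
# in tags like <think>...</think>, but the exact tag is model-specific.
llm = Llama(model_path="models/reasoning-model-q6_k.gguf", n_ctx=8192)

prompt = "Is 97 a prime number? Think it through, then answer yes or no."

out = llm.create_completion(
    prompt,
    max_tokens=2048,     # hard cap so the model can't think forever
    stop=["</think>"],   # or cut off at the end-of-thinking tag, if the model emits one
)
print(out["choices"][0]["text"])
```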
If anyone has a solution to this, I’d love to hear it, but as of now the leaderboard doesn’t really support reasoning models.
Seems I might be able to reduce eval times enough by switching over from llama-cpp-python to Tabby or Aphrodite to take advantage of batching. Wish I had looked into this sooner.
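Both expose an OpenAI-compatible HTTP API, so the eval side could look something like this sketch, where concurrent requests let the server batch them on the GPU (the port, route, and model name are placeholders, not the actual setup):

```python
import concurrent.futures
import requests

# Placeholder endpoint and model name for an OpenAI-compatible server.
API_URL = "http://localhost:5000/v1/completions"
MODEL = "my-local-model"

def complete(prompt: str) -> str:
    resp = requests.post(API_URL, json={
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": 256,
        "temperature": 0,
    })
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]

prompts = [f"Question {i}: ..." for i in range(32)]

# Fire requests concurrently; the server batches them internally.
with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
    answers = list(pool.map(complete, prompts))
```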
Maybe I'm a bit late, but you could also try vLLM or SGLang. SGLang was co-developed with the DeepSeek guys to help with distributed inference, and in my experience it's impressively fast, but unlike vLLM it doesn't support GGUF.
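With vLLM, offline batched generation over a whole question set is a few lines; a minimal sketch (the model name is just an example):

```python
from vllm import LLM, SamplingParams

# Example model name; vLLM batches the entire prompt list in one generate() call.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")
params = SamplingParams(temperature=0, max_tokens=256)

prompts = ["Question 1: ...", "Question 2: ...", "Question 3: ..."]
outputs = llm.generate(prompts, params)

for out in outputs:
    print(out.outputs[0].text)
```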
Any idea how SGLang compares to Aphrodite? I saw some people saying vLLM and Aphrodite are pretty much the same speed, so I'm wondering if the decision between all of these just comes down to which is the most intuitive and has the best model support.
SGLang is mostly focused on enterprise inference, so GPU-poors like me can't get much use out of it unless you have enough VRAM. It mainly supports AWQ, GPTQ, and other larger quant formats, plus various KV-cache quants. No GGUF support or CPU offload whatsoever.
Aphrodite is a spinoff of vLLM and looks like an extension of it (more models, better docs), but I haven't checked it in a while, and I'm not sure there's any edge in using Aphrodite over vLLM. If you can rent enough VRAM and figure out SGLang, that would be the fastest, provided you run the benches in parallel.
Are you testing reasoning models now?
Not really. Local reasoning models still don't work with my eval program, but since API models only output the final answer, not the thinking tokens, they are more or less supported.
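For reference, this is roughly what that looks like against an OpenAI-style API (the model name is just an example): the reasoning happens server-side and only the final answer text comes back in `message.content`.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Example reasoning-model name; the thinking tokens are not returned,
# only the final answer text.
resp = client.chat.completions.create(
    model="o3-mini",
    messages=[{"role": "user", "content": "Answer only: is 97 prime? (yes/no)"}],
)
print(resp.choices[0].message.content)
```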