Reasoning Models
At a reasonable price, I can rent 3 A40s and test a Q6_K 9b model in ~1.2 minutes and a 72b model in ~9.2 minutes. To keep eval times down, many of the test prompts have the LLM answer multiple questions in a single response and instruct it not to say anything except its answers. Letting models explain each answer in a couple of paragraphs, like they normally would, could multiply the testing time by 4-5x. This would be an especially big issue with the political test, which has 288 questions.
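For illustration, here's a minimal sketch of what that kind of multi-question, answer-only prompt could look like (the question text and answer format are made up, not the actual eval prompts):

```python
# Hypothetical sketch of a multi-question, answer-only prompt.
# The questions and the exact format are placeholders for illustration.
questions = [
    "1. Is the sky blue? (yes/no)",
    "2. Is 7 a prime number? (yes/no)",
    "3. Does water boil at 50 C at sea level? (yes/no)",
]

prompt = (
    "Answer each question below. Respond ONLY with the question number "
    "followed by your answer, one per line. Do not explain your answers.\n\n"
    + "\n".join(questions)
)
```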
I don't see how I could feasibly benchmark reasoning models. Even allowing them to think for just 2 paragraphs would likely be too time-consuming given the total number of questions across all of the benchmarks, and most reasoning models think for a lot longer than that, sometimes as much as 5,000 words (around 30-40 paragraphs). Testing a single reasoning model on all of the questions would either take many hours or require paying a lot more for better GPUs. There's also the question of how long to let a model think before deciding it's never going to stop.
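One crude option, if it ever becomes worth supporting, would be a hard token cap on the thinking phase. A minimal sketch with llama-cpp-python (the model path is a placeholder, and the `</think>` stop string is an assumption that varies by model):

```python
from llama_cpp import Llama

# Placeholder model path; many reasoning models wrap their chain of thought
# in tags like <think>...</think>, but the exact tag is model-specific.
llm = Llama(model_path="models/reasoning-model-q6_k.gguf", n_ctx=8192)

prompt = "Is 97 a prime number? Think it through, then answer yes or no."

out = llm.create_completion(
    prompt,
    max_tokens=2048,     # hard cap so the model can't think forever
    stop=["</think>"],   # or cut off at the end-of-thinking tag, if the model emits one
)
print(out["choices"][0]["text"])
```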
If anyone has a solution to this, I’d love to hear it, but as of now the leaderboard doesn’t really support reasoning models.
Seems I might be able to reduce eval times enough by switching over from llama-cpp-python to Tabby or Aphrodite to take advantage of batching. Wish I had looked into this sooner.
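Both expose an OpenAI-compatible HTTP API, so the eval side could look something like this sketch, where concurrent requests let the server batch them on the GPU (the port, route, and model name are placeholders, not the actual setup):

```python
import concurrent.futures
import requests

# Placeholder endpoint and model name for an OpenAI-compatible server.
API_URL = "http://localhost:5000/v1/completions"
MODEL = "my-local-model"

def complete(prompt: str) -> str:
    resp = requests.post(API_URL, json={
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": 256,
        "temperature": 0,
    })
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]

prompts = [f"Question {i}: ..." for i in range(32)]

# Fire requests concurrently; the server batches them internally.
with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
    answers = list(pool.map(complete, prompts))
```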
Maybe I'm a bit late, but you could also try vLLM or SGLang. SGLang was co-developed with the DeepSeek guys to help with distributed inference, and in my experience it's impressively fast, but unlike vLLM it doesn't support GGUF.
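With vLLM, offline batched generation over a whole question set is a few lines; a minimal sketch (the model name is just an example):

```python
from vllm import LLM, SamplingParams

# Example model name; vLLM batches the entire prompt list in one generate() call.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")
params = SamplingParams(temperature=0, max_tokens=256)

prompts = ["Question 1: ...", "Question 2: ...", "Question 3: ..."]
outputs = llm.generate(prompts, params)

for out in outputs:
    print(out.outputs[0].text)
```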
Any idea how SGLang compares to Aphrodite? I saw some people saying vLLM and Aphrodite are pretty much the same speed, so I'm wondering if the decision between all of these just comes down to which is the most intuitive and has the best model support.
SGLang is mostly focused on enterprise inference, so GPU-poors like me can't get much use out of it unless you have enough VRAM. It mainly supports AWQ, GPTQ, and other larger quant formats, plus various KV-cache quants. No GGUF support or CPU offload whatsoever.
Aphrodite is a spinoff of vLLM and looks like an extension of it (more models, better docs), but I haven't checked it in a while, and I'm not sure there's any edge in using Aphrodite over vLLM. If you can rent enough VRAM and figure out SGLang, that would be the fastest, provided you run the benches in parallel.
Are you testing reasoning models now?
Not really. Local reasoning models still don't work with my eval program, but since API models only output the final answer, not the thinking tokens, they are more or less supported.
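For reference, this is roughly what that looks like against an OpenAI-style API (the model name is just an example): the reasoning happens server-side and only the final answer text comes back in `message.content`.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Example reasoning-model name; the thinking tokens are not returned,
# only the final answer text.
resp = client.chat.completions.create(
    model="o3-mini",
    messages=[{"role": "user", "content": "Answer only: is 97 prime? (yes/no)"}],
)
print(resp.choices[0].message.content)
```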