Results may vary?

#970
by altomek - opened

Just found this:

[Screenshot: leaderboard scores for the same model differing between evaluation runs]

It looks like one model's scores can vary by a few points between evaluation runs.

Open LLM Leaderboard org

Hi @altomek ,

Thanks for your question! Yes, this can happen when a model was evaluated with different precision, and this is exactly such a case. You can see in my screenshot that nisten/franqwenstein-35b was evaluated in both bfloat16 and float16. In addition, the bfloat16 version includes a chat_template.

[Screenshot: leaderboard entries for nisten/franqwenstein-35b evaluated in bfloat16 and float16]
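
For illustration only (this is not the leaderboard's actual evaluation harness, just a sketch of what the precision and chat_template difference means in transformers terms; the prompt and settings here are made up):

```python
# Rough sketch: the two runs differ in torch_dtype and in whether the prompt
# is wrapped with the tokenizer's chat template before scoring.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nisten/franqwenstein-35b"  # the model discussed in this thread

# Run A: bfloat16 weights, prompt wrapped with the chat template
model_bf16 = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)
chat_prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "2 + 2 = ?"}],
    tokenize=False,
    add_generation_prompt=True,
)

# Run B: float16 weights, raw prompt with no chat template applied
model_fp16 = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
raw_prompt = "2 + 2 = ?"
```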

Open LLM Leaderboard org

I think it should be clear now, so let me close this discussion. Feel free to ping me here if you have any other questions about this case, or open a new discussion!

alozowski changed discussion status to closed

Oh, this explains a lot! Thank you. To be frank, I expected some differences between runs, but this still looked suspicious.

Open LLM Leaderboard org

My hypothesis is that the chat template can affect GPQA in this case. However, all the evaluation details are open; feel free to clone them if you want to inspect things: https://huggingface.co/datasets/open-llm-leaderboard/nisten__franqwenstein-35b-details
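
For example, one minimal way to pull the details locally (a sketch using huggingface_hub; the per-task file layout inside the repo varies, so browse the downloaded folder):

```python
# Minimal sketch: download the public evaluation details dataset for local inspection.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="open-llm-leaderboard/nisten__franqwenstein-35b-details",
    repo_type="dataset",
)
print(local_dir)  # per-task result files live under this directory
```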
