Results may vary?

#970
by altomek - opened

Just found this:

[Screenshot: leaderboard scores for the same model differing between evaluation runs]

It looks like one model's scores can vary by a few points between evaluation runs.

Open LLM Leaderboard org

Hi @altomek ,

Thanks for your question! Yes, this can happen when a model was evaluated with different precision, and this is exactly such a case. You can see in my screenshot that nisten/franqwenstein-35b was evaluated in both bfloat16 and float16. In addition, the bfloat16 version includes a chat_template.

[Screenshot: leaderboard entries for nisten/franqwenstein-35b evaluated in bfloat16 and float16]
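
For illustration only (this is not the leaderboard's actual evaluation harness, just a sketch of what the precision and chat_template difference means in transformers terms; the prompt and settings here are made up):

```python
# Rough sketch: the two runs differ in torch_dtype and in whether the prompt
# is wrapped with the tokenizer's chat template before scoring.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nisten/franqwenstein-35b"  # the model discussed in this thread

# Run A: bfloat16 weights, prompt wrapped with the chat template
model_bf16 = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)
chat_prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "2 + 2 = ?"}],
    tokenize=False,
    add_generation_prompt=True,
)

# Run B: float16 weights, raw prompt with no chat template applied
model_fp16 = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
raw_prompt = "2 + 2 = ?"
```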

Open LLM Leaderboard org

I think it should be clear now, so let me close this discussion. Feel free to ping me here if you have any other questions about this case, or open a new discussion!

alozowski changed discussion status to closed

Oh, this explains a lot! Thank you. To be frank, I expected some differences between runs, but this still looked suspicious.

Open LLM Leaderboard org

My hypothesis is that the chat template can affect GPQA in this case. However, all the evaluation details are open; feel free to clone them if you want to inspect things: https://huggingface.co/datasets/open-llm-leaderboard/nisten__franqwenstein-35b-details
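
For example, one minimal way to pull the details locally (a sketch using huggingface_hub; the per-task file layout inside the repo varies, so browse the downloaded folder):

```python
# Minimal sketch: download the public evaluation details dataset for local inspection.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="open-llm-leaderboard/nisten__franqwenstein-35b-details",
    repo_type="dataset",
)
print(local_dir)  # per-task result files live under this directory
```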
