Results may vary?
Hi @altomek,
Thanks for your question! Yes, this can happen when a model is evaluated in different precisions, and this is exactly such a case. You can see in my screenshot that nisten/franqwenstein-35b was evaluated in both bfloat16 and float16. On top of that, the bfloat16 version includes a chat_template.
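If you want to see the precision effect locally, here is a minimal sketch of loading the same checkpoint in both precisions, assuming the standard transformers API (this is not the leaderboard's actual evaluation harness, and the prompt is just an illustration):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nisten/franqwenstein-35b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
prompt = "2 + 2 ="  # placeholder prompt, not a leaderboard task

for dtype in (torch.bfloat16, torch.float16):
    # Same weights, different precision: bfloat16 and float16 round
    # differently, so logits (and sometimes greedy answers) can diverge.
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=dtype, device_map="auto"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=8, do_sample=False)
    print(dtype, tokenizer.decode(out[0], skip_special_tokens=True))
```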
I think it should be clear now, so let me close this discussion. Feel free to ping me here if you have any other questions about this case, or open a new discussion!
Oh, this explains a lot! Thank you. To be frank, I do expect some differences between runs, but this still looked suspicious.
My hypothesis is that the chat template can affect GPQA in this case. However, all the evaluation details are open; feel free to clone them if you want to inspect things: https://huggingface.co/datasets/open-llm-leaderboard/nisten__franqwenstein-35b-details
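In case it helps anyone digging in, here is a hedged sketch of pulling those details locally and of what a chat template does to a prompt. The question string is a made-up placeholder, and this snippet only illustrates the mechanism, it does not confirm the hypothesis:

```python
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer

# Download the published per-sample eval details for local inspection.
local_dir = snapshot_download(
    repo_id="open-llm-leaderboard/nisten__franqwenstein-35b-details",
    repo_type="dataset",
)
print("details downloaded to:", local_dir)

# Show how the chat template wraps a raw question into the prompt the
# model actually sees, when a template is present in the tokenizer config.
tokenizer = AutoTokenizer.from_pretrained("nisten/franqwenstein-35b")
question = "Which of the following best explains ...?"  # placeholder, not real GPQA
if tokenizer.chat_template:
    templated = tokenizer.apply_chat_template(
        [{"role": "user", "content": question}],
        tokenize=False,
        add_generation_prompt=True,
    )
    print(templated)
```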