clefourrier posted an update Mar 29
Fun fact about evaluation!

Did you know that, if you evaluate the same model, with the same prompt formatting & the same fixed few-shot examples, only changing
♻️ the order in which the few-shot examples are added to the prompt ♻️
you can get a difference of up to 3 points in evaluation score?

I did a small experiment using some MMLU subsets on the best-performing pretrained models of 7B parameters or fewer from the leaderboard.

I tried 8 different prompting methods that are commonly used in evaluation, containing more or less information: just the question, or "Question: {question}", or "Question: {question} Choices: ...", and so on (see the x-axis).
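
To make this concrete, here is a minimal sketch of what a few such format variants could look like. The function names and exact wording are illustrative placeholders, not the actual formats used in the experiment.

```python
# Illustrative prompt format variants (names and wording are placeholders,
# not the exact formats from the experiment).

def format_bare(question, choices):
    # Variant 1: just the raw question text
    return question

def format_question_prefix(question, choices):
    # Variant 2: "Question: <question>"
    return f"Question: {question}"

def format_question_choices(question, choices):
    # Variant 3: question plus an explicit list of choices and an answer cue
    lines = [f"Question: {question}", "Choices:"]
    lines += [f"{label}. {choice}" for label, choice in zip("ABCD", choices)]
    lines.append("Answer:")
    return "\n".join(lines)
```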

I then compared the results for all these methods, in 5-shot, across 2 runs. The *only difference* between the first and second run was that the few-shot samples were not introduced in the same order.
For example, run one would have been "A B C D E Current sample", vs, in run 2, "D C E A B Current sample".
All the other experiment parameters stayed exactly the same.
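
Here is a minimal sketch of what the two runs amount to, assuming a simple "Question:/Answer:" format; `build_prompt`, the example data, and the permutation are placeholders, not the actual experiment code.

```python
# Minimal sketch: same fixed few-shot examples, same formatting, only the
# order of the examples in the prompt differs between the two runs.

def build_prompt(fewshot_examples, current_question, fmt):
    shots = "\n\n".join(fmt(q, a) for q, a in fewshot_examples)
    return shots + "\n\n" + fmt(current_question)

fmt = lambda q, a=None: f"Question: {q}\nAnswer:" + (f" {a}" if a else "")

# Placeholder few-shot pool (A..E) and current sample
examples = [("qA", "aA"), ("qB", "aB"), ("qC", "aC"), ("qD", "aD"), ("qE", "aE")]

run1_prompt = build_prompt(examples, "current question", fmt)        # A B C D E
run2_order = [examples[i] for i in (3, 2, 4, 0, 1)]                  # D C E A B
run2_prompt = build_prompt(run2_order, "current question", fmt)
```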

As you can see on the attached picture, you get a difference of up to 3 points between the two few-shot sample orderings.

So, when just changing *the order of the few-shot samples* can change your results by several points, what is the impact of all the other "minimal" and unreported prompting changes?

-> Any kind of model score, provided without an evaluation script for reproducibility, is basically bullshit (or comms).
-> This is why we need reproducible evaluation in a fair and exactly similar setup, using evaluation suites such as lm_eval from the Harness, lighteval from HF, or the Open LLM Leaderboard.
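
For instance, a fully specified run through the Harness might look like the sketch below. This assumes a recent version of lm-evaluation-harness that exposes `lm_eval.simple_evaluate`; the model name is a placeholder and argument names may differ between versions.

```python
# Hedged sketch of a reproducible 5-shot MMLU run via the Harness's Python API
# (assumes lm_eval >= 0.4 with `simple_evaluate`; model name is a placeholder).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=mistralai/Mistral-7B-v0.1",  # placeholder model
    tasks=["mmlu"],
    num_fewshot=5,
)
print(results["results"])
```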

That's not surprising in the slightest, to be honest. That said, apologies if I'm mistaken, but aren't you supposed to reset the context window in between questions (while keeping options in a predetermined order) in all those semi-automated tests, to avoid the very thing you mention?

edit: Okay, after re-re-reading this post (because it's unbelievable that this is a thing to begin with), are you really telling us that the companies/entities/persons doing those evals (including the leaderboards HuggingFace is hosting) are doing it manually, without context reset, and with zero oversight? And no one took notice? 🫤

·

I have no idea what you mean by "resetting the context window": every question is asked independently of the previous ones, and has its own context and context window.
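
In other words, something like the following loop (a simplified sketch, not lm_eval's actual internals; `score_choice` is a placeholder for the model's per-choice log-likelihood scoring, and `fewshot_block` is the preformatted few-shot text):

```python
# Simplified sketch: each sample gets its own freshly built prompt, so previous
# questions and the model's previous outputs never leak into the context.
def evaluate(dataset, fewshot_block, score_choice):
    correct = 0
    for sample in dataset:
        # The prompt is rebuilt from scratch for every sample
        prompt = f"{fewshot_block}\n\nQuestion: {sample['question']}\nAnswer:"
        scores = [score_choice(prompt, choice) for choice in sample["choices"]]
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == sample["answer"])
    return correct / len(dataset)
```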

None of these evaluations are usually done manually (we use lm_eval for the Open LLM Leaderboard, for example), but people rarely report their precise evaluation setup: papers often contain things like "we evaluate MMLU in 5-shot and it gives us this amazing SOTA result", without providing the precise evaluation script, the prompt, etc.

As we can see in this post, even an extremely minor change to an evaluation prompt (providing the same fixed few-shot samples in one order or another) can have a real impact on the results. Hence why people really should only report results accompanied by their evaluation scripts.

This seems strange, I wonder why?

·

The most likely explanation is that some examples are better/worse for specific models. There were some interesting discussions on Twitter if you're curious :)