Suggestion: Adding outlier-resistant averaging methods
Add an option to rank models by an average that accounts for exploding values (very large values in one of the columns). The aim is to be able to find models that are balanced on average and capable of solving all of the problems presented here in the tests.
Hello. This is more about "calculating the results". In general, if we look at Figure 1, we can see that by average1 the first model should be in second place, but by average2 it is in first place.
The second average is calculated in a way that takes into account values that differ greatly from the rest of the results.
As an example, I will set one of the parameters of the third model to 9,000.
Here we can see that in the first table, due to the calculation of the mean, model 3 is in the lead, but in the second table it can only rank second.
The same is true if we set two of the model's parameters to 9,000.
It is only when we set all three parameters to 9,000 that model 3 also ranks first in the second table in terms of average.
Something like that. Unfortunately, I'm not very good at explaining things.
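To make the ranking flip described above concrete, here is a minimal sketch. The post does not say how average2 is computed, so the harmonic mean is used purely as a stand-in for an outlier-resistant average, and all scores and the names `BASE_SCORES`, `with_outliers` and `rank` are hypothetical.

```python
from statistics import fmean, harmonic_mean

# Hypothetical scores for three models on three benchmarks (not leaderboard data).
BASE_SCORES = {
    "model_1": [60, 60, 60],
    "model_2": [95, 80, 20],
    "model_3": [50, 70, 18],
}


def with_outliers(values, count, outlier=9000):
    """Replace the first `count` values with an exploding value."""
    return [outlier] * count + values[count:]


def rank(scores, average):
    """Return model names ordered from best to worst under `average`."""
    return sorted(scores, key=lambda name: average(scores[name]), reverse=True)


# Set 0, 1, 2 and then 3 of model_3's scores to 9,000 and compare both rankings.
for count in range(4):
    scores = dict(BASE_SCORES, model_3=with_outliers(BASE_SCORES["model_3"], count))
    print(count, rank(scores, fmean), rank(scores, harmonic_mean))
```

With these made-up numbers, a single 9,000 is enough to put model_3 first under the arithmetic mean, while under the harmonic mean it stays behind model_1 until all three of its scores are 9,000.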
I think I got your idea, thank you!
You're pointing out that the current method of calculating averages doesn't account for extreme values in one or more columns, which can skew the results. So the goal of harmonising the average score is to find models that perform well across all tasks, rather than letting outliers dominate the average score.
This idea makes sense. We need to discuss it internally, and I will get back to you with my answer.
Let me rename the discussion. Feel free to correct me.
Ok. And yes, you probably have the right definition.
I'm back with our thoughts: we've decided to keep our current arithmetic mean approach because it is simple and widely understood. Plus, since we're currently normalising the scores, we're already mitigating the outlier effect.
Nevertheless, I will keep your approach in mind and might get back to it later.
Let me close this discussion for now. We greatly appreciate your involvement! Please feel free to share any of your ideas here in discussions, and don't hesitate to ask questions if you run into any problems!
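As a rough illustration of why normalisation limits the outlier effect mentioned above: if every score is rescaled to a bounded range before averaging, no single column can reach a value like 9,000. The sketch below assumes a rescaling between a random-guess baseline and a maximum score, clamped to 0..100. The function name `normalise` and its parameters are hypothetical, and the leaderboard's actual formula may differ.

```python
def normalise(raw, baseline, maximum=1.0):
    """Rescale a raw score so that `baseline` maps to 0 and `maximum` to 100.

    Assumption for illustration: results are clamped to the range 0..100.
    """
    scaled = (raw - baseline) / (maximum - baseline) * 100.0
    return min(max(scaled, 0.0), 100.0)


# A four-choice task has a random-guess baseline of 0.25.
print(normalise(0.625, baseline=0.25))
```

Once every column lives on the same bounded scale, an arithmetic mean over the columns cannot be dominated by one exploding value, although a single very low or very high score within that range still moves the average.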
As an option, this approach can be added to another column.
Yes, we discussed it as a separate column, but the logic remains the same as I've described above
If anyone here only wants to suppress overfitting outliers (scores that are too high) without removing low-score outliers,
then maybe some mix of geometric mean and harmonic mean would work.
Alternatively, averaging based on odds ratios (truncated at extremes near 1 or 0 to avoid problems) may keep the significance of differences such as 0.99 vs 0.95 vs 0.9, and 0.01 vs 0.03 vs 0.1.
Maybe a weighted average of these is an option?
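The suggestion above could be sketched roughly as follows, assuming scores are fractions between 0 and 1. The helper names `logit_mean` and `blended_mean`, the truncation threshold `eps` and the equal weights are all hypothetical choices for illustration, not anything the leaderboard implements.

```python
import math
from statistics import fmean, geometric_mean, harmonic_mean


def clip(scores, eps):
    """Truncate scores away from 0 and 1 so logs and odds stay finite."""
    return [min(max(s, eps), 1.0 - eps) for s in scores]


def logit_mean(scores, eps=1e-3):
    """Average scores in log-odds space, then map back to a 0..1 score."""
    logits = [math.log(s / (1.0 - s)) for s in clip(scores, eps)]
    return 1.0 / (1.0 + math.exp(-fmean(logits)))


def blended_mean(scores, weights=(1 / 3, 1 / 3, 1 / 3), eps=1e-3):
    """Weighted mix of the geometric, harmonic and log-odds means."""
    clipped = clip(scores, eps)
    parts = (geometric_mean(clipped), harmonic_mean(clipped), logit_mean(clipped, eps))
    return sum(w * p for w, p in zip(weights, parts))


scores = [0.9, 0.9, 0.1]
print(fmean(scores), blended_mean(scores))
```

The geometric and harmonic means are pulled toward the lowest scores, so one very high score cannot dominate the result, while the log-odds mean treats 0.99 vs 0.95 as a larger gap than 0.55 vs 0.51.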
Hi!
We try to avoid adding too many options to the leaderboard to keep it usable by the majority of people. If you want to compute your own custom geometric/harmonic means on the results, you can do so by downloading the contents here: https://huggingface.co/datasets/open-llm-leaderboard/contents/tree/main