MMLU reproducibility issue

#185
by ykoreeda - opened

Hi! Thanks for the effort in putting everything together.
I am trying to replicate some of the experiments locally. Contrary to what is presented in the "Reproducibility" section, the linked version of the Eleuther AI Harness, run with the presented command, does not create an "all" entry in the output JSON file. Did you apply some kind of post-processing to convert the Harness output JSON into what is documented in the results repo?
This is especially problematic for MMLU, because simply taking the macro average of all hendrycksTest-* scores does not give the same number as the main entry.

Open LLM Leaderboard org
edited Aug 11, 2023

Hi!
Yes, we have our own post-processing of the final results, which lets us debug evaluation failures faster. I'll update the About section to make this clearer.
The displayed MMLU score is simply the average of all MMLU sub-scores; you can see how this is computed in the front-end code here.
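
For reference, here is a minimal sketch of that macro average. It assumes the older lm-evaluation-harness output layout, where sub-task results are keyed as `hendrycksTest-<subject>` under a top-level `"results"` object with an `"acc"` metric; adjust the task prefix and metric key if your output file differs.

```python
import json
import sys


def mmlu_macro_average(results_path: str) -> float:
    """Macro-average the accuracy of every MMLU (hendrycksTest-*) sub-task.

    Assumes the harness output layout
    {"results": {"hendrycksTest-<subject>": {"acc": ...}, ...}}.
    """
    with open(results_path) as f:
        results = json.load(f)["results"]

    # Collect the per-subject accuracies and average them with equal weight.
    scores = [
        metrics["acc"]
        for task, metrics in results.items()
        if task.startswith("hendrycksTest-")
    ]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    print(mmlu_macro_average(sys.argv[1]))
```

Note that this is an unweighted (macro) average over sub-tasks, so it will differ from a micro average weighted by the number of questions per subject.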

Could you please tell me what you were expecting and for which model?

clefourrier changed discussion status to closed
