[Feedback welcome] Add evaluation results to model card metadata

#40
by Wauplin HF staff - opened
Hugging Face H4 org
•
edited Nov 29, 2023

This is a work in progress. The goal is to list evaluation results in the model card metadata, especially the results from the Open LLM Leaderboard. This PR has not been created automatically.

Pending questions:

  1. Should we report all metrics for each task? (especially the _stderr ones?) Or only the one that is displayed in the LLM Leaderboard?
  2. Are the dataset type/name/config/split/num_few_shot accurate in the suggested changes?
  3. How to report the MMLU results? There are 57 different hendrycksTest datasets for a total of 228 metrics? 😡
  4. How to report MT-Bench results? (asking since they are reported in the model card but not in the metadata)
  5. How to report AlpacaEval results? (asking since they are reported in the model card but not in the metadata)

Related thread: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard/discussions/370#65663f60589e212284db2ffc.
Related PR in the Hub docs: https://github.com/huggingface/hub-docs/pull/1144.

Thanks to @clefourrier who guided me with the Open LLM Leaderboard results 🤗

cc @julien-c @lewtun @Weyaxi

Wauplin changed pull request title from [WIP] Add evaluation results to model card metadata to [Feedback welcome] Add evaluation results to model card metadata

Should we report all metrics for each task? (especially the _stderr ones?) Or only the one that is displayed in the LLM Leaderboard?

  1. In my opinion, the one displayed on the LLM Leaderboard would be a better choice because people generally want to know those results. Also, reporting every metric could confuse things a little. On the other hand, the other metrics can show a more detailed version of the results.

How to report the MMLU results? There are 57 different hendrycksTest datasets for a total of 228 metrics? 😡

  1. Hmm, I think something like 'Overall MMLU' could work, but I'm not sure about that.

Hugging Face H4 org

  1. For the leaderboard, only one metric (the reported one as @Weyaxi suggested) should be enough, especially if you provide the hyperlink to the details
  2. From a first look:
    • ARC: OK
    • HellaSwag: dataset = hellaswag, split = validation
    • DROP: the harness actually applies a post-processing step to the drop dataset, but I think saying drop should be fine anyway, split = validation
    • TruthfulQA: OK
    • GSM8K: config = main
    • MMLU: dataset = cais/mmlu, config = all of them (if you want to provide the list it's in the about of the leaderboard), split = test
    • Winogrande: dataset = winogrande, config = winogrande_xl, split = validation
  3. For MMLU, we report the average of all acc scores, so something like "Aggregated MMLU" with "avg(acc)" as the metric, for example. People who want the details should go read them directly, as listing them all would just be overwhelming.
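
As a rough illustration of the aggregation described in point 3, here is a minimal sketch of an unweighted average over per-subject `acc` scores. The function name `aggregate_mmlu` and the scores are hypothetical and only there to show the arithmetic; they are not leaderboard results.

```python
from statistics import mean


def aggregate_mmlu(per_subject_acc):
    """Return the unweighted mean of per-subject `acc` scores.

    `per_subject_acc` maps a subject name to its accuracy (hypothetical input shape).
    """
    if not per_subject_acc:
        raise ValueError("no per-subject scores given")
    return mean(per_subject_acc.values())


# Made-up scores for three of the 57 subjects, for illustration only.
example_scores = {
    "hendrycksTest-abstract_algebra": 0.25,
    "hendrycksTest-anatomy": 0.50,
    "hendrycksTest-astronomy": 0.75,
}

aggregated = aggregate_mmlu(example_scores)
print(f"avg(acc) = {aggregated:.4f}")
```

Only this single aggregated number would go in the metadata, with the per-subject details left to the linked results.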

Hugging Face H4 org
•
edited Nov 30, 2023

Thanks both for the feedback!

I pushed changes in 5ae48397:

  • only 1 metric per benchmark (keeping the one on the leaderboard as suggested)
  • add MMLU results => keep only 1 global result
  • add Winogrande (thanks @clefourrier for noticing it was missing :D)
  • corrected the few dataset/config/split that were not accurate.
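
For readers who want to see the shape of the result, here is a minimal sketch of how one-metric-per-benchmark entries could be assembled into a `model-index` structure. It is not the actual diff: the helper `make_result`, the model name, the `text-generation` task type, the metric types, and the `0.0` values are all placeholders or assumptions. Only the dataset, config, and split values come from the feedback above.

```python
import json


def make_result(name, dataset_type, split, metric_type, value,
                config=None, metric_name=None):
    """Build one `results` entry with a single metric (hypothetical helper)."""
    dataset = {"name": name, "type": dataset_type, "split": split}
    if config is not None:
        dataset["config"] = config
    metric = {"type": metric_type, "value": value}
    if metric_name is not None:
        metric["name"] = metric_name
    return {
        "task": {"type": "text-generation"},  # assumed task type
        "dataset": dataset,
        "metrics": [metric],
    }


# Values are 0.0 placeholders, NOT real scores; metric types are assumptions.
results = [
    make_result("HellaSwag", "hellaswag", "validation", "acc_norm", 0.0),
    make_result("Winogrande", "winogrande", "validation", "acc", 0.0,
                config="winogrande_xl"),
    # MMLU is reported as one aggregated entry rather than 57 separate ones.
    make_result("Aggregated MMLU", "cais/mmlu", "test", "acc", 0.0,
                metric_name="avg(acc)"),
]

metadata = {"model-index": [{"name": "example-model", "results": results}]}
print(json.dumps(metadata, indent=2))
```

In the model card itself this structure would live in the YAML front matter; the dict above is just a convenient way to inspect it.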

Looks like we have a good final version now :)

Hugging Face H4 org

Thanks a lot for adding this clean evaluation index! I think for AlpacaEval we can point the type to https://huggingface.co/datasets/tatsu-lab/alpaca_eval

Apart from that, this LGTM 🔥

Hugging Face H4 org

Thanks everyone for the feedback! Let's merge this :)

lewtun changed pull request status to merged

great PR all!
