Detailed results on MMLU-Medical

by maximegmd - opened Mar 2, 2024

Hello,

I am trying to run the MMLU medical tasks on Meditron 7b using lm-eval-harness, but I obtain low results that are inconsistent with your reported average of 54.2. I am convinced the issue is with my testing methodology. Could you please share the per-task results for the tasks included in MMLU-Medical, for a fair comparison with other models?

Thanks in advance

zechen-nlp

EPFL LLM Team org Mar 8, 2024

The 54.2 is from Meditron-7B fine-tuned on MedMCQA, not the base Meditron-7B model. In the paper, we reported the base Meditron-7B's performance with in-context learning (3-shot, 3 runs with 3 random seeds): 42.3±2.37. However, we don't have the fine-grained, per-task performance of the in-context runs.
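
For anyone reproducing a number in the "mean±std over seeds" form, here is a minimal sketch of the aggregation step only. The accuracies below are made-up placeholders, not Meditron results, and the use of the sample standard deviation is an assumption (the thread does not say which estimator was used). `summarize_runs` is a hypothetical helper name.

```python
from statistics import mean, stdev


def summarize_runs(accuracies):
    """Return (mean, sample standard deviation) of per-seed accuracies."""
    # Assumption: sample std (n - 1 denominator); swap in pstdev for population std.
    return mean(accuracies), stdev(accuracies)


# Hypothetical per-seed accuracies in percent, one per random seed.
# These are placeholders, NOT actual Meditron-7B results.
placeholder_runs = [40.0, 42.0, 44.0]

avg, spread = summarize_runs(placeholder_runs)
print(f"{avg:.1f}±{spread:.2f}")
```

With only three runs the spread estimate is noisy, so a gap of a point or two between your own runs and a reported figure may not be meaningful on its own.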
