Detailed results on MMLU-Medical
#8 by maximegmd
Hello,
I am trying to run the MMLU medical tasks on Meditron 7b, but I obtain low results that are inconsistent with your reported average of 54.2. I am convinced the issue is with my testing methodology (using lm-eval-harness). Could you please share the per-task results for the tasks included in MMLU-Medical, so that I can make a fair comparison with other models?
Thanks in advance
The 54.2 is from Meditron-7B finetuned on MedMCQA, not the base Meditron-7B model. In the paper, we reported the base Meditron-7B's performance with in-context learning (3-shot, 3 runs with 3 random seeds): 42.3±2.37. However, we don't have the fine-grained performance of the in-context runs.
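For anyone reproducing the in-context number, here is a minimal sketch of how a "mean ± spread over seeds" figure can be computed from per-seed accuracies. The function name `summarize_runs` and the input values are hypothetical placeholders, not results from the paper. The thread does not say whether the reported ± is a sample or population standard deviation, so the sample version used here is an assumption.

```python
from statistics import mean, stdev


def summarize_runs(accuracies):
    """Return (mean, sample standard deviation) of per-seed accuracies.

    Assumption: the spread is the sample standard deviation (n - 1 in the
    denominator); the thread does not state which estimator was used.
    """
    if len(accuracies) < 2:
        raise ValueError("need at least two runs to estimate a spread")
    return mean(accuracies), stdev(accuracies)


# Placeholder accuracies for three seeds, for illustration only.
# These are NOT results from the paper or from any actual run.
per_seed_accuracy = [40.0, 42.0, 44.0]

avg, spread = summarize_runs(per_seed_accuracy)
print(f"{avg:.1f} ± {spread:.2f}")
```

Comparing a single-seed run against a figure like this is only meaningful within the reported spread, and only if the shot count and prompt format match.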