MMLU Lower Results Theory

#5
by fblgit - opened

A model that performs better on "everything" must also perform better, or at least comparably, on MMLU; otherwise something must be wrong.

Have you tried transforming MMLU into a turn-based, Trivial Pursuit-style Q&A session?
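As a minimal sketch of what that reformatting could look like (the helper name and example item are hypothetical, not from this thread): drop the multiple-choice options and keep the gold choice as a free-form trivia answer.

```python
# Hypothetical sketch: turn an MMLU-style multiple-choice item
# (question, choices, answer index) into an open-ended trivia prompt.

def to_trivia_prompt(question, choices, answer_idx):
    """Strip the choices; keep the gold choice as the free-form target."""
    prompt = f"Q: {question.strip()}\nA:"
    target = choices[answer_idx]
    return prompt, target

item = {
    "question": "What is the capital of France?",
    "choices": ["Berlin", "Madrid", "Paris", "Rome"],
    "answer": 2,
}

prompt, target = to_trivia_prompt(item["question"], item["choices"], item["answer"])
print(prompt)  # Q: What is the capital of France?\nA:
print(target)  # Paris
```

The model would then be scored on whether its free-form completion matches the target, rather than on picking a letter.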

OpenChat org

It's likely that the official Llama-3-8b-instruct results were obtained with special prompts. The measured results are quite low; see here for discussion.

The scores on a stale benchmark are irrelevant. I believe what we're really trying to do here is prove that this model and strategy are optimal, right? :D

fblgit changed discussion status to closed
