About MMLU evaluation

#12
by ldwang - opened

Thank you for sharing.

Some models, such as Qwen1.5B and Phi1.5, typically report MMLU in a 5-shot setting.
cosmo-1b also used the same setting: https://huggingface.co/blog/cosmopedia#training-stack.

Can you explain why the MMLU evaluation here was changed to a zero-shot approach that uses the option content?

Thank you.

Hugging Face TB Research org
edited Aug 21

Hi, we now use the same evaluation setup as for our internal projects (the same as in the FineWeb and FineWeb-Edu ablations), where we run all benchmarks zero-shot.
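
For readers unfamiliar with the distinction raised in the question, here is a minimal sketch of how the two prompt styles might differ. It is not the actual evaluation code used for these models: the function names, the prompt template, and the reading of "option content" as scoring the full answer text (rather than a letter A to D) are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

LETTERS = "ABCD"

def format_mcq(question: str, options: List[str], answer_idx: Optional[int] = None) -> str:
    """Render one multiple-choice item; append the gold letter if given (hypothetical template)."""
    lines = [f"Question: {question}"]
    lines += [f"{letter}. {opt}" for letter, opt in zip(LETTERS, options)]
    answer = f" {LETTERS[answer_idx]}" if answer_idx is not None else ""
    lines.append("Answer:" + answer)
    return "\n".join(lines)

def few_shot_prompt(
    examples: List[Tuple[str, List[str], int]], question: str, options: List[str]
) -> str:
    """k-shot style: k solved examples, then the unanswered target question."""
    blocks = [format_mcq(q, o, a) for q, o, a in examples]
    blocks.append(format_mcq(question, options))
    return "\n\n".join(blocks)

def zero_shot_option_content(question: str, options: List[str]) -> List[Tuple[str, str]]:
    """Zero-shot style (assumed): no examples; each candidate continuation is the option text.

    An evaluator would score log P(continuation | context) for each pair
    and pick the highest-scoring option.
    """
    context = f"Question: {question}\nAnswer:"
    return [(context, f" {opt}") for opt in options]

if __name__ == "__main__":
    shots = [("1 + 1 = ?", ["1", "2", "3", "4"], 1)]
    print(few_shot_prompt(shots, "2 + 2 = ?", ["3", "4", "5", "6"]))
    print(zero_shot_option_content("2 + 2 = ?", ["3", "4", "5", "6"]))
```

In the few-shot style the model only has to produce a letter after seeing solved examples, while in the zero-shot style sketched here it is scored on the answer text itself, so scores from the two setups are not directly comparable.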
