A new LLM is launched!
After checking whether it's open source,
you rush to see the benchmarks...
Which benchmark does everyone check first?
MMLU (Massive Multitask Language Understanding)?
Benchmarks like MMLU are reaching saturation, and high scores often fail to translate to real-world use cases!
Meet MMLU-Pro, released by TIGER-Lab on @huggingface!
12,217 questions across biology, business, chemistry, computer science, economics, engineering, health, history, law, mathematics, philosophy, physics, and psychology, carefully validated by humans
Each question has 10 answer options instead of 4, making the evaluation more realistic and reducing the impact of random guessing
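A quick sanity check on why more options matter: with k answer choices, blind guessing scores 1/k on average, so the chance-level floor drops from 25% on MMLU to 10% on MMLU-Pro. A minimal sketch:

```python
# Expected accuracy of uniform random guessing over k answer options.
def random_guess_baseline(num_options: int) -> float:
    return 1.0 / num_options

# MMLU uses 4 options, MMLU-Pro uses 10.
print(f"MMLU (4 options):      {random_guess_baseline(4):.0%}")   # 25%
print(f"MMLU-Pro (10 options): {random_guess_baseline(10):.0%}")  # 10%
```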
56% of questions come from MMLU, 34% from STEM websites, and the rest from TheoremQA and SciBench
LLMs with weak chain-of-thought reasoning score lower on it, suggesting the benchmark is more challenging and more representative of real-world expectations
Any guess who tops it and who bombs it?
GPT-4o drops by 17 percentage points (from 0.887 to 0.7149)
Llama-3-70B drops by 27 percentage points (from 0.820 to 0.5541)
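The drops above are absolute (percentage-point) differences between each model's MMLU and MMLU-Pro accuracy, using the scores quoted in the post:

```python
# Absolute drop in percentage points between an MMLU score and an
# MMLU-Pro score (both given as fractions in [0, 1]).
def score_drop_points(mmlu: float, mmlu_pro: float) -> float:
    return round((mmlu - mmlu_pro) * 100, 1)

print(score_drop_points(0.887, 0.7149))  # GPT-4o: ~17 points
print(score_drop_points(0.820, 0.5541))  # Llama-3-70B: ~27 points
```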
Dataset: TIGER-Lab/MMLU-Pro