cogwheelhead 
posted an update Dec 12, 2024
Hey!

My team and I recently released two benchmarks on university-level math: U-MATH (for University-MATH) and μ-MATH (for Meta U-MATH).

We're working a lot on complex reasoning for LLMs, and we were particularly interested in evaluating university-curriculum math skills, in topics such as differential calculus and linear algebra, because of their wide applicability and practicality.

We noticed that the benchmarks available at the time were either at or below high-school level, leaned mainly towards Olympiad-style problems, or were synthetically generated from a set of templates / seeds.

We wanted to focus on university curricula and we wanted "organic" variety, so we created our own benchmark using problems sourced from actual teaching materials used in top US universities. That is how U-MATH came to be.

We are also very keen on studying and improving the evaluations themselves (that is my primary focus in particular), since the standard LLM-as-a-judge approach is known to be noisy and biased, yet this often goes unaccounted for. So we then created a U-MATH-derived benchmark for "meta-evaluation", i.e. evaluating the evaluators, which lets us quantify their error rates, study their behaviors and biases, and so on.
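
To make the "evaluate the evaluators" idea a bit more concrete, here is a minimal Python sketch of how a judge's error rates could be quantified against gold labels. It is only an illustration, not the actual μ-MATH scoring code: the names `JudgeReport` and `evaluate_judge` are hypothetical, and it assumes each sample has a gold correct/incorrect label for a solution plus the judge's verdict on that same solution.

```python
import math
from dataclasses import dataclass


@dataclass
class JudgeReport:
    """Summary of how well a judge's verdicts match gold labels (hypothetical type)."""
    accuracy: float
    false_positive_rate: float  # share of incorrect solutions the judge accepted
    false_negative_rate: float  # share of correct solutions the judge rejected


def _ratio(numerator: int, denominator: int) -> float:
    # Return NaN instead of raising when a class is absent from the data.
    return numerator / denominator if denominator else math.nan


def evaluate_judge(gold: list[bool], verdicts: list[bool]) -> JudgeReport:
    """Compare judge verdicts with gold labels (True means 'solution is correct')."""
    if len(gold) != len(verdicts):
        raise ValueError("gold and verdicts must have the same length")

    pairs = list(zip(gold, verdicts))
    tp = sum(1 for g, v in pairs if g and v)          # correct, accepted
    tn = sum(1 for g, v in pairs if not g and not v)  # incorrect, rejected
    fp = sum(1 for g, v in pairs if not g and v)      # incorrect, accepted
    fn = sum(1 for g, v in pairs if g and not v)      # correct, rejected

    return JudgeReport(
        accuracy=_ratio(tp + tn, len(pairs)),
        false_positive_rate=_ratio(fp, fp + tn),
        false_negative_rate=_ratio(fn, fn + tp),
    )


# Toy data, made up purely for illustration.
gold_labels = [True, True, False, False, True]
judge_verdicts = [True, False, True, False, True]
report = evaluate_judge(gold_labels, judge_verdicts)
print(report)
```

Reporting the false-positive and false-negative rates separately, rather than accuracy alone, is what exposes a bias: a judge that is too lenient and one that is too strict can share the same accuracy while erring in opposite directions.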

I'm super excited to be sharing those publicly!

toloka/u-math
toloka/mu-math