Hugging Face
Vitaliy Polshkov
cogwheelhead
3 followers · 1 following
AI & ML interests
Data-centric AI methods; Experiment design, statistical learning and causal inference approaches for LLM / agentic evaluations; "Proper old-school" RL and meta-learning techniques for LLM training
Recent Activity
Posted an update 15 days ago
My team and I have performed an in-depth investigation comparing o1 to R1 (and other reasoning models).

Link: https://toloka.ai/blog/r1-is-not-on-par-with-o1-and-the-difference-is-qualitative-not-quantitative

It started with us evaluating them on our own university-math benchmarks: U-MATH for problem-solving and μ-MATH for judging solution correctness (see the HF leaderboard: https://huggingface.co/spaces/toloka/u-math-leaderboard).

tl;dr: R1 sure is amazing, but we find that it lags behind in novelty adaptation and reliability:
* performance drops when benchmarks are updated with fresh unseen tasks (e.g. AIME 2024 -> 2025)
* the R1-o1 gap widens when evaluating niche subdomains (e.g. university-specific math instead of the more common Olympiad-style contests)
* the same happens when moving into altogether unconventional domains (e.g. chess) or skills (e.g. judgment instead of problem-solving)
* R1 also runs into failure modes far more often (e.g. making illegal chess moves or falling into endless generation loops)

Our point here is not to bash DeepSeek: they've done exceptional work, R1 is a game-changer, and we have no intention of downplaying that. R1's release is a perfect opportunity to study where all these models differ and to gain an understanding of how to move forward from here.
Updated a Space 19 days ago
toloka/u-math-leaderboard
Liked a Space 21 days ago
toloka/u-math-leaderboard
cogwheelhead's activity
Liked a Space 21 days ago
U Math Leaderboard 🥇 (Running)
U-MATH and μ-MATH leaderboard
Liked 2 datasets 3 months ago
toloka/mu-math • Updated Dec 5, 2024 • 1.08k • 186 • 21
toloka/u-math • Updated Dec 5, 2024 • 1.1k • 374 • 18
Liked a dataset 6 months ago
toloka/beemo • Updated Jan 28 • 2.19k • 319 • 14