CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings
Abstract
With the increasing code reasoning capabilities of existing large language models (LLMs) and breakthroughs in reasoning models like OpenAI o1 and o3, there is a growing need to develop more challenging and comprehensive benchmarks that effectively test their sophisticated competition-level coding abilities. Existing benchmarks, like LiveCodeBench and USACO, fall short due to the unavailability of private test cases, lack of support for special judges, and misaligned execution environments. To bridge this gap, we introduce CodeElo, a standardized competition-level code generation benchmark that effectively addresses all these challenges for the first time. CodeElo benchmark is mainly based on the official CodeForces platform and tries to align with the platform as much as possible. We compile the recent six months of contest problems on CodeForces with detailed information such as contest divisions, problem difficulty ratings, and problem algorithm tags. We introduce a unique judging method in which problems are submitted directly to the platform and develop a reliable Elo rating calculation system that aligns with the platform and is comparable with human participants but has lower variance. By testing on our CodeElo, we provide the Elo ratings of 30 existing popular open-source and 3 proprietary LLMs for the first time. The results show that o1-mini and QwQ-32B-Preview stand out significantly, achieving Elo ratings of 1578 and 1261, respectively, while other models struggle even with the easiest problems, placing in the lowest 20 percent among all human participants. Detailed analysis experiments are also conducted to provide insights into performance across algorithms and comparisons between using C++ and Python, which can suggest directions for future studies.
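As a rough illustration of how an Elo rating can be derived from a contest standing, the sketch below uses the textbook Elo win-probability model: it searches for the rating whose expected rank among the human participants equals the rank actually achieved. The abstract does not spell out the formula, so treating this as the paper's procedure is an assumption, and the function names `expected_rank` and `rating_from_rank` are hypothetical.

```python
from typing import Sequence


def expected_rank(rating: float, opponent_ratings: Sequence[float]) -> float:
    """Expected 1-based rank of a player with `rating` among the given opponents."""
    # Each term is the standard Elo probability that the opponent beats the player.
    return 1.0 + sum(
        1.0 / (1.0 + 10.0 ** ((rating - r) / 400.0)) for r in opponent_ratings
    )


def rating_from_rank(
    rank: float,
    opponent_ratings: Sequence[float],
    lo: float = -1000.0,
    hi: float = 5000.0,
    iters: int = 60,
) -> float:
    """Binary-search the rating whose expected rank matches the observed rank."""
    # Assumption: the answer lies inside [lo, hi]; expected rank falls as rating rises.
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if expected_rank(mid, opponent_ratings) > rank:
            lo = mid  # expected rank is worse than observed, so the rating must be higher
        else:
            hi = mid
    return (lo + hi) / 2.0
```

For example, `rating_from_rank(2, [1500, 1500])` converges to about 1500, since a player who finishes between two equally rated opponents is estimated at their level. Averaging such per-contest estimates over many contests is one plausible way to obtain a lower-variance rating.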
Community
The problem with this analysis is that it ignores the fact that most codebases are large, which means information retrieval is critical. In addition, to use the LLM's responses, you need the LLM to respond in the right format (for example, a code diff). So basically the numbers are artificially inflated because o1-mini is good in the special situation that this approach is measuring.
That in itself doesn't make this analysis useless; however, the fact that we don't know what the Elo rating represents is ultimately the main problem of this article. For those who wrote it: when you say all other analyses fail to provide a rating that captures how these models perform on coding tasks, you also claim that your method works better, but your paper makes the exact same mistake. People use LLMs for coding tasks very differently. For example, a company like ours that uses LLMs to automatically generate code has very different needs from an engineer using the LLM as a research tool. So the million-dollar question: exactly what does your rating represent? Which type of user should use o1 or qwen and for what tasks?
Thank you for your attention and inspiring questions. I am glad to discuss with you as the first author of this paper.
First, "exactly what does your rating represent?" The ratings we provide fall within the field of "competition-level code generation." This refers to an idealized scenario where we tackle clearly defined competition coding problems in the same manner as human competitors, without relying on retrieval aids.
Second, "Which type of user should use o1 or qwen and for what tasks?" This paper may not fully address this question. However, based on the results and our previous experiences, if a problem requires complex design and reasoning, o1-like reasoning models may prove useful. Conversely, for tasks that are knowledge-intensive, such as function calls from Python libraries, or require adherence to specific complex formats, reasoning models might not have such a big advantage. Besides, a notable benefit of Qwen compared to o1 is its open-source nature. If you are considering open-source models, our results indicate that the Qwen series is a strong choice across any parameter range.
In summary, our benchmark provides an assessment in "competition-level code generation," highlighting the model's capability for sophisticated code reasoning and allowing direct comparison with human coding competitors. However, there are also some engineering-focused benchmarks that may better suit application-oriented evaluations, such as MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering (from OpenAI), and FullStack Bench: Evaluating LLMs as Full Stack Coders (from Bytedance). You might consider reviewing these alongside our results for a broader perspective.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation (2024)
- GenX: Mastering Code and Test Generation with Execution Feedback (2024)
- Benchmarking Large Language Models with Integer Sequence Generation Tasks (2024)
- ConAIR:Consistency-Augmented Iterative Interaction Framework to Enhance the Reliability of Code Generation (2024)
- WarriorCoder: Learning from Expert Battles to Augment Code Large Language Models (2024)
- Evaluating ChatGPT-3.5 Efficiency in Solving Coding Problems of Different Complexity Levels: An Empirical Analysis (2024)
- GitChameleon: Unmasking the Version-Switching Capabilities of Code Generation Models (2024)
Why wasn't Gemini tested? They beat o1 in human-evaluated A/B testing currently, including the coding section. https://huggingface.co/spaces/lmarena-ai/chatbot-arena-leaderboard
There is still a gap between human-evaluated A/B testing and competition-level code generation, and we noticed that few works currently test Gemini.
But we will consider testing it and several other models in the next version of the paper.