Code Evaluation
Collection of Papers on Code Evaluation (from code generation language models)
Paper • 2311.07989 • Published • 21
Note: Great overview with a lot of additional references. Frequently updated list: https://github.com/codefuse-ai/Awesome-Code-LLM
Evaluating Large Language Models Trained on Code
Paper • 2107.03374 • Published • 6
Note: Introduces HumanEval and pass@k.
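For reference, pass@k for a single problem can be estimated from n sampled completions of which c pass the unit tests, as 1 - C(n-c, k) / C(n, k). A minimal sketch of that estimator follows; the function name `pass_at_k` and its argument checks are choices made for this illustration, not code taken from the paper.

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of completions sampled for the problem
    c: number of those completions that pass all tests
    k: budget of attempts being scored (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # Probability that a random subset of k samples contains no passing one.
    all_fail = math.comb(n - c, k) / math.comb(n, k)
    return 1.0 - all_fail
```

The benchmark score is then the mean of this value over all problems.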
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Paper • 2310.06770 • Published • 3
Note: Has already been used for marketing (Devin). Most promising benchmark currently. It models the work of an actual software engineer by using GitHub issues as inputs and the whole repository as a resource. This benchmark tests systems, not just models, so the evaluated system can include agent-like managers and retrieval components. Rapidly advancing leaderboard: https://www.swebench.com/
CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation
Paper • 2102.04664 • Published • 2
Note: An older work that consists of several easy tasks for encoder and decoder models, for example line completion or a min/max cloze test.
Out of the BLEU: how should we assess quality of the Code Generation models?
Paper • 2208.03133 • Published • 2
Note: Human judgement doesn't agree with static metrics (BLEU, ChrF, RUBY, ...).
ReCode: Robustness Evaluation of Code Generation Models
Paper • 2212.10264 • Published • 1
Note: Perturbations on docstrings/prompts.
Big Code Models Leaderboard
Space • Running • 839
Textbooks Are All You Need
Paper • 2306.11644 • Published • 141
Note: phi-1 model; novel evaluation problems that combine two tasks to make data contamination less likely.
Textbooks Are All You Need II: phi-1.5 technical report
Paper • 2309.05463 • Published • 84
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
Paper • 2403.07974 • Published • 1
Note: Annotates problems by release month to spot potential contamination.
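The idea of dating problems can be sketched as follows: group per-problem results by release month, or split them around a model's training cutoff, and look for a drop in pass rate on newer problems. This is only an illustration of the idea; the `Problem` record, its field names, and both helper functions are hypothetical and not part of the LiveCodeBench code.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    problem_id: str   # hypothetical identifier
    released: date    # date the problem was first published
    passed: bool      # whether the model under test solved it


def pass_rate_by_month(problems):
    """Return {(year, month): pass rate}, sorted chronologically."""
    buckets = {}
    for p in problems:
        key = (p.released.year, p.released.month)
        solved, total = buckets.get(key, (0, 0))
        buckets[key] = (solved + int(p.passed), total + 1)
    return {key: solved / total for key, (solved, total) in sorted(buckets.items())}


def split_by_cutoff(problems, cutoff):
    """Split problems into those released before and on/after a training cutoff."""
    before = [p for p in problems if p.released < cutoff]
    after = [p for p in problems if p.released >= cutoff]
    return before, after
```

A sharp drop in pass rate for months after the cutoff is a hint, not proof, that earlier problems leaked into the training data.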
CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion
Paper • 2310.11248 • Published • 3
CodeBERTScore: Evaluating Code Generation with Pretrained Models of Code
Paper • 2302.05527 • Published • 1
A Static Evaluation of Code Completion by Large Language Models
Paper • 2306.03203 • Published • 3
Note: The model context is the whole file up until the function header; the ground truth is the source file. Completions are checked first with an AST parse, then with a linter. Errors in the context cause errors in the generation. Undefined-name errors are the most common; EOF errors are due to generation length.
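The first stage of such a static check can be approximated with Python's built-in `ast` module, with no code execution involved. The sketch below covers only the parse step and leaves out the linter stage; the function name `syntax_check` and the return format are assumptions made for this illustration, not the paper's tooling.

```python
import ast


def syntax_check(context: str, completion: str):
    """Parse context + completion; return an error description or None.

    context: file contents up to and including the function header
    completion: model-generated function body
    """
    source = context + completion
    try:
        ast.parse(source)
    except SyntaxError as err:
        # IndentationError and TabError are subclasses of SyntaxError.
        return f"{type(err).__name__}: {err.msg}"
    return None
```

A completion cut off by the generation length limit, for example one ending inside an open parenthesis, is reported as a syntax error here, which matches the kind of EOF failure mentioned in the note.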
Large Language Models Are State-of-the-Art Evaluators of Code Generation
Paper • 2304.14317 • Published • 2
Measuring Coding Challenge Competence With APPS
Paper • 2105.09938 • Published • 1
MultiPL-E: A Scalable and Extensible Approach to Benchmarking Neural Code Generation
Paper • 2208.08227 • Published • 1
Program Synthesis with Large Language Models
Paper • 2108.07732 • Published • 3
Note: Introduces Mostly Basic Programming Problems (MBPP).
RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation
Paper • 2303.12570 • Published
RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems
Paper • 2306.03091 • Published • 1
TACO: Topics in Algorithmic COde generation dataset
Paper • 2312.14852 • Published • 4
CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution
Paper • 2401.03065 • Published • 10
Can Ai Code Results
Space • Running • 365
Boldly Going Where No Benchmark Has Gone Before: Exposing Bias and Shortcomings in Code Generation Evaluation
Paper • 2401.03855 • Published
NoFunEval: Funny How Code LMs Falter on Requirements Beyond Functional Correctness
Paper • 2401.15963 • Published
Note: "Non-functional": focuses on instruct models, has classification tasks and non-functional requirements like efficiency, security, and maintainability. Still relies on gold labels using DiffBLEU. Filters out examples larger than 3k tokens using the StarCoder tokenizer. GPT-4 seems to be really good on these tasks, though that might involve a lot of prompt engineering.
DevEval: Evaluating Code Generation in Practical Software Projects
Paper • 2401.06401 • Published
Note: This paper has been withdrawn!
Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation
Paper • 2305.01210 • Published • 4
Note: HumanEvalPlus, with additional test cases. Has a great figure ranking HumanEval problems by pass rate, showing that some of them are much easier than others.
StudentEval: A Benchmark of Student-Written Prompts for Large Language Models of Code
Paper • 2306.04556 • Published
Can It Edit? Evaluating the Ability of Large Language Models to Follow Code Editing Instructions
Paper • 2312.12450 • Published • 1
CreativEval: Evaluating Creativity of LLM-Based Hardware Code Generation
Paper • 2404.08806 • Published
Note: Verilog generation. Creativity is defined here as follows: "This term refers to the capacity to think innovatively—the ability to formulate new solutions or connections that are effective and unconventional [11]." Reference [11] is: https://doi.org/10.1080/10400419.2012.650092
CodeEditorBench: Evaluating Code Editing Capability of Large Language Models
Paper • 2404.03543 • Published • 15