Code Evaluation

A Survey on Language Models for Code

Paper • 2311.07989 • Published Nov 14, 2023 • 22

Note Great overview with many additional references. Frequently updated list: https://github.com/codefuse-ai/Awesome-Code-LLM

Evaluating Large Language Models Trained on Code

Paper • 2107.03374 • Published Jul 7, 2021 • 8

Note introduces HumanEval and pass@k
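As a quick illustration of pass@k: with n samples generated per problem and c of them passing the unit tests, the estimator 1 - C(n-c, k) / C(n, k) gives the probability that at least one of k randomly chosen samples is correct. The sketch below is a minimal version of that calculation, not the paper's reference code; the function names `pass_at_k` and `mean_pass_at_k` are made up for this example.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # C(n - c, k) / C(n, k) is the chance that a random size-k subset
    # contains no correct sample; math.comb returns 0 when k > n - c.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems given as (n, c) pairs (hypothetical helper)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

Averaging per-problem estimates, as `mean_pass_at_k` does, gives a benchmark-level score; sampling n larger than k reduces the variance of the estimate.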

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Paper • 2310.06770 • Published Oct 10, 2023 • 5

Note Has already been used for marketing (Devin). The most promising benchmark currently: it models the actual work of a software engineer by using GitHub issues as inputs and the whole repository as a resource. This benchmark tests systems, not just models, so agent-like managers and retrieval systems are allowed. Rapidly advancing leaderboard: https://www.swebench.com/

CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation

Paper • 2102.04664 • Published Feb 9, 2021 • 2

Note An older work consisting of several easy tasks for encoder and decoder models, for example line completion or a min/max cloze test.

Out of the BLEU: how should we assess quality of the Code Generation models?

Paper • 2208.03133 • Published Aug 5, 2022 • 2

Note human judgement doesn't agree with static metrics (BLEU, ChrF, RUBY ...)

ReCode: Robustness Evaluation of Code Generation Models

Paper • 2212.10264 • Published Dec 20, 2022 • 1

Note perturbations on docstrings/prompts

1.15k

Big Code Models Leaderboard

📈

Submit code models for evaluation on benchmarks

Textbooks Are All You Need

Paper • 2306.11644 • Published Jun 20, 2023 • 142

Note phi-1 model, novel evaluation problems that combine two tasks to make data contamination less likely.

Textbooks Are All You Need II: phi-1.5 technical report

Paper • 2309.05463 • Published Sep 11, 2023 • 87

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

Paper • 2403.07974 • Published Mar 12, 2024 • 1

Note Annotates problems by month to spot potential contamination.
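One way to picture the date-based idea is to bucket results by the month a problem appeared and compare pass rates before and after a model's training cutoff; a clear drop after the cutoff hints at contamination of the earlier problems. This is only a sketch under assumptions: the `Problem` record, its field names, and both helper functions are hypothetical and not taken from the benchmark's code.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Problem:
    title: str
    released: date  # date the problem was first published (assumed field)
    passed: bool    # whether the model under test solved it


def pass_rate_by_month(problems: list[Problem]) -> dict[str, float]:
    """Group problems by 'YYYY-MM' and return the fraction solved per month."""
    buckets: dict[str, tuple[int, int]] = {}
    for p in problems:
        key = f"{p.released.year:04d}-{p.released.month:02d}"
        solved, total = buckets.get(key, (0, 0))
        buckets[key] = (solved + int(p.passed), total + 1)
    return {key: solved / total for key, (solved, total) in sorted(buckets.items())}


def after_cutoff(problems: list[Problem], cutoff: date) -> list[Problem]:
    """Keep only problems released strictly after the model's training cutoff."""
    return [p for p in problems if p.released > cutoff]
```

Scoring a model only on `after_cutoff(...)` restricts the evaluation to problems it could not have seen during training, provided the cutoff date itself is reliable.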

CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion

Paper • 2310.11248 • Published Oct 17, 2023 • 4

CodeBERTScore: Evaluating Code Generation with Pretrained Models of Code

Paper • 2302.05527 • Published Feb 10, 2023 • 1

A Static Evaluation of Code Completion by Large Language Models

Paper • 2306.03203 • Published Jun 5, 2023 • 3

Note Model context is the whole file up until the function header; the ground truth is the source file. Completions are checked first with the AST, then with a linter. Errors in the context cause errors in the generation. Undefined-name errors are the most common; EOF errors are due to generation length.
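The first stage of such a static check can be sketched with Python's standard `ast` module: join the context and the completion, try to parse the result, and record any syntax error. This is a simplified illustration rather than the paper's pipeline; the function name `syntax_check` is an assumption, and the linter stage (which would catch undefined names) is left out.

```python
import ast
from typing import Optional


def syntax_check(context: str, completion: str) -> Optional[str]:
    """Return None if context + completion parses, else a short error summary."""
    try:
        ast.parse(context + completion)
    except SyntaxError as err:  # IndentationError is a subclass of SyntaxError
        return f"{type(err).__name__}: {err.msg} (line {err.lineno})"
    return None


# A complete function body parses cleanly.
ok = syntax_check("def add(a, b):\n", "    return a + b\n")

# A completion cut off mid-expression, as can happen when generation hits
# its length limit, fails to parse.
truncated = syntax_check("def add(a, b):\n", "    return (a +\n")
```

Because no code is executed, a check like this scales to large corpora without test cases, at the cost of saying nothing about functional correctness.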

Large Language Models Are State-of-the-Art Evaluators of Code Generation

Paper • 2304.14317 • Published Apr 27, 2023 • 2

Measuring Coding Challenge Competence With APPS

Paper • 2105.09938 • Published May 20, 2021 • 1

MultiPL-E: A Scalable and Extensible Approach to Benchmarking Neural Code Generation

Paper • 2208.08227 • Published Aug 17, 2022 • 1

Program Synthesis with Large Language Models

Paper • 2108.07732 • Published Aug 16, 2021 • 4

Note introduces Mostly Basic Programming Problems (MBPP)

RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation

Paper • 2303.12570 • Published Mar 22, 2023

Boldly Going Where No Benchmark Has Gone Before: Exposing Bias and Shortcomings in Code Generation Evaluation

Paper • 2401.03855 • Published Jan 8, 2024

Note has been renamed "Python Saga". Figure 6 is a great sign of saturation.

NoFunEval: Funny How Code LMs Falter on Requirements Beyond Functional Correctness

Paper • 2401.15963 • Published Jan 29, 2024

Note "Non-functional": focuses on instruct models, with classification tasks and non-functional requirements such as efficiency, security, and maintainability. Still relies on gold labels via DiffBLEU. Filters out examples larger than 3k tokens using the StarCoder tokenizer. GPT-4 seems to be really good on these tasks, though that might involve a lot of prompt engineering.

DevEval: Evaluating Code Generation in Practical Software Projects

Paper • 2401.06401 • Published Jan 12, 2024

Note this paper has been withdrawn!

Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation

Paper • 2305.01210 • Published May 2, 2023 • 3

Note HumanEvalPlus, with additional test cases. Has a great figure ranking HumanEval problems by pass rate, showing that some of them are much easier than others.

StudentEval: A Benchmark of Student-Written Prompts for Large Language Models of Code

Paper • 2306.04556 • Published Jun 7, 2023

CreativEval: Evaluating Creativity of LLM-Based Hardware Code Generation

Paper • 2404.08806 • Published Apr 12, 2024

Note Verilog generation. Creativity here is defined as "the capacity to think innovatively—the ability to formulate new solutions or connections that are effective and unconventional [11]", where the reference is https://doi.org/10.1080/10400419.2012.650092. "Fluency" is how many of the pass@k variants are unique? (But this will be skewed towards larger models, right? They sample wider.)

Benchmarking Language Model Creativity: A Case Study on Code Generation

Paper • 2407.09007 • Published Jul 12, 2024 • 3

Note NeoCoder: denial prompting to get novel approaches, even outside of "historical human solutions". (cont)

CodeEditorBench: Evaluating Code Editing Capability of Large Language Models

Paper • 2404.03543 • Published Apr 4, 2024 • 16

Can It Edit? Evaluating the Ability of Large Language Models to Follow Code Editing Instructions

Paper • 2312.12450 • Published Dec 11, 2023 • 1

182

BigCodeBench Leaderboard

🥇

Explore and analyze code evaluation data

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

Paper • 2406.15877 • Published Jun 22, 2024 • 46

Execution-Based Evaluation for Open-Domain Code Generation

Paper • 2212.10481 • Published Dec 20, 2022 • 1

On Leakage of Code Generation Evaluation Datasets

Paper • 2407.07565 • Published Jul 10, 2024 • 6

Note Direct contamination: every HumanEval problem appears at least 43 times on GitHub. Indirect contamination: high similarity with synthetic rephrases used for instruction learning. Overfitting on benchmarks: HumanEval and MBPP should be considered "dev sets". Also adds LBPP (less basic ....).

Quantifying Contamination in Evaluating Code Generation Capabilities of Language Models

Paper • 2403.04811 • Published Mar 6, 2024

ClassEval: A Manually-Crafted Benchmark for Evaluating LLMs on Class-level Code Generation

Paper • 2308.01861 • Published Aug 3, 2023 • 1

Lost in Translation: A Study of Bugs Introduced by Large Language Models while Translating Code

Paper • 2308.03109 • Published Aug 6, 2023

InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models

Paper • 2404.07940 • Published Mar 11, 2024

Top Leaderboard Ranking = Top Coding Proficiency, Always? EvoEval: Evolving Coding Benchmarks via LLM

Paper • 2403.19114 • Published Mar 28, 2024

A Systematic Evaluation of Large Language Models of Code

Paper • 2202.13169 • Published Feb 26, 2022 • 1

What's Wrong with Your Code Generated by Large Language Models? An Extensive Study

Paper • 2407.06153 • Published Jul 8, 2024

To Code, or Not To Code? Exploring Impact of Code in Pre-training

Paper • 2408.10914 • Published Aug 20, 2024 • 42

DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation

Paper • 2211.11501 • Published Nov 18, 2022

mHumanEval -- A Multilingual Benchmark to Evaluate Large Language Models for Code Generation

Paper • 2410.15037 • Published Oct 19, 2024

Can Ai Code Results