---
title: README
emoji: 🔥
colorFrom: blue
colorTo: purple
sdk: static
pinned: false
---

# EvalPlus: Rigorous Evaluation of LLMs for Code Generation

## About

EvalPlus evaluates LLM-generated code on two axes:

* Code Correctness: HumanEval+ and MBPP+
* Code Efficiency: EvalPerf

## Resources

* 💻 **GitHub Repo**: [evalplus/evalplus](https://github.com/evalplus/evalplus)
* 🏆 **Leaderboard**: [evalplus.github.io](https://evalplus.github.io)
* 📜 **Papers**: [EvalPlus@NeurIPS'23](https://arxiv.org/abs/2305.01210), [EvalPerf@COLM'24](https://arxiv.org/abs/2408.06450)
* 🐍 **Python Package**: [PyPI](https://pypi.org/project/evalplus/)

## Citations

```bibtex
@inproceedings{evalplus,
  title     = {Is Your Code Generated by Chat{GPT} Really Correct? Rigorous Evaluation of Large Language Models for Code Generation},
  author    = {Liu, Jiawei and Xia, Chunqiu Steven and Wang, Yuyao and Zhang, Lingming},
  booktitle = {Thirty-seventh Conference on Neural Information Processing Systems},
  year      = {2023},
  url       = {https://openreview.net/forum?id=1qvx610Cu7},
}

@inproceedings{evalperf,
  title     = {Evaluating Language Models for Efficient Code Generation},
  author    = {Liu, Jiawei and Xie, Songrun and Wang, Junhao and Wei, Yuxiang and Ding, Yifeng and Zhang, Lingming},
  booktitle = {First Conference on Language Modeling},
  year      = {2024},
  url       = {https://openreview.net/forum?id=IBCBMeAhmC},
}
```
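## Example

To give a feel for the workflow, here is a minimal sketch of producing a `samples.jsonl` file for HumanEval+ scoring with the PyPI package. It assumes the `evalplus.data` helpers (`get_human_eval_plus`, `write_jsonl`); `generate_one` is a hypothetical stand-in for your model call. See the GitHub repo for authoritative usage.

```python
# Minimal sketch: build samples.jsonl for HumanEval+ evaluation.
# Assumes `pip install evalplus`; `generate_one` is a hypothetical
# placeholder for a real LLM completion call.
from evalplus.data import get_human_eval_plus, write_jsonl

def generate_one(prompt: str) -> str:
    # Replace with your model's completion of the function stub.
    return "    return ..."

samples = [
    {"task_id": task_id, "solution": generate_one(problem["prompt"])}
    for task_id, problem in get_human_eval_plus().items()
]
write_jsonl("samples.jsonl", samples)
```

The resulting file can then be scored with the repo's evaluation entry point (e.g. `evalplus.evaluate --dataset humaneval --samples samples.jsonl`; see the GitHub README for the exact command and options).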