---
title: Code Eval
emoji: 🤗
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
tags:
- evaluate
- metric
description: >-
  This metric implements code evaluation with execution across multiple languages as used in the paper "OctoPack: Instruction Tuning
  Code Large Language Models" (https://arxiv.org/abs/2308.07124).
---
# Metric Card for Code Eval
## Metric description
The CodeEval metric estimates the pass@k metric for code synthesis.
It implements the code execution for HumanEvalPack as described in the paper ["OctoPack: Instruction Tuning Code Large Language Models"](https://arxiv.org/abs/2308.07124).
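
To make "estimates pass@k" concrete, here is a minimal sketch of the unbiased estimator commonly used for this purpose, `1 - C(n-c, k) / C(n, k)`, where `n` is the number of candidates sampled for a problem and `c` is the number that pass the test. The function name `pass_at_k` is hypothetical, and this is an illustration of the formula under those assumptions, not a copy of the code in `code_eval_octopack.py`.

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n candidates, c of which pass."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing candidates: every k-subset contains a pass.
        return 1.0
    # 1 - C(n-c, k) / C(n, k), written as a product to avoid huge binomials.
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))


print(pass_at_k(n=2, c=1, k=1))  # 0.5
print(pass_at_k(n=2, c=1, k=2))  # 1.0
```

The value reported for a whole benchmark would then be the mean of these per-problem estimates.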
## How to use
The Code Eval metric measures how well a set of predictions passes a set of reference tests. Its arguments are:
- `predictions`: a list with one entry per problem. Each entry is a list of strings, where each string is a code candidate for solving that problem.
- `references`: a list with one test per entry in `predictions`. Each test should check the correctness of a code candidate.
- `k`: the numbers of code candidates to consider in the evaluation. The default value is `[1, 10, 100]`.
- `num_workers`: the number of workers used to evaluate the candidate programs. The default value is `4`.
- `timeout`: the maximum time a candidate program may run before it is counted as a "timeout". The default value is `3.0` (i.e. 3 seconds).
- `language`: the language to execute the code in. The default value is `python`; the alternatives are `javascript`, `java`, `go`, `cpp`, and `rust`.
- `cargo_string`: the cargo installations to perform for Rust. Defaults to some basic packages; see `code_eval_octopack.py`.
```python
from evaluate import load
code_eval = load("Muennighoff/code_eval_octopack")
test_cases = ["assert add(2,3)==5"]
candidates = [["def add(a,b): return a*b", "def add(a, b): return a+b"]]
pass_at_k, results = code_eval.compute(references=test_cases, predictions=candidates, k=[1, 2], language="python")
```
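
To illustrate how a candidate, its test, and the `timeout` argument relate, here is a rough sketch that joins one candidate with one test and runs the result in a child Python process. The helper `run_candidate` and its return strings are hypothetical, and joining candidate and test with a newline is an assumption about the Python case. The metric's own execution logic lives in `code_eval_octopack.py` and may differ. A plain subprocess is not a security sandbox.

```python
import subprocess
import sys


def run_candidate(candidate: str, test: str, timeout: float = 3.0) -> str:
    """Run candidate + test in a child interpreter and classify the outcome."""
    # Assumption: the test is simply appended after the candidate code.
    program = candidate + "\n" + test
    try:
        completed = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            timeout=timeout,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return "timed out"
    # A failed assert raises AssertionError, giving a non-zero exit code.
    return "passed" if completed.returncode == 0 else "failed"


test = "assert add(2,3)==5"
print(run_candidate("def add(a, b): return a+b", test))
print(run_candidate("def add(a,b): return a*b", test))
```

Under these assumptions the first call reports `passed` and the second `failed`, which matches the two candidates in the usage example above.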
N.B.
This metric exists to run untrusted model-generated code. Users are strongly encouraged not to do so outside of a robust security sandbox. Once you have taken the necessary precautions, and before running this metric, you will need to set the `HF_ALLOW_CODE_EVAL` environment variable to `"1"`. Use it at your own risk:
```python
import os
os.environ["HF_ALLOW_CODE_EVAL"] = "1"
```
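
If you would rather not leave the opt-in set for the whole process, one option is to scope it with a context manager. The sketch below uses a hypothetical helper, `allow_code_eval`, and restores the previous value on exit. This card does not say when the metric reads the variable, so the safest assumption is to wrap both the `load` and the `compute` calls.

```python
import os
from contextlib import contextmanager


@contextmanager
def allow_code_eval():
    """Temporarily set HF_ALLOW_CODE_EVAL=1, then restore the old state."""
    previous = os.environ.get("HF_ALLOW_CODE_EVAL")
    os.environ["HF_ALLOW_CODE_EVAL"] = "1"
    try:
        yield
    finally:
        if previous is None:
            # The variable was unset before; remove it again.
            os.environ.pop("HF_ALLOW_CODE_EVAL", None)
        else:
            os.environ["HF_ALLOW_CODE_EVAL"] = previous


with allow_code_eval():
    # Load and run the metric here, inside your sandbox.
    print(os.environ["HF_ALLOW_CODE_EVAL"])
```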
## Output values
The Code Eval metric returns two values:
- `pass_at_k`: a dictionary with the pass rate for each k value defined in the arguments.
- `results`: a dictionary with granular results of each unit test.
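
The exact layout of `results` is not spelled out here. The sketch below assumes a shape used by similar execution-based metrics: a dictionary mapping each problem index to a list of `(completion_id, outcome)` pairs, where `outcome` is a dictionary with a boolean `passed` field. Both that shape and the helper `count_passes` are assumptions, and the sample data is hand-written for illustration, so inspect the real output before relying on it.

```python
from typing import Any, Dict, List, Tuple

# Assumed shape: {task_id: [(completion_id, {"passed": bool, ...}), ...]}
Results = Dict[int, List[Tuple[int, Dict[str, Any]]]]


def count_passes(results: Results) -> Dict[int, int]:
    """Count how many candidates passed for each problem."""
    return {
        task_id: sum(1 for _, outcome in attempts if outcome["passed"])
        for task_id, attempts in results.items()
    }


# Hand-written sample in the assumed shape, not real metric output.
sample: Results = {
    0: [
        (0, {"task_id": 0, "completion_id": 0, "passed": False}),
        (1, {"task_id": 0, "completion_id": 1, "passed": True}),
    ]
}
print(count_passes(sample))  # {0: 1}
```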
## Examples
Full match at `k=1`:
```python
from evaluate import load
code_eval = load("Muennighoff/code_eval_octopack")
test_cases = ["assert add(2,3)==5"]
candidates = [["def add(a, b): return a+b"]]
pass_at_k, results = code_eval.compute(references=test_cases, predictions=candidates, k=[1], language="python")
print(pass_at_k)
# {'pass@1': 1.0}
```
No match at `k=1`:
```python
from evaluate import load
code_eval = load("Muennighoff/code_eval_octopack")
test_cases = ["assert add(2,3)==5"]
candidates = [["def add(a,b): return a*b"]]
pass_at_k, results = code_eval.compute(references=test_cases, predictions=candidates, k=[1], language="python")
print(pass_at_k)
# {'pass@1': 0.0}
```
Partial match at `k=1`, full match at `k=2`:
```python
from evaluate import load
code_eval = load("Muennighoff/code_eval_octopack")
test_cases = ["assert add(2,3)==5"]
candidates = [["def add(a, b): return a+b", "def add(a,b): return a*b"]]
pass_at_k, results = code_eval.compute(references=test_cases, predictions=candidates, k=[1, 2], language="python")
print(pass_at_k)
# {'pass@1': 0.5, 'pass@2': 1.0}
```
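
The numbers in the last example can be checked by brute force. Treating pass@k as the fraction of size-`k` subsets of the candidates that contain at least one passing candidate, one of the two size-1 subsets passes (0.5) and the only size-2 subset passes (1.0). The helper `pass_at_k_by_enumeration` is hypothetical and only illustrates the definition. It is not how the metric computes its result.

```python
from itertools import combinations
from typing import Sequence


def pass_at_k_by_enumeration(passed: Sequence[bool], k: int) -> float:
    """Fraction of size-k subsets that contain at least one passing candidate."""
    subsets = list(combinations(passed, k))
    return sum(any(subset) for subset in subsets) / len(subsets)


# One passing candidate (a+b) and one failing candidate (a*b).
outcomes = [True, False]
print(pass_at_k_by_enumeration(outcomes, 1))  # 0.5
print(pass_at_k_by_enumeration(outcomes, 2))  # 1.0
```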
## Citation
```bibtex
@article{muennighoff2023octopack,
  title={OctoPack: Instruction Tuning Code Large Language Models},
  author={Niklas Muennighoff and Qian Liu and Armel Zebaze and Qinkai Zheng and Binyuan Hui and Terry Yue Zhuo and Swayam Singh and Xiangru Tang and Leandro von Werra and Shayne Longpre},
  journal={arXiv preprint arXiv:2308.07124},
  year={2023}
}
```