code_eval_octopack

Runtime error

App Files Files Community

code_eval_octopack / README.md

Muennighoff

Update README.md

87bb18f over 1 year ago

preview code

raw

history blame

4.07 kB

	---
	title: Code Eval
	emoji: 🤗
	colorFrom: blue
	colorTo: red
	sdk: gradio
	sdk_version: 3.19.1
	app_file: app.py
	pinned: false
	tags:
	- evaluate
	- metric
	description: >-
	This metric implements code evaluation with execution across multiple languages as used in the paper "OctoPack: Instruction Tuning
	Code Large Language Models" (https://arxiv.org/abs/2308.07124).
	---

	# Metric Card for Code Eval

	## Metric description

	The CodeEval metric estimates the pass@k metric for code synthesis.

	It implements the code exection for HumanEvalPack as described in the paper ["OctoPack: Instruction Tuning Code Large Language Model"](https://arxiv.org/abs/2308.07124).


	## How to use

	The Code Eval metric calculates how good are predictions given a set of references. Its arguments are:

	`predictions`: a list of candidates to evaluate. Each candidate should be a list of strings with several code candidates to solve the problem.

	`references`: a list with a test for each prediction. Each test should evaluate the correctness of a code candidate.

	`k`: number of code candidates to consider in the evaluation. The default value is `[1, 10, 100]`.

	`num_workers`: the number of workers used to evaluate the candidate programs (The default value is `4`).

	`timeout`: The maximum time taken to produce a prediction before it is considered a "timeout". The default value is `3.0` (i.e. 3 seconds).

	`language`: Which language to execute the code in. The default value is `python` and alternatives are `javascript`, `java`, `go`, `cpp`, `rust`

	`cargo_string`: The cargo installations to perform for Rust. Defaults to some basic packages, see `code_eval_octopack.py`.

	```python
	from evaluate import load
	code_eval = load("Muennighoff/code_eval_octopack")
	test_cases = ["assert add(2,3)==5"]
	candidates = [["def add(a,b): return a*b", "def add(a, b): return a+b"]]
	pass_at_k, results = code_eval.compute(references=test_cases, predictions=candidates, k=[1, 2], language="python")
	```

	N.B.
	This metric exists to run untrusted model-generated code. Users are strongly encouraged not to do so outside of a robust security sandbox. Before running this metric and once you've taken the necessary precautions, you will need to set the `HF_ALLOW_CODE_EVAL` environment variable. Use it at your own risk:
	```python
	import os
	os.environ["HF_ALLOW_CODE_EVAL"] = "1"`
	```

	## Output values

	The Code Eval metric outputs two things:

	`pass_at_k`: a dictionary with the pass rates for each k value defined in the arguments.

	`results`: a dictionary with granular results of each unit test.

	## Examples

	Full match at `k=1`:

	```python
	from evaluate import load
	code_eval = load("Muennighoff/code_eval_octopack")
	test_cases = ["assert add(2,3)==5"]
	candidates = [["def add(a, b): return a+b"]]
	pass_at_k, results = code_eval.compute(references=test_cases, predictions=candidates, k=[1], language="python")
	print(pass_at_k)
	{'pass@1': 1.0}
	```

	No match for k = 1:

	```python
	from evaluate import load
	code_eval = load("Muennighoff/code_eval_octopack")
	test_cases = ["assert add(2,3)==5"]
	candidates = [["def add(a,b): return a*b"]]
	pass_at_k, results = code_eval.compute(references=test_cases, predictions=candidates, k=[1], language="python")
	print(pass_at_k)
	{'pass@1': 0.0}
	```

	Partial match at k=1, full match at k=2:

	```python
	from evaluate import load
	code_eval = load("Muennighoff/code_eval_octopack")
	test_cases = ["assert add(2,3)==5"]
	candidates = [["def add(a, b): return a+b", "def add(a,b): return a*b"]]
	pass_at_k, results = code_eval.compute(references=test_cases, predictions=candidates, k=[1, 2], language="python")
	print(pass_at_k)
	{'pass@1': 0.5, 'pass@2': 1.0}
	```

	## Citation

	```bibtex
	@article{muennighoff2023octopack,
	title={OctoPack: Instruction Tuning Code Large Language Models},
	author={Niklas Muennighoff and Qian Liu and Armel Zebaze and Qinkai Zheng and Binyuan Hui and Terry Yue Zhuo and Swayam Singh and Xiangru Tang and Leandro von Werra and Shayne Longpre},
	journal={arXiv preprint arXiv:2308.07124},
	year={2023}
	}
	```