upgrade: add benchmarks eval

2a26d3b 28 days ago

4.18 kB

	## 1. Introduction

	We provide a test script to evaluate the performance of the deepseek-coder model on code generation benchmarks. We select the widely-used benchmarks: [HumanEval-Python](https://huggingface.co/datasets/openai_humaneval), [HumanEval-Multilingual](https://huggingface.co/datasets/nuprl/MultiPL-E).



	## 2. Setup

	```
	pip install accelerate
	pip install attrdict
	pip install transformers
	pip install pytorch
	```


	## 3. Evaluation

	We've created a sample script, eval.sh, that demonstrates how to test the DeepSeek-Coder-1.3b-Base model on the HumanEval dataset leveraging 8 GPUs. If your use case involves a different model or dataset, simply adjust the script to fit your needs.

	Additionally, for various programming languages, the execution path may differ. Please ensure you update the appropriate paths in the humaneval/execution.py file accordingly.

	```bash
	MODEL_NAME_OR_PATH="deepseek-ai/deepseek-coder-1.3b-base"
	DATASET_ROOT="data/"
	LANGUAGE="python"
	python -m accelerate.commands.launch --config_file test_config.yaml eval_pal.py --logdir ${MODEL_NAME_OR_PATH} --language ${LANGUAGE} --dataroot ${DATASET_ROOT}
	```

	To evaluate the instruction-based model, please follow the script below:
	```bash
	LANG="python"
	OUPUT_DIR="output"
	MODEL="deepseek-coder-33b-instruct"

	CUDA_VISIBLE_DEVICES=0,1 python eval_instruct.py \
	--model "deepseek-ai/$MODEL" \
	--output_path "$OUPUT_DIR/${LANG}.$MODEL.jsonl" \
	--language $LANG \
	--temp_dir $OUPUT_DIR
	```

	## 4. Experimental Results

	We report experimental results here for 8 main-stream programming languages, python, c++, java, PHP, TypeScript, C#, Bash, and JavaScript. For all open-source models, we utilize this repository to obtain the performance of the models on the HumanEval dataset. We set the maximum input length to 4096 and the maximum output length to 500, and employ the greedy search strategy.


	#### (1) Multilingual Base Models

	\| Model \| Size \| Python \| C++ \| Java \| PHP \| TS \| C# \| Bash \| JS \| Avg \|
	\|-------------------\|------\|--------\|-------\|------\|------\|------\|------\|------\|------\|------\|
	\| code-cushman-001 \| 12B \| 33.5% \| 31.9% \| 30.6%\| 28.9%\| 31.3%\| 22.1%\| 11.7%\| - \| - \|
	\| CodeShell \| 7B \| 35.4% \| 32.9% \| 34.2%\| 31.7%\| 30.2%\| 38.0%\| 7.0% \| 33.5%\| 30.4%\|
	\| CodeGeeX2 \| 6B \| 36.0% \| 29.2% \| 25.9%\| 23.6%\| 20.8%\| 29.7%\| 6.3% \| 24.8%\| 24.5%\|
	\| StarCoderBase \| 16B \| 31.7% \| 31.1% \| 28.5%\| 25.4%\| 34.0%\| 34.8%\| 8.9% \| 29.8%\| 28.0%\|
	\| CodeLLama \| 7B \| 31.7% \| 29.8% \| 34.2%\| 23.6%\| 36.5%\| 36.7%\| 12.0%\| 29.2%\| 29.2%\|
	\| CodeLLama \| 13B \| 36.0% \| 37.9% \| 38.0%\| 34.2%\| 45.2%\| 43.0%\| 16.5%\| 32.3%\| 35.4%\|
	\| CodeLLama \| 34B \| 48.2% \| 44.7% \| 44.9%\| 41.0%\| 42.1%\| 48.7%\| 15.8%\| 42.2%\| 41.0%\|
	\| \| \| \| \| \| \| \| \| \| \| \|
	\| DeepSeek-Coder-Base\| 1.3B \| 34.8% \| 31.1% \| 32.3%\| 24.2%\| 28.9%\| 36.7%\| 10.1%\| 28.6%\| 28.3%\|
	\| DeepSeek-Coder-Base\| 5.7B \| 48.7% \| 45.3% \| 41.1%\| 39.7%\| 44.7%\| 41.1%\| 27.8%\| 42.2%\| 41.3%\|
	\| DeepSeek-Coder-Base\| 6.7B \| 49.4% \| 50.3% \| 43.0%\| 38.5%\| 49.7%\| 50.0%\| 28.5%\| 48.4%\| 44.7%\|
	\| DeepSeek-Coder-Base\|33B \| 56.1% \| 58.4% \| 51.9%\| 44.1%\| 52.8%\| 51.3%\| 32.3%\| 55.3%\| 50.3%\|

	#### (2) Instruction-Tuned Models
	\| Model \| Size \| Python \| C++ \| Java \| PHP \| TS \| C# \| Bash \| JS \| Avg \|
	\|---------------------\|------\|--------\|-------\|------\|------\|------\|------\|------\|------\|------\|
	\| GPT-3.5-Turbo \| - \| 76.2% \| 63.4% \| 69.2%\| 60.9%\| 69.1%\| 70.8%\| 42.4%\| 67.1%\| 64.9%\|
	\| GPT-4 \| - \| 84.1% \| 76.4% \| 81.6%\| 77.2%\| 77.4%\| 79.1%\| 58.2%\| 78.0%\| 76.5%\|
	\| \| \| \| \| \| \| \| \| \| \| \|
	\| DeepSeek-Coder-Instruct \| 1.3B \| 65.2% \| 45.3% \| 51.9% \| 45.3% \| 59.7% \|55.1% \| 12.7% \| 52.2% \| 48.4% \|
	\| DeepSeek-Coder-Instruct \| 6.7B \| 78.9% \| 63.4% \| 68.4% \| 68.9%\| 67.2%\| 72.8%\| 36.7%\| 72.7%\| 66.1%\|
	\| DeepSeek-Coder-Instruct \| 33B \| 79.3% \| 68.9% \| 73.4% \| 72.7%\| 67.9%\| 74.1%\| 43.0%\| 73.9%\| 69.2%\|