## 1. Introduction

We provide a test script for evaluating the **deepseek-coder** models on code generation benchmarks. We use two widely used benchmarks: **[HumanEval-Python](https://huggingface.co/datasets/openai_humaneval)** and **[HumanEval-Multilingual](https://huggingface.co/datasets/nuprl/MultiPL-E)**.

## 2. Setup

```bash
pip install accelerate
pip install attrdict
pip install transformers
pip install torch
```
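
After installation, a quick check can confirm that the packages are importable. The following is a minimal sketch and not part of the repository: the helper name `check_packages` is made up for illustration, and it assumes PyTorch is imported under the module name `torch`.

```python
import importlib.util

# Import names of the packages listed in the setup step.
# Assumption: PyTorch is imported as `torch`.
REQUIRED = ("accelerate", "attrdict", "transformers", "torch")


def check_packages(names=REQUIRED):
    """Return {name: True/False} depending on whether each module can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_packages().items():
        print(f"{name}: {'ok' if found else 'MISSING'}")
```
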
## 3. Evaluation

We provide a sample script, **eval.sh**, that shows how to evaluate the **DeepSeek-Coder-1.3b-Base** model on the HumanEval dataset using **8** GPUs. To evaluate a different model or dataset, adjust the script accordingly.

The execution paths may differ across programming languages. Make sure to update the corresponding paths in **humaneval/execution.py**.

```bash
MODEL_NAME_OR_PATH="deepseek-ai/deepseek-coder-1.3b-base"
DATASET_ROOT="data/"
LANGUAGE="python"
python -m accelerate.commands.launch --config_file test_config.yaml eval_pal.py \
    --logdir "${MODEL_NAME_OR_PATH}" \
    --language "${LANGUAGE}" \
    --dataroot "${DATASET_ROOT}"
```
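
As noted above, the execution paths in **humaneval/execution.py** depend on the local environment. Before editing them, it can help to see which toolchains are actually on `PATH`. The sketch below is an illustration only: the helper `find_toolchains` and the language-to-executable mapping are assumptions, not taken from the repository, so adjust the names to match your setup.

```python
import shutil

# Hypothetical mapping from benchmark language to the executable it needs.
# These names are assumptions; the repository may expect different ones.
CANDIDATE_TOOLS = {
    "python": "python3",
    "cpp": "g++",
    "java": "javac",
    "php": "php",
    "ts": "tsc",
    "cs": "mcs",
    "sh": "bash",
    "js": "node",
}


def find_toolchains(tools=CANDIDATE_TOOLS):
    """Map each language to the absolute path of its executable, or None if absent."""
    return {lang: shutil.which(exe) for lang, exe in tools.items()}


if __name__ == "__main__":
    for lang, path in find_toolchains().items():
        print(f"{lang:>6}: {path or 'not found'}")
```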

To evaluate an instruction-tuned model, use the following script:

```bash
LANG="python"
OUTPUT_DIR="output"
MODEL="deepseek-coder-33b-instruct"

CUDA_VISIBLE_DEVICES=0,1 python eval_instruct.py \
    --model "deepseek-ai/$MODEL" \
    --output_path "$OUTPUT_DIR/${LANG}.$MODEL.jsonl" \
    --language "$LANG" \
    --temp_dir "$OUTPUT_DIR"
```
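
The command above writes its results to a `.jsonl` file. The record schema is not described here, so the sketch below makes no assumption about field names: it only loads the records, counts them, and lists the keys of the first one. The helper name `load_jsonl` is illustrative and not part of the repository.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read a JSON Lines file into a list of records, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


if __name__ == "__main__":
    # Same layout as the --output_path argument in the script above.
    output = Path("output") / "python.deepseek-coder-33b-instruct.jsonl"
    if output.exists():
        records = load_jsonl(output)
        print(f"{len(records)} records")
        if records:
            print("keys:", sorted(records[0]))
    else:
        print(f"{output} not found; run the evaluation first")
```
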
## 4. Experimental Results

We report results for 8 mainstream programming languages: **Python**, **C++**, **Java**, **PHP**, **TypeScript**, **C#**, **Bash**, and **JavaScript**. For all open-source models, we use this repository to measure performance on the HumanEval dataset. We set the maximum input length to **4096** and the maximum output length to **500**, and use **greedy search** for decoding.
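
For readers reproducing these settings outside the provided scripts, the sketch below shows one way they could map onto the Hugging Face `transformers` API. It assumes that both lengths are counted in tokens and that greedy search corresponds to `do_sample=False` with a single beam; the helper names are illustrative, and the repository's scripts may configure generation differently.

```python
# Settings quoted in the text above (assumed to be token counts).
MAX_INPUT_TOKENS = 4096  # maximum input length
MAX_NEW_TOKENS = 500     # maximum output length


def tokenizer_kwargs():
    """Keyword arguments for a tokenizer call that cap the prompt length."""
    return {"truncation": True, "max_length": MAX_INPUT_TOKENS, "return_tensors": "pt"}


def generation_kwargs():
    """Keyword arguments for `model.generate` that select greedy decoding."""
    return {"max_new_tokens": MAX_NEW_TOKENS, "do_sample": False, "num_beams": 1}


# Usage with transformers (not executed here, requires a loaded model):
#   inputs = tokenizer(prompt, **tokenizer_kwargs()).to(model.device)
#   outputs = model.generate(**inputs, **generation_kwargs())
```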
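
The **Avg** column in the tables below is consistent with the unweighted mean of the eight per-language scores. As a quick arithmetic check, the sketch below recomputes it for the DeepSeek-Coder-Base 33B row; the helper name `average_score` is illustrative.

```python
def average_score(scores):
    """Unweighted mean of per-language scores, rounded to one decimal place."""
    return round(sum(scores) / len(scores), 1)


# DeepSeek-Coder-Base 33B: Python, C++, Java, PHP, TS, C#, Bash, JS (in %).
base_33b = [56.1, 58.4, 51.9, 44.1, 52.8, 51.3, 32.3, 55.3]

print(average_score(base_33b))  # 50.3, matching the table
```
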
#### (1) Multilingual Base Models

| Model               | Size | Python    | C++       | Java      | PHP       | TS        | C#        | Bash      | JS        | Avg       |
|---------------------|------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
| code-cushman-001    | 12B  | 33.5%     | 31.9%     | 30.6%     | 28.9%     | 31.3%     | 22.1%     | 11.7%     | -         | -         |
| CodeShell           | 7B   | 35.4%     | 32.9%     | 34.2%     | 31.7%     | 30.2%     | 38.0%     | 7.0%      | 33.5%     | 30.4%     |
| CodeGeeX2           | 6B   | 36.0%     | 29.2%     | 25.9%     | 23.6%     | 20.8%     | 29.7%     | 6.3%      | 24.8%     | 24.5%     |
| StarCoderBase       | 16B  | 31.7%     | 31.1%     | 28.5%     | 25.4%     | 34.0%     | 34.8%     | 8.9%      | 29.8%     | 28.0%     |
| CodeLLama           | 7B   | 31.7%     | 29.8%     | 34.2%     | 23.6%     | 36.5%     | 36.7%     | 12.0%     | 29.2%     | 29.2%     |
| CodeLLama           | 13B  | 36.0%     | 37.9%     | 38.0%     | 34.2%     | 45.2%     | 43.0%     | 16.5%     | 32.3%     | 35.4%     |
| CodeLLama           | 34B  | 48.2%     | 44.7%     | 44.9%     | 41.0%     | 42.1%     | 48.7%     | 15.8%     | 42.2%     | 41.0%     |
|                     |      |           |           |           |           |           |           |           |           |           |
| DeepSeek-Coder-Base | 1.3B | 34.8%     | 31.1%     | 32.3%     | 24.2%     | 28.9%     | 36.7%     | 10.1%     | 28.6%     | 28.3%     |
| DeepSeek-Coder-Base | 5.7B | 48.7%     | 45.3%     | 41.1%     | 39.7%     | 44.7%     | 41.1%     | 27.8%     | 42.2%     | 41.3%     |
| DeepSeek-Coder-Base | 6.7B | 49.4%     | 50.3%     | 43.0%     | 38.5%     | 49.7%     | 50.0%     | 28.5%     | 48.4%     | 44.7%     |
| DeepSeek-Coder-Base | 33B  | **56.1%** | **58.4%** | **51.9%** | **44.1%** | **52.8%** | **51.3%** | **32.3%** | **55.3%** | **50.3%** |

#### (2) Instruction-Tuned Models

| Model                   | Size | Python    | C++       | Java      | PHP       | TS        | C#        | Bash      | JS        | Avg       |
|-------------------------|------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
| GPT-3.5-Turbo           | -    | 76.2%     | 63.4%     | 69.2%     | 60.9%     | 69.1%     | 70.8%     | 42.4%     | 67.1%     | 64.9%     |
| GPT-4                   | -    | **84.1%** | **76.4%** | **81.6%** | **77.2%** | **77.4%** | **79.1%** | **58.2%** | **78.0%** | **76.5%** |
|                         |      |           |           |           |           |           |           |           |           |           |
| DeepSeek-Coder-Instruct | 1.3B | 65.2%     | 45.3%     | 51.9%     | 45.3%     | 59.7%     | 55.1%     | 12.7%     | 52.2%     | 48.4%     |
| DeepSeek-Coder-Instruct | 6.7B | 78.9%     | 63.4%     | 68.4%     | 68.9%     | 67.2%     | 72.8%     | 36.7%     | 72.7%     | 66.1%     |
| DeepSeek-Coder-Instruct | 33B  | **79.3%** | **68.9%** | **73.4%** | **72.7%** | **67.9%** | **74.1%** | **43.0%** | **73.9%** | **69.2%** |