# Code Interpreter Benchmark
## Introduction
To assess an LLM's ability to use the Python Code Interpreter for tasks such as mathematical problem solving, data visualization, and other general-purpose tasks such as file handling and web scraping, we have created and open-sourced a benchmark specifically designed to evaluate these capabilities.
### Metrics
The metrics are divided into two parts: code executability and code correctness.
- Code executability: whether the LLM-generated code can be executed without errors.
- Code correctness: whether the LLM-generated code produces the correct result when executed.
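As a rough illustration only (not the benchmark's actual implementation), code executability can be estimated by attempting to run each generated snippet, while code correctness additionally compares the execution output against a reference answer, as in the math setting. The sample format and field names below are assumptions:

```python
import subprocess

def run_snippet(code: str, timeout: int = 30):
    """Run a generated snippet in a subprocess; return (executed_ok, stdout)."""
    try:
        proc = subprocess.run(["python", "-c", code],
                              capture_output=True, text=True, timeout=timeout)
        return proc.returncode == 0, proc.stdout.strip()
    except subprocess.TimeoutExpired:
        return False, ""

def evaluate(samples):
    """`samples` is a hypothetical list of dicts with a generated `code` string
    and a reference `answer`; returns (executable_rate, correctness_rate) in percent."""
    executed = correct = 0
    for sample in samples:
        ok, output = run_snippet(sample["code"])
        executed += ok
        correct += ok and sample["answer"] in output
    n = len(samples)
    return 100 * executed / n, 100 * correct / n
```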
### Domain
For code correctness, we evaluate the accuracy of the code execution results in two specific domains: `Math` and `Visualization`.
For code executability, we calculate the executable rate of the generated code on `General problem-solving` tasks.
## Results
- Qwen-7B-Chat refers to the version updated after September 25, 2023.
- In version 20231206, the code correctness judger model for `Visualization` was changed from `Qwen-vl-chat` to `gpt-4-vision-preview`.
**In-house Code Interpreter Benchmark (Version 20231206)**

The `Math`, `Visualization-Hard`, and `Visualization-Easy` columns report the accuracy of code execution results (%); the `General` column reports the executable rate of code (%).

| Model                  | Math↑ | Visualization-Hard↑ | Visualization-Easy↑ | General↑ |
|------------------------|-------|---------------------|---------------------|----------|
| GPT-4                  | 82.8  | 66.7                | 60.8                | 82.8     |
| GPT-3.5                | 47.3  | 33.3                | 55.7                | 74.1     |
| LLaMA2-13B-Chat        | 8.3   | 1.2                 | 15.2                | 48.3     |
| CodeLLaMA-13B-Instruct | 28.2  | 15.5                | 21.5                | 74.1     |
| InternLM-20B-Chat      | 34.6  | 10.7                | 24.1                | 65.5     |
| ChatGLM3-6B            | 54.2  | 4.8                 | 15.2                | 62.1     |
| Qwen-1.8B-Chat         | 25.6  | 21.4                | 22.8                | 65.5     |
| Qwen-7B-Chat           | 41.9  | 23.8                | 38.0                | 67.2     |
| Qwen-14B-Chat          | 58.4  | 31.0                | 45.6                | 65.5     |
| Qwen-72B-Chat          | 72.7  | 41.7                | 43.0                | 82.8     |
For reference, we also provide the results obtained when `Qwen-vl-plus` is used as the code correctness judger model for the `Visualization` task.
**Code Correctness Judger Model = Qwen-vl-plus**

Both columns report the accuracy of code execution results (%).

| Model                  | Visualization-Hard↑ | Visualization-Easy↑ |
|------------------------|---------------------|---------------------|
| LLaMA2-13B-Chat        | 2.4                 | 17.7                |
| CodeLLaMA-13B-Instruct | 17.9                | 34.2                |
| InternLM-20B-Chat      | 9.5                 | 31.7                |
| ChatGLM3-6B            | 10.7                | 29.1                |
| Qwen-1.8B-Chat         | 32.1                | 32.9                |
| Qwen-7B-Chat           | 26.2                | 39.2                |
| Qwen-14B-Chat          | 36.9                | 41.8                |
| Qwen-72B-Chat          | 38.1                | 38.0                |
## Usage
### Installation
```shell
git clone https://github.com/QwenLM/Qwen-Agent.git
cd Qwen-Agent/benchmark
pip install -r requirements.txt
```
### Dataset Download
```shell
cd benchmark
wget https://qianwen-res.oss-cn-beijing.aliyuncs.com/assets/qwen_agent/benchmark_code_interpreter_data.zip
unzip benchmark_code_interpreter_data.zip
mkdir eval_data
mv eval_code_interpreter_v1.jsonl eval_data/
```
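After unzipping, you can sanity-check the downloaded evaluation set. The snippet below assumes only that the file is standard JSON Lines and does not rely on any particular field names:

```python
import json

# Load every record from the evaluation file placed under eval_data/.
with open("eval_data/eval_code_interpreter_v1.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]

print(f"loaded {len(records)} evaluation samples")
print("fields in the first sample:", sorted(records[0].keys()))
```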
### Evaluation
To reproduce the full benchmark results, run the following script:
```Shell
python inference_and_execute.py --model {model_name}
```
{model_name}:
- qwen-1.8b-chat
- qwen-7b-chat
- qwen-14b-chat
- qwen-72b-chat
- llama-2-7b-chat
- llama-2-13b-chat
- codellama-7b-instruct
- codellama-13b-instruct
- internlm-7b-chat-1.1
- internlm-20b-chat
The benchmark runs the test cases and generates the performance results, which are saved in the `output_data` directory.
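After a run, you can list what the benchmark produced under `output_data`; the exact file layout depends on the task and model, so this is just a generic directory walk:

```python
from pathlib import Path

# Print every file the benchmark wrote under output_data/.
for path in sorted(Path("output_data").rglob("*")):
    if path.is_file():
        print(path)
```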
**Notes**:
Please install the `simhei.ttf` font so that matplotlib renders text correctly when evaluating the visualization task. To do this, prepare `simhei.ttf` (which can be found on any Windows PC) and run the following code snippet:
```python
import os
import matplotlib

# Copy simhei.ttf into matplotlib's bundled font directory.
target_font_path = os.path.join(
    os.path.abspath(
        os.path.join(matplotlib.matplotlib_fname(), os.path.pardir)),
    'fonts', 'ttf', 'simhei.ttf')
os.system(f'cp simhei.ttf {target_font_path}')

# Remove the cached font list so matplotlib rebuilds it on the next import.
font_list_cache = os.path.join(matplotlib.get_cachedir(), 'fontlist-*.json')
os.system(f'rm -f {font_list_cache}')
```
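To verify that matplotlib has picked up the font, start a fresh Python process (so the font cache is rebuilt) and check the registered font list:

```python
import matplotlib.font_manager as fm

# SimHei should appear among the registered TrueType fonts after the cache is rebuilt.
print(any("SimHei" in font.name for font in fm.fontManager.ttflist))
```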
#### Code Executable Rate
```Shell
python inference_and_execute.py --task {task_name} --model {model_name}
```
{task_name}:
- `general`: General problem-solving task
#### Code Correctness Rate
```Shell
python inference_and_execute.py --task {task_name} --model {model_name}
```
{task_name}:
- `visualization`: Visualization task
- `gsm8k`: Math task
## Configuration
The `inference_and_execute.py` script supports the following configurable options:
- `--model`: The model to test, which can be one of `qwen-72b-chat`, `qwen-14b-chat`, `qwen-7b-chat`, `qwen-1.8b-chat`, `llama-2-7b-chat`, `llama-2-13b-chat`, `codellama-7b-instruct`, `codellama-13b-instruct`, `internlm-7b-chat-1.1`, `internlm-20b-chat`.
- `--task`: The test task, which can be one of `all`, `visualization`, `general`, `gsm8k`.
- `--output-path`: The path for saving the evaluation results.
- `--input-path`: The path containing the evaluation data.
- `--output-fname`: The file name for the evaluation results.
- `--input-fname`: The file name of the evaluation data.
- `--force`: Force generation, overwriting any cached results.
- `--eval-only`: Only calculate evaluation metrics, without re-running inference.
- `--eval-code-exec-only`: Only evaluate the code executable rate.
- `--gen-exec-only`: Only generate and execute code, without calculating evaluation metrics.
- `--gen-only`: Only generate code, without executing it or calculating evaluation metrics.
- `--vis-judger`: The model used to judge result correctness for the `Visualization` task, which can be one of `gpt-4-vision-preview`, `qwen-vl-chat`, `qwen-vl-plus`. It defaults to `gpt-4-vision-preview` as of version 20231206; `Qwen-vl-chat` has been deprecated.