Code Interpreter Benchmark

Introduction

To assess an LLM's ability to use the Python Code Interpreter for tasks such as mathematical problem solving, data visualization, and general-purpose work such as file handling and web scraping, we have created and open-sourced a benchmark designed specifically to evaluate these capabilities.

Metrics

The metrics are divided into two parts: code executability and code correctness.

Code executability: whether the LLM-generated code can be executed.
Code correctness: whether the LLM-generated code produces the correct result when executed.

Domain

For code correctness, we measure the accuracy of the code execution results in two domains: Math and Visualization. For code executability, we calculate the executable rate of the generated code on General problem-solving tasks.
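The two metrics can be illustrated with a minimal sketch. It is not the benchmark's own scoring code: the record keys (`domain`, `executable`, `correct`) and the `summarize` helper are hypothetical names chosen for illustration, and the records are made-up toy data rather than benchmark results.

```python
from collections import defaultdict


def summarize(records):
    """Return per-domain percentages for executability and correctness.

    Each record is a dict with hypothetical keys:
      'domain'     -> e.g. 'math', 'visualization', 'general'
      'executable' -> True if the generated code could be executed
      'correct'    -> True/False if the result was judged, None if not judged
    """
    totals = defaultdict(lambda: {'n': 0, 'exec': 0, 'judged': 0, 'correct': 0})
    for rec in records:
        stats = totals[rec['domain']]
        stats['n'] += 1
        stats['exec'] += bool(rec['executable'])
        if rec.get('correct') is not None:
            stats['judged'] += 1
            stats['correct'] += bool(rec['correct'])

    summary = {}
    for domain, s in totals.items():
        summary[domain] = {
            'executable_rate': 100.0 * s['exec'] / s['n'],
            # Accuracy is only defined for domains whose results were judged.
            'accuracy': 100.0 * s['correct'] / s['judged'] if s['judged'] else None,
        }
    return summary


# Toy records for illustration only (not benchmark data).
records = [
    {'domain': 'general', 'executable': True, 'correct': None},
    {'domain': 'general', 'executable': False, 'correct': None},
    {'domain': 'math', 'executable': True, 'correct': True},
    {'domain': 'math', 'executable': True, 'correct': False},
]
print(summarize(records))
```

In this sketch the General domain reports only an executable rate, while Math reports an accuracy over the judged samples, mirroring the split described above.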

Results

Qwen-7B-Chat refers to the version updated after September 25, 2023.
The code correctness judger model for Visualization was changed from Qwen-vl-chat to gpt-4-vision-preview in version 20231206.

In-house Code Interpreter Benchmark (Version 20231206)
Model	Accuracy of Code Execution Results (%)			Executable Rate of Code (%)
Model	Math↑	Visualization-Hard↑	Visualization-Easy↑	General↑
GPT-4	82.8	66.7	60.8	82.8
GPT-3.5	47.3	33.3	55.7	74.1
LLaMA2-13B-Chat	8.3	1.2	15.2	48.3
CodeLLaMA-13B-Instruct	28.2	15.5	21.5	74.1
InternLM-20B-Chat	34.6	10.7	24.1	65.5
ChatGLM3-6B	54.2	4.8	15.2	62.1
Qwen-1.8B-Chat	25.6	21.4	22.8	65.5
Qwen-7B-Chat	41.9	23.8	38.0	67.2
Qwen-14B-Chat	58.4	31.0	45.6	65.5
Qwen-72B-Chat	72.7	41.7	43.0	82.8

We also provide results that use Qwen-vl-plus as the code correctness judger model for the Visualization task, for reference.

Code Correctness Judger Model = Qwen-vl-plus
Model	Accuracy of Code Execution Results (%)
Model	Visualization-Hard↑	Visualization-Easy↑
LLaMA2-13B-Chat	2.4	17.7
CodeLLaMA-13B-Instruct	17.9	34.2
InternLM-20B-Chat	9.5	31.7
ChatGLM3-6B	10.7	29.1
Qwen-1.8B-Chat	32.1	32.9
Qwen-7B-Chat	26.2	39.2
Qwen-14B-Chat	36.9	41.8
Qwen-72B-Chat	38.1	38.0

Usage

Installation

git clone https://github.com/QwenLM/Qwen-Agent.git
cd Qwen-Agent/benchmark
pip install -r requirements.txt

Dataset Download

cd benchmark
wget https://qianwen-res.oss-cn-beijing.aliyuncs.com/assets/qwen_agent/benchmark_code_interpreter_data.zip
unzip benchmark_code_interpreter_data.zip
mkdir eval_data
mv eval_code_interpreter_v1.jsonl eval_data/
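To check that the dataset landed in the right place, the file can be read as JSON Lines. The sketch below assumes one JSON object per line (suggested by the `.jsonl` extension) and prints the field names instead of assuming them; `load_jsonl` is a hypothetical helper, not part of the benchmark.

```python
import json
import os


def load_jsonl(path):
    """Read a JSON Lines file into a list of Python objects, skipping blank lines."""
    rows = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:
                rows.append(json.loads(line))
    return rows


if __name__ == '__main__':
    path = os.path.join('eval_data', 'eval_code_interpreter_v1.jsonl')
    if os.path.exists(path):
        rows = load_jsonl(path)
        print(f'{len(rows)} test cases loaded')
        # Inspect the schema of the first record rather than guessing it.
        if rows and isinstance(rows[0], dict):
            print(sorted(rows[0].keys()))
    else:
        print(f'{path} not found; run the download steps first')
```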

Evaluation

To reproduce the full benchmark results, run the following script:

python inference_and_execute.py --model {model_name}

{model_name}:

qwen-1.8b-chat
qwen-7b-chat
qwen-14b-chat
qwen-72b-chat
llama-2-7b-chat
llama-2-13b-chat
codellama-7b-instruct
codellama-13b-instruct
internlm-7b-chat-1.1
internlm-20b-chat

The benchmark will run the test cases and generate the performance results. The results will be saved in the output_data directory.

Note: Install the simhei.ttf font so that matplotlib renders text properly when evaluating the visualization task. Obtain simhei.ttf (it can be found on any Windows PC), then run the following code snippet from the directory that contains the font file:

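import glob
import os
import shutil
import matplotlib

# Copy simhei.ttf from the current directory into matplotlib's bundled font directory.
target_font_path = os.path.join(
    os.path.abspath(
        os.path.join(matplotlib.matplotlib_fname(), os.path.pardir)),
    'fonts', 'ttf', 'simhei.ttf')
shutil.copy('simhei.ttf', target_font_path)

# Delete the cached font list so matplotlib rebuilds it and picks up the new font.
font_list_cache = os.path.join(matplotlib.get_cachedir(), 'fontlist-*.json')
for cache_file in glob.glob(font_list_cache):
    os.remove(cache_file)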

Code Executable Rate

python inference_and_execute.py --task {task_name} --model {model_name}

{task_name}:

general: General problem-solving task

Code Correctness Rate

python inference_and_execute.py --task {task_name} --model {model_name}

{task_name}:

visualization: Visualization task
gsm8k: Math task
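To run several models and tasks in one go, the documented command can be wrapped in a small driver. This is a sketch, not part of the benchmark: `build_command` and `run_all` are hypothetical helpers, the model subset is arbitrary, and only the `--task` and `--model` flags shown above are used. By default it only prints the commands.

```python
import subprocess
import sys

MODELS = ['qwen-7b-chat', 'qwen-14b-chat']     # any names from the {model_name} list
TASKS = ['general', 'visualization', 'gsm8k']  # the documented task names


def build_command(model, task):
    """Build the argument list for one benchmark run."""
    return [sys.executable, 'inference_and_execute.py',
            '--task', task, '--model', model]


def run_all(models=MODELS, tasks=TASKS, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for model in models:
        for task in tasks:
            cmd = build_command(model, task)
            print(' '.join(cmd))
            if not dry_run:
                # check=True stops the sweep on the first failing run.
                subprocess.run(cmd, check=True)


if __name__ == '__main__':
    run_all()
```

Call `run_all(dry_run=False)` from the benchmark directory to actually launch the runs.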

Configuration

The inference_and_execute.py file contains the following configurable options:

--model: The model to test, which can be one of qwen-72b-chat, qwen-14b-chat, qwen-7b-chat, qwen-1.8b-chat, llama-2-7b-chat, llama-2-13b-chat, codellama-7b-instruct, codellama-13b-instruct, internlm-7b-chat-1.1, internlm-20b-chat.
--task: The test task, which can be one of all, visualization, general, gsm8k.
--output-path: The path for saving evaluation results.
--input-path: The path where the evaluation data is placed.
--output-fname: The file name for the evaluation results.
--input-fname: The file name of the evaluation data.
--force: Force generation, overwriting the cached results.
--eval-only: Only calculate evaluation metrics, without re-running inference.
--eval-code-exec-only: Only evaluate the code executable rate.
--gen-exec-only: Only generate and execute code, without calculating evaluation metrics.
--gen-only: Only generate code, without executing it or calculating evaluation metrics.
--vis-judger: The model that judges result correctness for the Visualization task, which can be one of gpt-4-vision-preview, qwen-vl-chat, qwen-vl-plus. It is set to gpt-4-vision-preview by default in version 20231206, and qwen-vl-chat has been deprecated.
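As an illustration of how a few of these options combine on the command line, here is a hypothetical `argparse` sketch. It is not the parser used by the script: only a subset of the flags is declared, and the `--task` default of `all` is an assumption. The `--vis-judger` default follows the description above.

```python
import argparse


def build_parser():
    """Illustrative parser covering a subset of the documented options."""
    parser = argparse.ArgumentParser(description='Illustrative sketch only.')
    parser.add_argument('--model', required=True)
    parser.add_argument('--task', default='all',  # default is an assumption
                        choices=['all', 'visualization', 'general', 'gsm8k'])
    parser.add_argument('--force', action='store_true')
    parser.add_argument('--eval-only', action='store_true')
    parser.add_argument('--gen-only', action='store_true')
    parser.add_argument('--vis-judger', default='gpt-4-vision-preview',
                        choices=['gpt-4-vision-preview', 'qwen-vl-chat', 'qwen-vl-plus'])
    return parser


# Re-score cached gsm8k outputs without running inference again.
args = build_parser().parse_args(
    ['--model', 'qwen-7b-chat', '--task', 'gsm8k', '--eval-only'])
print(args.task, args.eval_only, args.vis_judger)
```

Note that `argparse` converts hyphens in flag names to underscores, so `--eval-only` is read back as `args.eval_only`.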