# Evaluation
This repository contains code for evaluating the performance of LLMs in both single-turn and multi-turn scenarios.
To set up the environment, install the required dependencies by running:

```bash
pip install -r evaluation/requirements.txt
```
## Single-turn Evaluation
For single-turn evaluation, the processes of inference, post-processing, and result aggregation are separated as follows:

- Execute

  ```bash
  bash evaluation/evaluate/scripts/01_gen_single.sh
  ```

  to generate model results.

- Perform post-processing on the model output by executing

  ```bash
  bash evaluation/evaluate/scripts/02_sanitize_single.sh
  ```

- Finally, compute evaluation metrics by executing

  ```bash
  bash evaluation/evaluate/scripts/03_eval_single.sh
  ```
## Multi-turn Evaluation
### Multi-turn Evaluation with Execution Feedback
Evaluate the performance of the models with execution feedback using the provided scripts:

For OpenCodeInterpreter:

```bash
bash evaluation/evaluate/scripts/04_execution_feedback_multiround_OpenCodeInterpreter.sh
```
For OpenAI's GPT models: before running the evaluation, implement the `get_predict` function in `chat_with_gpt.py` to enable interaction with the GPT models (a hedged sketch is given below). Then execute the following script:

```bash
bash evaluation/evaluate/scripts/05_execution_feedback_multiround_gpt.sh
```
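The exact contract for `get_predict` is defined by `chat_with_gpt.py`, so check its signature and return type there before copying anything. As a starting point, here is a minimal sketch that assumes `get_predict` receives an OpenAI-style message list (dicts with `role` and `content`) and should return the assistant's reply as a string; the model name and parameters are placeholders, not values prescribed by this repository.

```python
# Hedged sketch of get_predict; adapt to the actual signature in chat_with_gpt.py.
import os

from openai import OpenAI

# Reads the API key from the OPENAI_API_KEY environment variable.
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def get_predict(messages, model="gpt-3.5-turbo", temperature=0.0):
    """Send the accumulated chat messages to a GPT model and return its reply text."""
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=temperature,
    )
    return response.choices[0].message.content
```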
### Multi-turn Evaluation with GPT-4 Simulated Human Feedback
Execute either of the following scripts to evaluate the models with simulated human feedback:
For OpenCodeInterpreter:

```bash
bash evaluation/evaluate/scripts/06_human_feedback_multiround_OpenCodeInterpreter.sh
```
For Oracle OpenCodeInterpreter:

```bash
bash evaluation/evaluate/scripts/07_human_feedback_multiround_Oracle_OpenCodeInterpreter.sh
```
These scripts facilitate the multi-turn evaluation with simulated human feedback.
This evaluation code is based on EvalPlus and has been modified for our evaluation setting. We extend our gratitude to the contributors of EvalPlus for their foundational work.