# Evaluation
This repository contains code for evaluating the performance of LLMs in both single-turn and multi-turn scenarios.
To set up the environment, install the required dependencies by running:

```bash
pip install -r evaluation/requirements.txt
```
## Single-turn Evaluation
For single-turn evaluation, the processes of inference, post-processing, and result aggregation are separated as follows:

- Execute

  ```bash
  bash evaluation/evaluate/scripts/01_gen_single.sh
  ```

  to generate model results.

- Perform post-processing on the model output by executing

  ```bash
  bash evaluation/evaluate/scripts/02_sanitize_single.sh
  ```

- Finally, compute evaluation metrics by executing

  ```bash
  bash evaluation/evaluate/scripts/03_eval_single.sh
  ```
## Multi-turn Evaluation
### Multi-turn Evaluation with Execution Feedback
Evaluate the performance of the models with execution feedback using the provided scripts:

For OpenCodeInterpreter:

```bash
bash evaluation/evaluate/scripts/04_execution_feedback_multiround_OpenCodeInterpreter.sh
```
For OpenAI's GPT models: before running the evaluation, implement the `get_predict` function in `chat_with_gpt.py` to enable interaction with the GPT models (a hedged sketch is given below). Then execute the following script:

```bash
bash evaluation/evaluate/scripts/05_execution_feedback_multiround_gpt.sh
```
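The exact contract for `get_predict` is defined by `chat_with_gpt.py`, so check its signature and return type there before copying anything. As a starting point, here is a minimal sketch that assumes `get_predict` receives an OpenAI-style message list (dicts with `role` and `content`) and should return the assistant's reply as a string; the model name and parameters are placeholders, not values prescribed by this repository.

```python
# Hedged sketch of get_predict; adapt to the actual signature in chat_with_gpt.py.
import os

from openai import OpenAI

# Reads the API key from the OPENAI_API_KEY environment variable.
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def get_predict(messages, model="gpt-3.5-turbo", temperature=0.0):
    """Send the accumulated chat messages to a GPT model and return its reply text."""
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=temperature,
    )
    return response.choices[0].message.content
```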
### Multi-turn Evaluation with GPT-4 Simulated Human Feedback
Execute either of the following scripts to evaluate the models with simulated human feedback:
For OpenCodeInterpreter:

```bash
bash evaluation/evaluate/scripts/06_human_feedback_multiround_OpenCodeInterpreter.sh
```
For Oracle OpenCodeInterpreter:

```bash
bash evaluation/evaluate/scripts/07_human_feedback_multiround_Oracle_OpenCodeInterpreter.sh
```
These scripts facilitate the multi-turn evaluation with simulated human feedback.
This evaluation code is based on EvalPlus and has been modified for our evaluation setting. We extend our gratitude to the contributors of EvalPlus for their foundational work.