Varco Arena

Varco Arena runs a tournament among the models being compared for each instruction in the test set, then ranks the models from the match results. This is more accurate and more cost-effective than estimating win rates by comparing each model against reference outputs.
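To make the tournament idea concrete, here is a minimal sketch of one bracket for a single test-set instruction. It assumes a single-elimination format with byes and uses a toy judge that prefers the longer output; the function names, the bracket seeding, and the judge are illustrative assumptions, not the code in this repository. How match results are aggregated into a final ranking is not covered here.

```python
import random
from typing import Callable, Dict, List, Tuple


def run_single_elimination(
    outputs: Dict[str, str],
    judge: Callable[[str, str], bool],
    rng: random.Random,
) -> List[Tuple[str, str, str]]:
    """Play one single-elimination bracket over the outputs for one instruction.

    `outputs` maps a model name to that model's output.
    `judge(a, b)` returns True when output `a` beats output `b`.
    Returns a list of (model_a, model_b, winner) match records.
    """
    players = list(outputs)
    rng.shuffle(players)  # random bracket seeding (an assumption)
    matches: List[Tuple[str, str, str]] = []
    while len(players) > 1:
        next_round: List[str] = []
        # With an odd number of players, one gets a bye into the next round.
        if len(players) % 2 == 1:
            next_round.append(players.pop())
        for a, b in zip(players[0::2], players[1::2]):
            winner = a if judge(outputs[a], outputs[b]) else b
            matches.append((a, b, winner))
            next_round.append(winner)
        players = next_round
    return matches


def longer_is_better(out_a: str, out_b: str) -> bool:
    # Toy stand-in for an LLM judge: the longer output wins.
    return len(out_a) >= len(out_b)


# Hypothetical outputs from five models for the same instruction.
outputs = {
    "model_a": "ok",
    "model_b": "a much longer and more detailed answer",
    "model_c": "a short answer",
    "model_d": "a medium-length answer",
    "model_e": "fine",
}
matches = run_single_elimination(outputs, longer_is_better, random.Random(0))
print(len(matches))    # number of judge calls for this instruction
print(matches[-1][2])  # winner of the final match
```

Every match eliminates exactly one model, so a single-elimination bracket over n models needs n - 1 judge calls per instruction.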

For more details on how it works, see the paper listed under Citation below.

Quickstart

Running Web Demo locally (streamlit, Recommended!)

git clone [THIS_REPO]
# Install the requirements listed below. We recommend miniforge for managing the environment.
cd streamlit_app_local
bash run.sh

For more details, see [THIS_REPO]/streamlit_app_local/README.md

CLI use

The CLI is located at
- varco_arena/
Debug configurations for VS Code are at
- varco_arena/.vscode

## gpt-4o-mini as a judge
python main.py -i "./some/dirpath/to/jsonl/files" -o SOME_REL_PATH_TO_CREATE -m tournament -e "gpt-4o-mini"
## vllm-openai served LLM as a judge
python main.py -i "./some/dirpath/to/jsonl/files" -o SOME_REL_PATH_TO_CREATE -e SOME_MODEL_NAME_SERVED -m tournament -u "http://url_to/your/vllm_openai_server:someport"

# debugging commands
## openai api judge dbg
python main.py -i "rsc/inputs_for_dbg/dbg_400_error_inputs/" -o SOME_WANTED_TARGET_DIR -e gpt-4o-mini
## other testing lines
python main.py -i "rsc/inputs_for_dbg/[SOME_DIRECTORY]/" -o SOME_WANTED_TARGET_DIR -e gpt-4o-mini
## dummy judge dbg (checking errors without api requests)
python main.py -i "rsc/inputs_for_dbg/dbg_400_error_inputs/" -o SOME_WANTED_TARGET_DIR -e debug

Requirements

Tested with Python 3.11.9. The dependencies are listed in requirements.txt:

openai>=1.17.0
munch
pandas
numpy
tqdm>=4.48.0
plotly
scikit-learn
kaleido
tiktoken>=0.7.0
pyyaml
transformers
streamlit>=1.40.2
openpyxl
fire==0.6.0
git+https://github.com/shobrook/openlimit.git#egg=openlimit # do not install this from PyPI

# Linux
uvloop
# Windows
winloop

Arguments

-i, --input : path to the directory containing the input jsonlines files (LLM outputs)
-o, --output_dir : directory where the results are written
-e, --evaluation : judge model to use (e.g. "gpt-4o-2024-05-13", "gpt-4o-mini", [vllm-served-model-name])
-k, --openai_api_key : OpenAI API key
-u, --openai_url : URL of an OpenAI-compatible LLM server (requests are sent through the OpenAI SDK)

Advanced

-j, --n_jobs : number of concurrent jobs, used as the asyncio.Semaphore limit
-p, --evalprompt : judge prompt to use (see the prompts directory)
-lr, --limit_requests : vLLM OpenAI server request limit (default: 7,680)
-lt, --limit_tokens : vLLM OpenAI server token limit (default: 15,728,640)
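As an illustration of what the --n_jobs limit means, the sketch below caps the number of in-flight judge calls with asyncio.Semaphore. The function names are hypothetical and the sleep stands in for a real API request; this is not the repository's actual scheduling code.

```python
import asyncio


async def run_with_limit(n_tasks: int, n_jobs: int) -> int:
    """Run n_tasks dummy judge calls with at most n_jobs in flight.

    Returns the peak number of calls that were running at the same time.
    """
    semaphore = asyncio.Semaphore(n_jobs)
    in_flight = 0
    peak = 0

    async def fake_judge_call() -> None:
        nonlocal in_flight, peak
        async with semaphore:  # waits here once n_jobs calls are already running
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stands in for a request to the judge
            in_flight -= 1

    await asyncio.gather(*(fake_judge_call() for _ in range(n_tasks)))
    return peak


peak = asyncio.run(run_with_limit(n_tasks=20, n_jobs=4))
print(peak)  # never exceeds n_jobs
```

A larger value finishes sooner but sends more simultaneous requests, so it should stay within the request and token limits of the judge server.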

Input Data Format

input jsonl guides
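As a rough illustration, the sketch below writes and reads a jsonlines file with one JSON object per line. Only the field names instruction, source, and generated come from this README (see the FAQ); their meanings as commented, the example values, and the file name are assumptions, and the real inputs may require further fields, so check the input jsonl guide before relying on this.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(path: Path, rows: list) -> None:
    """Write one JSON object per line."""
    with path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl(path: Path) -> list:
    """Read a jsonlines file back into a list of dicts, skipping blank lines."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical rows; the field meanings below are assumptions.
rows = [
    {
        "instruction": "Summarize the text in one sentence.",  # what the model was asked to do
        "source": "Varco Arena ranks models with tournaments.",  # input the instruction applies to
        "generated": "It ranks models by running tournaments.",  # the model's output
    },
    {
        "instruction": "Translate the text into French.",
        "source": "Good morning.",
        "generated": "Bonjour.",
    },
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "my_model.jsonl"  # hypothetical file name
    write_jsonl(path, rows)
    loaded = read_jsonl(path)

print(loaded == rows)
```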

Contributing & Customizing

After cloning the repository and installing the requirements, run:

pip install pre-commit
pre-commit install

Before committing, run:

bash precommit.sh # the black formatter will reformat the code

FAQ

I want to use my own judge prompt with Varco Arena
- ./varco_arena/prompts/ defines the prompts in YAML files, along with the class objects for them. Edit these as needed.
I want a different judge prompt for different rows of the test set (e.g. rows up to the 100th use prompt1, rows from the 101st onward use prompt2)
- load_prompt, in the directory above, takes promptname and task as parameters to load the prompt. The function is called at ./varco_arena/manager.py:async_run.
I want extra fields in my LLM output jsonl files for tailored use, i.e. fields beyond instruction, source, and generated
- This gets tricky, but here is a brief guide.
  - You will probably have to edit varco_arena/eval_utils.py:async_eval_w_prompt (this part calls PROMPT_OBJ.complete_prompt()).
  - All related code will need to be revised as well.
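For the per-row prompt question above, one possible approach is to select the prompt by task. The sketch below is a hypothetical stand-in and does not reproduce the real load_prompt: the templates, the task names, the mapping table, and the function name load_prompt_sketch are all assumptions made for illustration.

```python
from typing import Dict

# Hypothetical judge prompt templates keyed by prompt name.
PROMPT_TEMPLATES: Dict[str, str] = {
    "prompt1": "Which response follows the instruction better?\n{instruction}\nA: {a}\nB: {b}",
    "prompt2": "Which summary is more faithful to the source?\n{source}\nA: {a}\nB: {b}",
}

# Hypothetical mapping from a row's task to the prompt that should judge it.
TASK_TO_PROMPT: Dict[str, str] = {
    "summarization": "prompt2",
}


def load_prompt_sketch(promptname: str, task: str) -> str:
    """Return the template for `task` if one is mapped, else the one named `promptname`."""
    name = TASK_TO_PROMPT.get(task, promptname)
    return PROMPT_TEMPLATES[name]


# Rows with a mapped task get the tailored prompt; all others get the default.
default_template = load_prompt_sketch("prompt1", task="qa")
tailored_template = load_prompt_sketch("prompt1", task="summarization")
print(tailored_template.format(source="some text", a="summary A", b="summary B"))
```

With this shape, tailoring by row amounts to giving each row a task value and extending the mapping, rather than branching on row indices.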

Special Thanks to (contributors)

Minho Lee (@Dialogue Model Team, NCSOFT)
- query wrapper
- rag prompt
Jumin Oh (@Generation Model Team, NCSOFT)
- overall prototyping of the system in haste

Citation

If you find our work helpful, please consider citing our paper!

@misc{son2024varcoarenatournamentapproach,
      title={Varco Arena: A Tournament Approach to Reference-Free Benchmarking Large Language Models},
      author={Seonil Son and Ju-Min Oh and Heegon Jin and Cheolhun Jang and Jeongbeom Jeong and Kuntae Kim},
      year={2024},
      eprint={2411.01281},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2411.01281},
}