arxiv:2412.11314

Reliable, Reproducible, and Really Fast Leaderboards with Evalica

Published on Dec 15

Submitted by dustalov on Dec 17

Authors:

Dmitry Ustalov

Abstract

The rapid advancement of natural language processing (NLP) technologies, such as instruction-tuned large language models (LLMs), urges the development of modern evaluation protocols with human and machine feedback. We introduce Evalica, an open-source toolkit that facilitates the creation of reliable and reproducible model leaderboards. This paper presents its design, evaluates its performance, and demonstrates its usability through its Web interface, command-line interface, and Python API.

Community

dustalov

Paper author · Paper submitter · about 13 hours ago

Tired of waiting on slow leaderboard computations? Struggling to rank machine learning models quickly and accurately? Evalica is a Python library for fast, efficient, and correctly implemented ranking using methods like Elo, Bradley-Terry, average win rate, and more: https://github.com/dustalov/evalica.
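For readers unfamiliar with the ranking methods named above, here is a minimal pure-Python sketch of two of them, average win rate and Elo. It is not Evalica's API or implementation: the function names, the `(left, right, winner)` tuple format, the toy data, and the Elo parameters (`initial`, `k`, `scale`, `base`) are all assumptions chosen for illustration.

```python
from collections import defaultdict

# Hypothetical toy data: (left model, right model, winner),
# where winner is "left", "right", or "tie".
comparisons = [
    ("a", "b", "left"),
    ("a", "c", "left"),
    ("b", "c", "left"),
    ("b", "a", "tie"),
    ("c", "a", "right"),
]

# Points earned by the left model for each outcome; the right model gets the rest.
LEFT_POINTS = {"left": 1.0, "right": 0.0, "tie": 0.5}


def average_win_rate(comparisons):
    """Average, over opponents, of the share of points won against each opponent."""
    score = defaultdict(float)  # (model, opponent) -> points earned
    count = defaultdict(int)    # (model, opponent) -> comparisons seen
    for left, right, winner in comparisons:
        points = LEFT_POINTS[winner]
        score[left, right] += points
        score[right, left] += 1.0 - points
        count[left, right] += 1
        count[right, left] += 1

    rates = defaultdict(list)
    for (model, opponent), n in count.items():
        rates[model].append(score[model, opponent] / n)
    return {model: sum(r) / len(r) for model, r in rates.items()}


def elo(comparisons, initial=1000.0, k=4.0, scale=400.0, base=10.0):
    """Sequential Elo updates; the result depends on the order of comparisons."""
    ratings = defaultdict(lambda: initial)
    for left, right, winner in comparisons:
        # Expected score of the left model under the logistic Elo model.
        expected = 1.0 / (1.0 + base ** ((ratings[right] - ratings[left]) / scale))
        delta = k * (LEFT_POINTS[winner] - expected)
        ratings[left] += delta
        ratings[right] -= delta
    return dict(ratings)


win_rates = average_win_rate(comparisons)
elo_ratings = elo(comparisons)
```

On this toy data both methods order the models as a, b, c. Ties are counted as half a win for each side, which is one common convention and an assumption here.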