---
title: alpaca-bt-eval
app_file: app.py
sdk: gradio
sdk_version: 4.19.1
---
[Alpaca](https://github.com/tatsu-lab/alpaca_eval) is an LLM
evaluation framework. It maintains a set of prompts, along with
responses to those prompts from a collection of LLMs. It then presents
pairs of responses to a judge, which determines which response better
addresses the prompt. Rather than compare all response pairs, the
framework designates a baseline model and compares every other model
against it. The standard method of ranking models is to sort them by
their win percentage against the baseline.
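
As a toy illustration of that standard ranking (this is not the
AlpacaEval code itself; the model names and column layout below are
made up), the baseline win percentage can be computed from per-prompt
judge verdicts:

```python
import pandas as pd

# Hypothetical judge verdicts: one row per prompt, 1 if the model's
# response beat the baseline model's response, 0 otherwise.
verdicts = pd.DataFrame({
    "model": ["model-a", "model-a", "model-a", "model-b", "model-b", "model-b"],
    "beat_baseline": [1, 1, 0, 1, 0, 0],
})

# Alpaca-style ranking: sort by win percentage against the baseline.
win_rates = (
    verdicts.groupby("model")["beat_baseline"]
    .mean()
    .sort_values(ascending=False)
)
print(win_rates)  # model-a ~0.67, model-b ~0.33
```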
This Space presents an alternative ranking method based on the
[Bradley–Terry
model](https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model)
(BT). Given a collection of items, Bradley–Terry estimates the
_ability_ of each item from pairwise comparisons between them. In
sports, for example, that might be the ability of a given team based
on the games it has played within a league. Once estimated, ability
can be used to calculate the probability that one item is better than
another, even if those two items have yet to be formally compared.
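
For intuition, under Bradley–Terry the probability that item i beats
item j is ability(i) / (ability(i) + ability(j)). A minimal sketch,
using made-up ability values rather than ones estimated by this Space:

```python
def bt_win_probability(ability_i: float, ability_j: float) -> float:
    """Bradley–Terry: P(i beats j) = a_i / (a_i + a_j)."""
    return ability_i / (ability_i + ability_j)

# Hypothetical abilities for two models.
print(bt_win_probability(1.8, 0.6))  # 0.75: model i preferred three times out of four
```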
The Alpaca project presents a good opportunity to apply BT in
practice, especially since BT fits nicely into a Bayesian analysis
framework. As LLMs become more pervasive, quantifying the uncertainty
in their evaluation is increasingly important, and Bayesian methods
are well suited to that.
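
As a rough sketch of what a Bayesian Bradley–Terry fit can look like
(this assumes PyMC and uses made-up comparison data; it is not
necessarily the library or model specification this Space uses):

```python
import numpy as np
import pymc as pm

# Hypothetical pairwise outcomes: in comparison k, model winner[k] beat model loser[k].
n_models = 4
winner = np.array([0, 0, 1, 2, 1])
loser = np.array([1, 2, 2, 3, 3])

with pm.Model():
    # Log-ability for each model; the prior pins down the otherwise-unidentified scale.
    log_ability = pm.Normal("log_ability", mu=0.0, sigma=1.0, shape=n_models)
    # Bradley–Terry likelihood: P(winner beats loser) via a logistic link.
    p_win = pm.math.sigmoid(log_ability[winner] - log_ability[loser])
    pm.Bernoulli("outcome", p=p_win, observed=np.ones(len(winner)))
    trace = pm.sample(1000, tune=1000, chains=2)

# The posterior over log_ability quantifies the uncertainty in each model's ability.
print(trace.posterior["log_ability"].mean(dim=("chain", "draw")))
```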
This Space is divided into two primary sections. The first ranks
models by estimated ability: the figure on the right shows this
ranking for the top 10 models, while the table below it covers the
full set. The second section estimates the probability that one model
will be preferred to another. A final section at the bottom is a
disclaimer that details the workflow.
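
For example, once posterior draws of ability are available, the
preference probability shown in that second section can be summarized
along these lines (the draws below are simulated placeholders, not
output from this Space):

```python
import numpy as np

# Placeholder posterior draws of log-ability for two models
# (in practice these would come from the fitted Bradley–Terry model).
rng = np.random.default_rng(0)
log_a = rng.normal(0.4, 0.1, size=2000)
log_b = rng.normal(0.1, 0.1, size=2000)

# Posterior distribution of P(model A is preferred to model B).
p_prefer = 1.0 / (1.0 + np.exp(-(log_a - log_b)))
print(p_prefer.mean(), np.quantile(p_prefer, [0.025, 0.975]))
```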