jerome-white committed on
Commit 8d5a97a
1 Parent(s): eb40ea1

Initial commit

Files changed (4)
  1. DISCLAIMER.md +23 -0
  2. README.md +33 -8
  3. app.py +227 -0
  4. requirements.txt +5 -0
DISCLAIMER.md ADDED
@@ -0,0 +1,23 @@
+ # Disclaimer
+
+ This Space is primarily intended for exploration. Until otherwise
+ stated, its results should be treated as points of reference rather
+ than absolute fact. Viewers are encouraged to study the pipeline and
+ understand the model before broadcasting strong opinions about model
+ rankings based on what is seen here. Suggestions for improving this
+ Space from those familiar with Alpaca or Bayesian data analysis are
+ welcome!
+
+ ## Resources
+
+ * [Source code](https://github.com/jerome-white/alpaca-bda) for
+   producing results
+
+ ## TODO
+
+ - [ ] Extend the Stan model to incorporate ties and response
+   presentation ordering
+ - [ ] Add details of the MCMC chains
+ - [ ] Automate data processing
README.md CHANGED
@@ -1,13 +1,38 @@
  ---
- title: Alpaca Bt Eval
- emoji: 🌍
- colorFrom: green
- colorTo: blue
+ title: alpaca-bt-eval
+ app_file: app.py
  sdk: gradio
  sdk_version: 4.19.1
- app_file: app.py
- pinned: false
- license: apache-2.0
  ---
+ [Alpaca](https://github.com/tatsu-lab/alpaca_eval) is an LLM
+ evaluation framework. It maintains a set of prompts, along with
+ responses to those prompts from a collection of LLMs. It then presents
+ pairs of responses to a judge that determines which response better
+ addresses the prompt. Rather than compare all response pairs, the
+ framework designates a baseline model and compares every other model
+ to it. The standard method of ranking models is to sort by win
+ percentage against the baseline.
+
+ This Space presents an alternative ranking method based on the
+ [Bradley–Terry
+ model](https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model)
+ (BT). Given a collection of items, Bradley–Terry estimates the
+ _ability_ of each item based on pairwise comparisons between them. In
+ sports, for example, that might be the ability of a given team based
+ on games that team has played within a league. Once calculated,
+ ability can be used to estimate the probability that one item will be
+ better than another, even if those items have yet to be formally
+ compared.
+
+ The Alpaca project presents a good opportunity to apply BT in
+ practice, especially since BT fits nicely into a Bayesian analysis
+ framework. As LLMs become more pervasive, quantifying the uncertainty
+ in their evaluation is increasingly important, and Bayesian
+ frameworks are good at that.
 
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ This Space is divided into two primary sections. The first presents a
+ ranking of models based on estimated ability: the figure on the right
+ shows this ranking for the top 10 models, while the table below it
+ presents the full set. The second section estimates the probability
+ that one model will be preferred to another. A final section at the
+ bottom is a disclaimer with details about the workflow.
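
To make the BT ranking concrete, here is a minimal sketch of the probability calculation the Space reports. The `alpha` draws below are hypothetical stand-ins; the Space itself uses posterior samples from a Stan fit.

```python
import numpy as np
from scipy.special import expit

# Hypothetical posterior draws of two models' abilities (alpha);
# stand-ins for the Stan samples the Space actually loads.
rng = np.random.default_rng(seed=0)
alpha_1 = rng.normal(loc=1.0, scale=0.2, size=4000)
alpha_2 = rng.normal(loc=0.5, scale=0.2, size=4000)

# Bradley-Terry: Pr(model 1 is preferred to model 2) is the inverse
# logit (expit) of the difference in abilities.
p = expit(alpha_1 - alpha_2)
print(p.mean())  # roughly 0.62 with these stand-in draws
```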
app.py ADDED
@@ -0,0 +1,227 @@
+ import math
+ import operator as op
+ import itertools as it
+ import functools as ft
+ import collections as cl
+ from pathlib import Path
+
+ import pandas as pd
+ import gradio as gr
+ import seaborn as sns
+ from datasets import load_dataset
+ from scipy.special import expit
+
+ HDI = cl.namedtuple('HDI', 'lower, upper')
+
+ #
+ # Highest density interval: the narrowest interval containing `ci`
+ # of the values. See https://cran.r-project.org/package=HDInterval
+ #
+ def hdi(values, ci=0.95):
+     values = sorted(filter(math.isfinite, values))
+     if not values:
+         raise ValueError('Empty data set')
+
+     n = len(values)
+     exclude = n - math.floor(n * ci)
+
+     # Pair each of the lowest `exclude` values with the value that
+     # closes an interval covering `ci` of the sorted data ...
+     left = it.islice(values, exclude)
+     right = it.islice(values, n - exclude, None)
+
+     # ... and keep the narrowest such interval.
+     diffs = ((x, y, y - x) for (x, y) in zip(left, right))
+     (*args, _) = min(diffs, key=op.itemgetter(-1))
+
+     return HDI(*args)
+
+ #
+ # Data loading and posterior summaries
+ #
+ def load(repo):
+     parameter = 'parameter'
+     items = [
+         'chain',
+         'sample',
+         parameter,
+         'model',
+         'value',
+     ]
+     dataset = load_dataset(repo)
+
+     # Keep only the "alpha" (ability) samples.
+     return (dataset
+             .get('train')
+             .to_pandas()
+             .filter(items=items)
+             .query(f'{parameter} == "alpha"')
+             .drop(columns=parameter))
+
+ def summarize(df, ci=0.95):
+     # One row per model: median ability, HDI bounds, and HDI width.
+     def _aggregate(i, g):
+         values = g['value']
+         interval = hdi(values, ci)
+
+         agg = {
+             'model': i,
+             'ability': values.median(),
+             'uncertainty': interval.upper - interval.lower,
+         }
+         agg.update(interval._asdict())
+
+         return agg
+
+     groups = df.groupby('model', sort=False)
+     records = it.starmap(_aggregate, groups)
+
+     return pd.DataFrame.from_records(records)
+
+ def rank(df, ascending, name='rank'):
+     # Order by ability, breaking ties with the narrower HDI, then
+     # expose the resulting 1-based position as a column.
+     df = (df
+           .sort_values(by=['ability', 'uncertainty'],
+                        ascending=[ascending, not ascending])
+           .drop(columns='uncertainty')
+           .reset_index(drop=True))
+     df.index += 1
+
+     return df.reset_index(names=name)
+
+ def compare(df, model_1, model_2):
+     mcol = 'model'
+     models = [
+         model_1,
+         model_2,
+     ]
+     # One column of posterior draws per model, aligned by MCMC draw;
+     # the inverse logit of their difference is Pr(model_1 is
+     # preferred to model_2) under BT.
+     view = (df
+             .query(f'{mcol} in @models')
+             .pivot(index=['chain', 'sample'],
+                    columns=mcol,
+                    values='value'))
+
+     return expit(view[model_1] - view[model_2])
+
+ #
+ # Plotting helpers
+ #
+ class DataPlotter:
+     def __init__(self, df):
+         self.df = df
+
+     def plot(self):
+         ax = self.draw()
+         ax.grid(visible=True,
+                 axis='both',
+                 alpha=0.25,
+                 linestyle='dotted')
+
+         fig = ax.get_figure()
+         fig.tight_layout()
+
+         return fig
+
+     def draw(self):
+         raise NotImplementedError()
+
+ class RankPlotter(DataPlotter):
+     _y = 'y'
+
+     @ft.cached_property
+     def y(self):
+         return self.df[self._y]
+
+     def __init__(self, df, top=10):
+         # Keep the `top` highest-ability models for the headline figure.
+         view = rank(summarize(df), True, self._y)
+         view = (view
+                 .tail(top)
+                 .sort_values(by=self._y, ascending=False))
+         super().__init__(view)
+
+     def draw(self):
+         # Dot at the median ability, horizontal line spanning the HDI.
+         ax = self.df.plot.scatter('ability', self._y)
+         ax.hlines(self.y,
+                   xmin=self.df['lower'],
+                   xmax=self.df['upper'],
+                   alpha=0.5)
+         ax.set_ylabel('')
+         ax.set_yticks(self.y, self.df['model'])
+
+         return ax
+
+ class ComparisonPlotter(DataPlotter):
+     def __init__(self, df, model_1, model_2, ci=0.95):
+         super().__init__(compare(df, model_1, model_2))
+         self.interval = hdi(self.df, ci)
+
+     def draw(self):
+         # Empirical CDF of Pr(model_1 > model_2), with the median and
+         # its HDI band overlaid.
+         ax = sns.ecdfplot(self.df)
+
+         (_, color, *_) = sns.color_palette()
+         ax.axvline(x=self.df.median(),
+                    color=color,
+                    linestyle='dashed')
+         ax.axvspan(xmin=self.interval.lower,
+                    xmax=self.interval.upper,
+                    alpha=0.15,
+                    color=color)
+         ax.set_xlabel('Pr(M$_{1}$ > M$_{2}$)')
+
+         return ax
+
+ def cplot(df, ci=0.95):
+     def _plot(model_1, model_2):
+         cp = ComparisonPlotter(df, model_1, model_2, ci)
+         return cp.plot()
+
+     return _plot
+
+ #
+ # Application layout
+ #
+ with gr.Blocks() as demo:
+     df = load('jerome-white/alpaca-bt-stan')
+
+     gr.Markdown('# Alpaca Bradley–Terry')
+     with gr.Row():
+         with gr.Column():
+             gr.Markdown(Path('README.md').read_text())
+
+         with gr.Column():
+             plotter = RankPlotter(df)
+             gr.Plot(plotter.plot())
+
+     with gr.Row():
+         view = rank(summarize(df), False)
+         columns = { x: f'HDI {x}' for x in HDI._fields }
+         for i in view.columns:
+             columns.setdefault(i, i.title())
+         view = (view
+                 .rename(columns=columns)
+                 .style.format(precision=4))
+
+         gr.Dataframe(view)
+
+     with gr.Row():
+         with gr.Column(scale=3):
+             display = gr.Plot()
+
+     with gr.Row():
+         with gr.Column():
+             gr.Markdown('''
+
+             Probability that Model 1 is preferred to Model 2. The
+             solid blue curve is a CDF of that distribution;
+             formally the inverse logit of the difference in model
+             abilities. The dashed orange vertical line is the
+             median, while the band surrounding it is its 95%
+             [highest density
+             interval](https://cran.r-project.org/package=HDInterval).
+
+             ''')
+         with gr.Column():
+             models = sorted(df['model'].unique(), key=lambda x: x.lower())
+             drops = ft.partial(gr.Dropdown, choices=models)
+             inputs = [ drops(label=f'Model {x}') for x in range(1, 3) ]
+
+             button = gr.Button(value='Compare!')
+             button.click(cplot(df), inputs=inputs, outputs=[display])
+
+     with gr.Accordion('Disclaimer', open=False):
+         gr.Markdown(Path('DISCLAIMER.md').read_text())
+
+ demo.launch()
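
As a usage sketch (not part of the commit), the comparison pipeline in `app.py` can be exercised outside of Gradio; `model-a` and `model-b` are placeholder names for models present in the posterior dataset:

```python
from datasets import load_dataset
from scipy.special import expit

# Mirror load(): keep the posterior samples of model ability (alpha).
df = (load_dataset('jerome-white/alpaca-bt-stan')
      .get('train')
      .to_pandas()
      .query('parameter == "alpha"'))

# Mirror compare(): one column of draws per model, aligned by MCMC draw.
wide = df.pivot(index=['chain', 'sample'],
                columns='model',
                values='value')

# Posterior distribution of Pr(model-a preferred to model-b);
# 'model-a' and 'model-b' are placeholders for real model names.
p = expit(wide['model-a'] - wide['model-b'])
print(p.median())
```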
requirements.txt ADDED
@@ -0,0 +1,5 @@
+ datasets
+ gradio
+ pandas
+ scipy
+ seaborn