jerome-white committed on
Commit 8d5a97a
1 Parent(s): eb40ea1

Initial commit

Files changed (4)
  1. DISCLAIMER.md +23 -0
  2. README.md +33 -8
  3. app.py +227 -0
  4. requirements.txt +5 -0
DISCLAIMER.md ADDED
@@ -0,0 +1,23 @@
+ # Disclaimer
+
+ This Space is primarily intended for exploration. Until otherwise
+ stated, its results should be treated as points of reference rather
+ than absolute fact. Viewers are encouraged to study the pipeline and
+ understand the model before broadcasting strong opinions about model
+ rankings based on what is seen here. Suggestions for improving this
+ Space from those familiar with Alpaca or Bayesian data analysis are
+ welcome!
+
+ ## Resources
+
+ * [Source code](https://github.com/jerome-white/alpaca-bda) for
+   producing results
+
+ ## TODO
+
+ - [ ] Extend the Stan model to incorporate ties and response
+   presentation ordering
+ - [ ] Add details of the MCMC chains
+ - [ ] Automate data processing
README.md CHANGED
@@ -1,13 +1,38 @@
  ---
- title: Alpaca Bt Eval
- emoji: 🌍
- colorFrom: green
- colorTo: blue
+ title: alpaca-bt-eval
+ app_file: app.py
  sdk: gradio
  sdk_version: 4.19.1
- app_file: app.py
- pinned: false
- license: apache-2.0
  ---
+ [Alpaca](https://github.com/tatsu-lab/alpaca_eval) is an LLM
+ evaluation framework. It maintains a set of prompts, along with
+ responses to those prompts from a collection of LLMs. It then presents
+ pairs of responses to a judge that determines which response better
+ addresses the prompt. Rather than compare all response pairs, the
+ framework designates a baseline model and compares every other model
+ to it. The standard method of ranking models is to sort by win
+ percentage against the baseline.
+
+ This Space presents an alternative ranking method based on the
+ [Bradley–Terry
+ model](https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model)
+ (BT). Given a collection of items, Bradley–Terry estimates the
+ _ability_ of each item based on pairwise comparisons between them. In
+ sports, for example, that might be the ability of a given team based
+ on games that team has played within a league. Once calculated,
+ ability can be used to estimate the probability that one item will be
+ better than another, even if those items have yet to be formally
+ compared.
+
+ The Alpaca project presents a good opportunity to apply BT in
+ practice, especially since BT fits nicely into a Bayesian analysis
+ framework. As LLMs become more pervasive, quantifying the uncertainty
+ in their evaluation is increasingly important, and Bayesian
+ frameworks are good at that.
 
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ This Space is divided into two primary sections. The first presents a
+ ranking of models based on estimated ability: the figure on the right
+ shows this ranking for the top 10 models, while the table below it
+ presents the full set. The second section estimates the probability
+ that one model will be preferred to another. A final section at the
+ bottom is a disclaimer with details about the workflow.
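
To make the BT ranking concrete, here is a minimal sketch of the probability calculation the Space reports. The `alpha` draws below are hypothetical stand-ins; the Space itself uses posterior samples from a Stan fit.

```python
import numpy as np
from scipy.special import expit

# Hypothetical posterior draws of two models' abilities (alpha);
# stand-ins for the Stan samples the Space actually loads.
rng = np.random.default_rng(seed=0)
alpha_1 = rng.normal(loc=1.0, scale=0.2, size=4000)
alpha_2 = rng.normal(loc=0.5, scale=0.2, size=4000)

# Bradley-Terry: Pr(model 1 is preferred to model 2) is the inverse
# logit (expit) of the difference in abilities.
p = expit(alpha_1 - alpha_2)
print(p.mean())  # roughly 0.62 with these stand-in draws
```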
app.py ADDED
@@ -0,0 +1,227 @@
+ import math
+ import operator as op
+ import itertools as it
+ import functools as ft
+ import collections as cl
+ from pathlib import Path
+
+ import pandas as pd
+ import gradio as gr
+ import seaborn as sns
+ from datasets import load_dataset
+ from scipy.special import expit
+
+ HDI = cl.namedtuple('HDI', 'lower, upper')
+
+ #
+ # Highest density interval: the narrowest interval containing `ci`
+ # of the values. See https://cran.r-project.org/package=HDInterval
+ #
+ def hdi(values, ci=0.95):
+     values = sorted(filter(math.isfinite, values))
+     if not values:
+         raise ValueError('Empty data set')
+
+     n = len(values)
+     exclude = n - math.floor(n * ci)
+
+     # Pair each of the lowest `exclude` values with the value that
+     # closes an interval covering `ci` of the sorted data ...
+     left = it.islice(values, exclude)
+     right = it.islice(values, n - exclude, None)
+
+     # ... and keep the narrowest such interval.
+     diffs = ((x, y, y - x) for (x, y) in zip(left, right))
+     (*args, _) = min(diffs, key=op.itemgetter(-1))
+
+     return HDI(*args)
+
+ #
+ # Data loading and posterior summaries
+ #
+ def load(repo):
+     parameter = 'parameter'
+     items = [
+         'chain',
+         'sample',
+         parameter,
+         'model',
+         'value',
+     ]
+     dataset = load_dataset(repo)
+
+     # Keep only the "alpha" (ability) samples.
+     return (dataset
+             .get('train')
+             .to_pandas()
+             .filter(items=items)
+             .query(f'{parameter} == "alpha"')
+             .drop(columns=parameter))
+
+ def summarize(df, ci=0.95):
+     # One row per model: median ability, HDI bounds, and HDI width.
+     def _aggregate(i, g):
+         values = g['value']
+         interval = hdi(values, ci)
+
+         agg = {
+             'model': i,
+             'ability': values.median(),
+             'uncertainty': interval.upper - interval.lower,
+         }
+         agg.update(interval._asdict())
+
+         return agg
+
+     groups = df.groupby('model', sort=False)
+     records = it.starmap(_aggregate, groups)
+
+     return pd.DataFrame.from_records(records)
+
+ def rank(df, ascending, name='rank'):
+     # Order by ability, breaking ties with the narrower HDI, then
+     # expose the resulting 1-based position as a column.
+     df = (df
+           .sort_values(by=['ability', 'uncertainty'],
+                        ascending=[ascending, not ascending])
+           .drop(columns='uncertainty')
+           .reset_index(drop=True))
+     df.index += 1
+
+     return df.reset_index(names=name)
+
+ def compare(df, model_1, model_2):
+     mcol = 'model'
+     models = [
+         model_1,
+         model_2,
+     ]
+     # One column of posterior draws per model, aligned by MCMC draw;
+     # the inverse logit of their difference is Pr(model_1 is
+     # preferred to model_2) under BT.
+     view = (df
+             .query(f'{mcol} in @models')
+             .pivot(index=['chain', 'sample'],
+                    columns=mcol,
+                    values='value'))
+
+     return expit(view[model_1] - view[model_2])
+
+ #
+ # Plotting helpers
+ #
+ class DataPlotter:
+     def __init__(self, df):
+         self.df = df
+
+     def plot(self):
+         ax = self.draw()
+         ax.grid(visible=True,
+                 axis='both',
+                 alpha=0.25,
+                 linestyle='dotted')
+
+         fig = ax.get_figure()
+         fig.tight_layout()
+
+         return fig
+
+     def draw(self):
+         raise NotImplementedError()
+
+ class RankPlotter(DataPlotter):
+     _y = 'y'
+
+     @ft.cached_property
+     def y(self):
+         return self.df[self._y]
+
+     def __init__(self, df, top=10):
+         # Keep the `top` highest-ability models for the headline figure.
+         view = rank(summarize(df), True, self._y)
+         view = (view
+                 .tail(top)
+                 .sort_values(by=self._y, ascending=False))
+         super().__init__(view)
+
+     def draw(self):
+         # Dot at the median ability, horizontal line spanning the HDI.
+         ax = self.df.plot.scatter('ability', self._y)
+         ax.hlines(self.y,
+                   xmin=self.df['lower'],
+                   xmax=self.df['upper'],
+                   alpha=0.5)
+         ax.set_ylabel('')
+         ax.set_yticks(self.y, self.df['model'])
+
+         return ax
+
+ class ComparisonPlotter(DataPlotter):
+     def __init__(self, df, model_1, model_2, ci=0.95):
+         super().__init__(compare(df, model_1, model_2))
+         self.interval = hdi(self.df, ci)
+
+     def draw(self):
+         # Empirical CDF of Pr(model_1 > model_2), with the median and
+         # its HDI band overlaid.
+         ax = sns.ecdfplot(self.df)
+
+         (_, color, *_) = sns.color_palette()
+         ax.axvline(x=self.df.median(),
+                    color=color,
+                    linestyle='dashed')
+         ax.axvspan(xmin=self.interval.lower,
+                    xmax=self.interval.upper,
+                    alpha=0.15,
+                    color=color)
+         ax.set_xlabel('Pr(M$_{1}$ > M$_{2}$)')
+
+         return ax
+
+ def cplot(df, ci=0.95):
+     def _plot(model_1, model_2):
+         cp = ComparisonPlotter(df, model_1, model_2, ci)
+         return cp.plot()
+
+     return _plot
+
+ #
+ # Application layout
+ #
+ with gr.Blocks() as demo:
+     df = load('jerome-white/alpaca-bt-stan')
+
+     gr.Markdown('# Alpaca Bradley–Terry')
+     with gr.Row():
+         with gr.Column():
+             gr.Markdown(Path('README.md').read_text())
+
+         with gr.Column():
+             plotter = RankPlotter(df)
+             gr.Plot(plotter.plot())
+
+     with gr.Row():
+         view = rank(summarize(df), False)
+         columns = { x: f'HDI {x}' for x in HDI._fields }
+         for i in view.columns:
+             columns.setdefault(i, i.title())
+         view = (view
+                 .rename(columns=columns)
+                 .style.format(precision=4))
+
+         gr.Dataframe(view)
+
+     with gr.Row():
+         with gr.Column(scale=3):
+             display = gr.Plot()
+
+     with gr.Row():
+         with gr.Column():
+             gr.Markdown('''
+
+             Probability that Model 1 is preferred to Model 2. The
+             solid blue curve is a CDF of that distribution;
+             formally the inverse logit of the difference in model
+             abilities. The dashed orange vertical line is the
+             median, while the band surrounding it is its 95%
+             [highest density
+             interval](https://cran.r-project.org/package=HDInterval).
+
+             ''')
+         with gr.Column():
+             models = sorted(df['model'].unique(), key=lambda x: x.lower())
+             drops = ft.partial(gr.Dropdown, choices=models)
+             inputs = [ drops(label=f'Model {x}') for x in range(1, 3) ]
+
+             button = gr.Button(value='Compare!')
+             button.click(cplot(df), inputs=inputs, outputs=[display])
+
+     with gr.Accordion('Disclaimer', open=False):
+         gr.Markdown(Path('DISCLAIMER.md').read_text())
+
+ demo.launch()
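
As a usage sketch (not part of the commit), the comparison pipeline in `app.py` can be exercised outside of Gradio; `model-a` and `model-b` are placeholder names for models present in the posterior dataset:

```python
from datasets import load_dataset
from scipy.special import expit

# Mirror load(): keep the posterior samples of model ability (alpha).
df = (load_dataset('jerome-white/alpaca-bt-stan')
      .get('train')
      .to_pandas()
      .query('parameter == "alpha"'))

# Mirror compare(): one column of draws per model, aligned by MCMC draw.
wide = df.pivot(index=['chain', 'sample'],
                columns='model',
                values='value')

# Posterior distribution of Pr(model-a preferred to model-b);
# 'model-a' and 'model-b' are placeholders for real model names.
p = expit(wide['model-a'] - wide['model-b'])
print(p.median())
```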
requirements.txt ADDED
@@ -0,0 +1,5 @@
+ datasets
+ gradio
+ pandas
+ scipy
+ seaborn