import streamlit as st
import requests
from collections import defaultdict
import pandas as pd
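
# This script is the leaderboard's Streamlit app; assuming it is saved as
# app.py, it can be launched locally with `streamlit run app.py`.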

header = """Toloka compares and ranks LLM output in multiple categories, using Guanaco 13B as the baseline.

We use human evaluation to rate model responses to real prompts."""

description = """The Toloka LLM leaderboard provides a human evaluation framework. Here, we invite annotators from the [Toloka](https://toloka.ai/) crowdsourcing platform to assess the model's responses. For this purpose, responses are generated by open-source LLMs based on a dataset of real-world user prompts. These prompts are categorized as per the [InstructGPT paper](https://arxiv.org/abs/2203.02155). Subsequently, annotators evaluate these responses in the manner of [AlpacaEval](https://tatsu-lab.github.io/alpaca_eval/). It's worth noting that we employ [Guanaco 13B](https://huggingface.co/timdettmers/guanaco-13b) instead of text-davinci-003. This is because Guanaco 13B is the closest counterpart to the now-deprecated text-davinci-003 in AlpacaEval.
The metrics on the leaderboard represent the win rate of the respective model in comparison to Guanaco 13B across various prompt categories. The "all" category denotes the aggregation of all prompts and is not a mere average of metrics from individual categories.

### Methodology

#### Which Models Did We Select and Why?

We evaluate open-source models available on the Hugging Face Hub, along with OpenAI models to provide reference values. The initial list consists of the most popular models; we expect the community to suggest additional models to add.

#### How Did We Collect the Prompts and Why?

We find it’s tricky to use open-source datasets of prompts due to the following issues:
1. They are often too simple and don't reflect real users' needs or the wide range of forms in which users phrase their tasks for LLMs.
2. Open-source datasets might end up, accidentally or deliberately, in the training sets of LLMs, so evaluation on them won't be reliable.

To mitigate these issues, we collected our own dataset consisting of prompts that Toloka employees sent to ChatGPT, along with paraphrased real-world conversations with ChatGPT that we found on the internet. This way we ensure that the prompts represent real-world use cases and have not leaked into LLM training sets. For the same reasons, we decided not to release the full evaluation set.

Distribution of prompts by categories:

* Brainstorming: 15.48%
* Chat: 1.59%
* Classification: 0.2%
* Closed QA: 3.77%
* Extraction: 0.6%
* Generation: 38.29%
* Open QA: 32.94%
* Rewrite: 5.16%
* Summarization: 1.98%

We report win rates only for categories with enough prompts to make the comparison fair.


#### How Did We Set Up Human Evaluation?

Annotators on the Toloka crowdsourcing platform are given a prompt and responses to this prompt from two different models: the reference model and the model we are evaluating. Annotators then choose the better response according to harmlessness, truthfulness, and helpfulness. In short, we follow the AlpacaEval scheme, but instead of GPT-4 we use real human annotators.
"""

pretty_category_names = {
    "all": "Total",
    "brainstorming": "Brainstorming",
    "closed_qa": "Closed QA",
    "generation": "Generation",
    "open_qa": "Open QA",
    "rewrite": "Rewrite",
}

pretty_model_names = {
    "gpt-4": "GPT-4",
    "WizardLM/WizardLM-13B-V1.2": "WizardLM 13B V1.2",
    "meta-llama/Llama-2-70b-chat-hf": "LLaMA 2 70B Chat",
    "gpt-3.5-turbo": "GPT-3.5 Turbo",
    "lmsys/vicuna-33b-v1.3": "Vicuna 33B V1.3",
    "timdettmers/guanaco-13b": "Guanaco 13B",
}

reference_model_name = "timdettmers/guanaco-13b"


# Expected JSON structure: {category: [{"model": ..., "rating": ...}, ...], ...}
leaderboard_results = requests.get("https://llmleaderboard.blob.core.windows.net/llmleaderboard/evaluation_resuls.json").json()
# Sort the categories before building the pretty column names so that the
# column headers stay aligned with the per-category values appended to each row.
categories = sorted(leaderboard_results.keys())
pretty_categories = [pretty_category_names[category] for category in categories if category in pretty_category_names]
models = set()


# Map each model to its per-category rating for quick lookup when building the table.
model_ratings = defaultdict(dict)
for category in categories:
    for entry in leaderboard_results[category]:
        model = entry['model']
        models.add(model)
        model_ratings[model][category] = entry['rating']


# Build one row per model: the model name followed by its win rate (%) in each
# displayed category; categories with no result for the model fall back to 0.0.
table = []

for model in models:
    row = [model]
    for category in categories:
        if category not in pretty_category_names:
            continue
        if category not in model_ratings[model]:
            row.append(0.0)
        else:
            row.append(model_ratings[model][category] * 100)
    table.append(row)

table = pd.DataFrame(table, columns=['Model'] + pretty_categories)
table = table.sort_values(by=['Total'], ascending=False)
table = table.head(5)

# Add row with reference model
row = [reference_model_name] + [50.0] * len(pretty_categories)
table = pd.concat([table, pd.DataFrame([pd.Series(row, index=table.columns)])], ignore_index=True)
table = table.sort_values(by=['Total'], ascending=False)

table.index = ["🥇 1", "🥈 2", "🥉 3"] + list(range(4, len(table) + 1))

for category in pretty_category_names.values():
    table[category] = table[category].map('{:,.2f}%'.format)

# Average response length in tokens; the reference model's counts are stored
# under the "TheBloke/guanaco-13B-HF" key in token_count.json.
avg_token_counts = requests.get("https://llmleaderboard.blob.core.windows.net/llmleaderboard/token_count.json").json()
table['Avg. Response Length'] = [int(avg_token_counts[model]) if model != reference_model_name else int(avg_token_counts["TheBloke/guanaco-13B-HF"]) for model in table['Model']]
table['HF Hub Link'] = [f"https://huggingface.co/{model}" if "/" in model else "" for model in table["Model"]]
table["Model"] = [pretty_model_names[model] if model in pretty_model_names else model for model in table["Model"]]

st.set_page_config(layout="wide")
st.title('Toloka LLM Leaderboard')
st.markdown(header)
st.dataframe(
    table,
    column_config={
        "HF Hub Link": st.column_config.LinkColumn(
            "HF Hub Link",
            help="HF Hub Link",
        )
    }
)
st.markdown(description)