File size: 2,218 Bytes
47c3ae2
db7f350
 
 
5f9d165
db7f350
a86b728
db7f350
47c3ae2
db7f350
5f9d165
 
 
 
 
 
 
 
db7f350
5f9d165
db7f350
5f9d165
db7f350
 
 
47c3ae2
db7f350
47c3ae2
db7f350
 
 
3d87820
 
 
 
 
5f9d165
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
TITLE = """<h1 align="center" id="space-title">GAIA Leaderboard</h1>"""

CANARY_STRING = "" # TODO

INTRODUCTION_TEXT = """
Large language models have seen their potential capabilities increased by several orders of magnitude with the introduction of augmentations, from simple prompting adjustement to actual external tooling (calculators, vision models, ...) or online web retrieval.
To evaluate the next generation of LLMs, we argue for a new kind of benchmark, simple and yet effective to measure actual progress on augmented capabilities, and therefore present GAIA. Details in the paper.

GAIA is made of 3 evaluation levels, depending on the added level of tooling and autonomy the model needs.
We expect the level 1 to be breakable by very good LLMs, and the level 3 to indicate a strong jump in model capabilities.
Each of these levels is divided into two sets: a fully public dev set, on which people can test their models, and a test set with private answers and metadata. Results can be submitted for both validation and test.

We expect submissions to be json-line files with the following format:
```
{"task_id": "task_id_1", "model_answer": "Answer 1 from your model", "reasoning_trace": "The different steps by which your model reached answer 1"}
{"task_id": "task_id_2", "model_answer": "Answer 2 from your model", "reasoning_trace": "The different steps by which your model reached answer 2"}
...
```

Scores are expressed as the percentage of correct answers for a given split.

Please do not repost the public dev set, nor use it in training data for your models.
"""

CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
CITATION_BUTTON_TEXT = r"""@misc{gaia, # TODO
  author = {tbd},
  title = {General AI Assistant benchamrk},
  year = {2023},
}"""


def format_warning(msg):
    return f"<p style='color: orange; font-size: 20px; text-align: center;'>{msg}</p>"

def format_log(msg):
    return f"<p style='color: green; font-size: 20px; text-align: center;'>{msg}</p>"

def model_hyperlink(link, model_name):
    return f'<a target="_blank" href="{link}" style="color: var(--link-text-color); text-decoration: underline;text-decoration-style: dotted;">{model_name}</a>'