TITLE = """<h1 align="center" id="space-title">GAIA Leaderboard</h1>"""

INTRODUCTION_TEXT = """
GAIA is a benchmark which aims at evaluating next-generation LLMs (LLMs with augmented capabilities due to added tooling, efficient prompting, access to search, etc.). See our paper for more details.

## Context
GAIA is made of more than 450 non-trivial questions with an unambiguous answer, requiring different levels of tooling and autonomy to solve. The GAIA data can be found in this dataset (https://huggingface.co/datasets/gaia-benchmark/GAIA). Questions are contained in `metadata.jsonl`. Some questions come with an additional file, which can be found in the same folder and whose name is given in the field `file_name`.
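For instance, the data can be loaded with the `datasets` library. This is a minimal sketch: the `2023_all` configuration and the column names used below are assumptions to verify against the dataset card, and you may need to authenticate with the Hugging Face Hub first.
```python
from datasets import load_dataset

# Assumed configuration and column names; check the dataset card for the exact schema.
validation = load_dataset("gaia-benchmark/GAIA", "2023_all", split="validation")

example = validation[0]
print(example["task_id"], example["Level"], example["Question"], example["file_name"])
```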

The benchmark is divided into 3 levels, where level 1 should be breakable by very good LLMs and level 3 indicates a strong jump in model capabilities. Each level is split into a fully public dev set for validation, and a test set with private answers and metadata.

# Submissions
Results can be submitted for both validation and test. Scores are expressed as the percentage of correct answers for a given split. 

We expect submissions to be JSON Lines files in the following format. The first two fields are mandatory; `reasoning_trace` is optional:
```
{"task_id": "task_id_1", "model_answer": "Answer 1 from your model", "reasoning_trace": "The different steps by which your model reached answer 1"}
{"task_id": "task_id_2", "model_answer": "Answer 2 from your model", "reasoning_trace": "The different steps by which your model reached answer 2"}
```
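For example, such a file could be written as follows (a minimal sketch; the `predictions` list and the `submission.jsonl` filename are placeholders for your own outputs):
```python
import json

# Hypothetical outputs from your model: (task_id, answer, reasoning trace) triples.
predictions = [
    ("task_id_1", "Answer 1 from your model", "The different steps by which your model reached answer 1"),
    ("task_id_2", "Answer 2 from your model", "The different steps by which your model reached answer 2"),
]

with open("submission.jsonl", "w") as f:
    for task_id, answer, trace in predictions:
        row = {"task_id": task_id, "model_answer": answer, "reasoning_trace": trace}
        f.write(json.dumps(row) + "\n")
```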
Submissions made by our team are labelled "GAIA authors". While we report average scores over several runs in our paper when possible, the leaderboard only reports the best run.

**Please do not repost the public dev set, nor use it in training data for your models.**
"""

CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
CITATION_BUTTON_TEXT = r"""@misc{gaia, # TODO
  author = {tbd},
  title = {General AI Assistant benchmark},
  year = {2023},
}"""


# Helpers to render status messages as centered, colored HTML snippets for the leaderboard UI.
def format_error(msg):
    return f"<p style='color: red; font-size: 20px; text-align: center;'>{msg}</p>"

def format_warning(msg):
    return f"<p style='color: orange; font-size: 20px; text-align: center;'>{msg}</p>"

def format_log(msg):
    return f"<p style='color: green; font-size: 20px; text-align: center;'>{msg}</p>"

# Render a model name as an HTML link that opens in a new tab, styled with a dotted underline.
def model_hyperlink(link, model_name):
    return f'<a target="_blank" href="{link}" style="color: var(--link-text-color); text-decoration: underline;text-decoration-style: dotted;">{model_name}</a>'