Clémentine committed • Commit bed6c0a • Parent(s): bb6c22e

New content

content.py CHANGED (+8 -18)
@@ -3,33 +3,23 @@ TITLE = """<h1 align="center" id="space-title">GAIA Leaderboard</h1>"""
CANARY_STRING = "" # TODO

INTRODUCTION_TEXT = """
+GAIA is a benchmark which aims to evaluate next-generation LLMs (LLMs with augmented capabilities due to added tooling, efficient prompting, access to search, etc.).
+(See our paper for more details.)

-
+## Context
+GAIA is made of more than 450 non-trivial questions with unambiguous answers, requiring different levels of tooling and autonomy to solve. GAIA data can be found in this space (https://huggingface.co/datasets/gaia-benchmark/GAIA). Questions are contained in `metadata.jsonl`. Some questions come with an additional file, which can be found in the same folder and whose id is given in the field `file_name`.

-
-To evaluate the next generation of LLMs, we argue for a new kind of benchmark, simple yet effective for measuring actual progress on augmented capabilities, and therefore present GAIA. Details in the paper.
+It is divided into 3 levels, where level 1 should be breakable by very good LLMs and level 3 indicates a strong jump in model capabilities. Each level is divided into two sets: a fully public dev set, on which people can test their models, and a test set with private answers and metadata. Scores are expressed as the percentage of correct answers for a given split.

-
-
-Each of these levels is divided into two sets: a fully public dev set, on which people can test their models, and a test set with private answers and metadata. Results can be submitted for both validation and test.
-
-# Data
-
-GAIA data can be found in this space (https://huggingface.co/datasets/gaia-benchmark/GAIA). It consists of ~466 questions distributed across two splits, with a similar distribution of Levels. Questions are contained in `metadata.jsonl`. Some questions come with an additional file, which can be found in the same folder and whose id is given in the field `file_name`.
-
-# Submissions
-
-We expect submissions to be JSON-lines files with the following format. The first two fields are mandatory; `reasoning_trace` is optional:
+## Submissions
+Results can be submitted for both validation and test. We expect submissions to be JSON-lines files with the following format. The first two fields are mandatory; `reasoning_trace` is optional:
```
{"task_id": "task_id_1", "model_answer": "Answer 1 from your model", "reasoning_trace": "The different steps by which your model reached answer 1"}
{"task_id": "task_id_2", "model_answer": "Answer 2 from your model", "reasoning_trace": "The different steps by which your model reached answer 2"}
-...
```
-
-Scores are expressed as the percentage of correct answers for a given split.
Submissions made by our team are labelled "GAIA authors". While we report average scores over different runs when possible in our paper, we only report the best run in the leaderboard.

-Please do not repost the public dev set, nor use it in training data for your models
+**Please do not repost the public dev set, nor use it in training data for your models.**
"""

CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
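As a quick illustration of the submission format above, a valid JSON-lines file can be written with the standard library alone. This is a minimal sketch, not part of the commit: the `predictions` dict and the `submission.jsonl` filename are placeholders for your own outputs.

```python
import json

# Placeholder model outputs keyed by GAIA task_id: (answer, reasoning trace).
predictions = {
    "task_id_1": ("Answer 1 from your model", "The different steps by which your model reached answer 1"),
    "task_id_2": ("Answer 2 from your model", "The different steps by which your model reached answer 2"),
}

# Write one JSON object per line; task_id and model_answer are mandatory,
# reasoning_trace is optional and may be omitted from the record.
with open("submission.jsonl", "w", encoding="utf-8") as f:
    for task_id, (answer, trace) in predictions.items():
        record = {
            "task_id": task_id,
            "model_answer": answer,
            "reasoning_trace": trace,
        }
        f.write(json.dumps(record) + "\n")
```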
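Similarly, "percentage of correct answers for a given split" reduces to simple bookkeeping. The sketch below is illustrative only: it assumes a naive case-insensitive string match and a hypothetical `gold_answers` dict standing in for the reference answers, whereas the leaderboard's own scorer may normalize answers before comparing.

```python
def split_score(predictions: dict[str, str], gold_answers: dict[str, str]) -> float:
    """Return the percentage of questions in the split answered correctly.

    Naive comparison for illustration; the official scorer may apply its own
    answer normalization before matching.
    """
    correct = sum(
        predictions.get(task_id, "").strip().lower() == answer.strip().lower()
        for task_id, answer in gold_answers.items()
    )
    return 100.0 * correct / len(gold_answers)


# Toy example: one of the two answers matches, so the score is 50.0.
print(split_score({"task_id_1": "Paris", "task_id_2": "42"},
                  {"task_id_1": "paris", "task_id_2": "43"}))
```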