Clémentine committed on
Commit 21e3c8a · 1 Parent(s): bed6c0a

New content

Files changed (1)
  1. content.py +4 -2
content.py CHANGED
@@ -9,10 +9,12 @@ GAIA is a benchmark which aims at evaluating next-generation LLMs (LLMs with aug
  ## Context
  GAIA is made of more than 450 non-trivial question with an unambiguous answer, requiring different levels of tooling and autonomy to solve. GAIA data can be found in this space (https://huggingface.co/datasets/gaia-benchmark/GAIA). Questions are contained in `metadata.jsonl`. Some questions come with an additional file, that can be found in the same folder and whose id is given in the field `file_name`.

- It is divided in 3 levels, where level 1 should be breakable by very good LLMs, and level 3 indicate a strong jump in model capabilities, each divided into two sets: a fully public dev set, on which people can test their models, and a test set with private answers and metadata. Scores are expressed as the percentage of correct answers for a given split.
+ It is divided in 3 levels, where level 1 should be breakable by very good LLMs, and level 3 indicate a strong jump in model capabilities, each divided into a fully public dev set for validation, and a test set with private answers and metadata.
+ Results can be submitted for both validation and test. Scores are expressed as the percentage of correct answers for a given split.
+

  ## Submissions
- Results can be submitted for both validation and test. We expect submissions to be json-line files with the following format. The first two fields are mandatory, `reasoning_trace` is optionnal:
+ We expect submissions to be json-line files with the following format. The first two fields are mandatory, `reasoning_trace` is optionnal:
  ```
  {"task_id": "task_id_1", "model_answer": "Answer 1 from your model", "reasoning_trace": "The different steps by which your model reached answer 1"}
  {"task_id": "task_id_2", "model_answer": "Answer 2 from your model", "reasoning_trace": "The different steps by which your model reached answer 2"}
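
For readers preparing a submission, the format described in the changed text can be sketched as follows. This is a minimal illustration, not the leaderboard's own code: the helper names (`write_submission`, `read_submission`, `score_percentage`) and the file name are hypothetical, and the exact-match comparison is an assumption, since the text only states that scores are the percentage of correct answers for a split.

```python
import json
from pathlib import Path


def write_submission(predictions, path):
    """Write one JSON object per line (json-line format).

    `task_id` and `model_answer` are always written; `reasoning_trace`
    is optional and only emitted when the prediction provides it.
    """
    with Path(path).open("w", encoding="utf-8") as f:
        for pred in predictions:
            row = {
                "task_id": pred["task_id"],
                "model_answer": pred["model_answer"],
            }
            if pred.get("reasoning_trace") is not None:
                row["reasoning_trace"] = pred["reasoning_trace"]
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_submission(path):
    """Read a json-line file back into a list of dicts, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def score_percentage(rows, gold):
    """Percentage of tasks in `gold` that were answered correctly.

    ASSUMPTION: an answer counts as correct when it equals the reference
    string after stripping whitespace. The real scorer may normalise
    answers differently.
    """
    if not gold:
        return 0.0
    answers = {row["task_id"]: row["model_answer"] for row in rows}
    correct = sum(
        1
        for task_id, reference in gold.items()
        if answers.get(task_id, "").strip() == reference.strip()
    )
    return 100.0 * correct / len(gold)
```

Calling `write_submission(predictions, "submission.jsonl")` with a list of dicts produces a file in the shape shown in the diff. `score_percentage` is only useful locally on the public dev set, because the test-set answers are private.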