Gregor Betz committed
Commit db734a2 • 1 Parent(s): 3b4e965

Link to notebook

src/display/about.py +2 -1
src/display/about.py CHANGED
@@ -46,6 +46,7 @@ To assess the reasoning skill of a given `model`, we carry out the following ste
 
 Each `regime` yields a different _accuracy gain Δ_, and the leaderboard reports (for every `model`/`task`) the best Δ achieved by any regime. All models are evaluated against the same set of regimes.
 
+A notebook with detailed result exploration and visualization is available [here](https://github.com/logikon-ai/cot-eval/blob/main/notebooks/CoT_Leaderboard_Results_Exploration.ipynb).
 
 ## How is it different from other leaderboards?
 
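The best-Δ selection described in the hunk above is easy to state in code. Here is a minimal sketch, assuming per-regime baseline and CoT accuracies are available as a plain dict; the field names and data layout are illustrative assumptions, not the leaderboard's actual implementation:

```python
# Hypothetical sketch of the best-Δ selection described above.
# The per-regime score layout is an assumed data model, not the
# leaderboard's actual code.

def best_accuracy_gain(results: dict[str, dict[str, float]]) -> tuple[str, float]:
    """Return (regime, Δ) with the largest Δ = accuracy_cot - accuracy_baseline
    for one model/task pair."""
    deltas = {
        regime: scores["accuracy_cot"] - scores["accuracy_baseline"]
        for regime, scores in results.items()
    }
    best = max(deltas, key=deltas.get)
    return best, deltas[best]

# Example: three CoT regimes evaluated for one model/task (made-up numbers).
results = {
    "regime_a": {"accuracy_baseline": 0.42, "accuracy_cot": 0.49},
    "regime_b": {"accuracy_baseline": 0.42, "accuracy_cot": 0.55},
    "regime_c": {"accuracy_baseline": 0.42, "accuracy_cot": 0.51},
}
regime, delta = best_accuracy_gain(results)
print(f"best regime: {regime}, Δ = {delta:.2f}")  # best regime: regime_b, Δ = 0.13
```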
@@ -68,7 +69,7 @@ Unlike these leaderboards, the `/\/` Open CoT Leaderboard assesses a model's abi
 
 ## Test dataset selection (`tasks`)
 
-The test dataset
+The test dataset problems in the CoT Leaderboard can be solved through clear thinking alone; no specific knowledge is required to do so. They are subsets of the [AGIEval benchmark](https://github.com/ruixiangcui/AGIEval) and re-published as [`logikon-bench`](https://huggingface.co/datasets/logikon/logikon-bench). The `logiqa` dataset has been newly translated from Chinese to English.
 
 
 ## Reproducibility
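Since the tasks are re-published on the Hugging Face Hub, they can be pulled with the `datasets` library. A minimal sketch, assuming the `logiqa` config named in the text above and a standard `test` split (the split name is an assumption):

```python
from datasets import load_dataset

# Load one task from logikon-bench (dataset linked in the hunk above).
# "logiqa" is named in the text; the "test" split is an assumption.
ds = load_dataset("logikon/logikon-bench", "logiqa", split="test")
print(ds[0])  # inspect one multiple-choice problem
```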