Gregor Betz committed
Commit 44ef4de • 1 Parent(s): 689a5b6
descr.
Files changed: src/display/about.py (+27 -3)
src/display/about.py CHANGED
@@ -26,15 +26,38 @@ TITLE = """<h1 align="center" id="space-title"><code>/\/</code> Open CoT
 
 # What does your leaderboard evaluate?
 INTRODUCTION_TEXT = """
-
+The `/\/` Open CoT Leaderboard tracks the reasoning skills of LLMs, measured as their ability to generate **effective chain-of-thought reasoning traces**.
+
+The leaderboard reports **accuracy gains** achieved by using CoT, i.e.
+
+> _accuracy gain Δ_ = _CoT accuracy_ - _baseline accuracy_.
+
+See the "About" tab for more details and motivation.
 """
 
 # Which evaluations are you running? how can people reproduce what you have?
-LLM_BENCHMARKS_TEXT =
+LLM_BENCHMARKS_TEXT = """
 ## How it works
 
+A CoT `regime` consists in a prompt chain and decoding parameters used to generate a reasoning trace. To assess the reasoning skill of a given `model`, we carry out the following steps for each `task` (test dataset) and each `regime`:
+
+1. Generate CoT reasoning traces for all problems in the test dataset with `model` and according to `regime`.
+2. Let the model answer the test dataset problems and record the resulting _baseline accuracy_.
+3. Let the model answer the test dataset problems _with the reasoning traces appended_ to the prompt and record the resulting _CoT accuracy_.
+4. Compute the _accuracy gain Δ_ = _CoT accuracy_ - _baseline accuracy_ for the given `model`, `task`, and `regime`.
+
+Each regime has a different accuracy gain Δ, and the leaderboard reports the best Δ achieved by a regime.
+
+
+## How is it different from other leaderboards?
+
+...
+
+## Test dataset selection (`tasks`)
+
+
 ## Reproducibility
-To reproduce our results,
+To reproduce our results, check out the repository [cot-eval](https://github.com/logikon-ai/cot-eval).
 
 """
 
@@ -75,4 +98,5 @@ We're populating the Open CoT Leaderboard step by step. The idea is to grow a di
 
 CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
 CITATION_BUTTON_TEXT = r"""
+Logikon AI Team. (2024). Open CoT Leaderboard. Retrieved from https://huggingface.co/spaces/logikon/open_cot_leaderboard
 """
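To make the evaluation procedure described in the new `LLM_BENCHMARKS_TEXT` concrete, here is a minimal, hypothetical Python sketch of steps 2-4 and of the best-Δ reporting; step 1 (trace generation) is assumed to have already produced the per-regime traces. The `Problem` dataclass, the `answer` callable, and the stub model are illustrative assumptions and not part of the cot-eval codebase.

```python
# Illustrative sketch only (not the cot-eval implementation): computes the
# accuracy gain Δ = CoT accuracy - baseline accuracy for one model/task/regime
# and the best Δ across regimes, as described in LLM_BENCHMARKS_TEXT.
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Problem:  # hypothetical container for one test-dataset item
    question: str
    gold_answer: str


def accuracy(problems: list[Problem],
             answer: Callable[[str, Optional[str]], str],
             traces: Optional[list[str]] = None) -> float:
    """Fraction of problems answered correctly, optionally with CoT traces appended."""
    correct = 0
    for i, problem in enumerate(problems):
        trace = traces[i] if traces is not None else None
        if answer(problem.question, trace) == problem.gold_answer:
            correct += 1
    return correct / len(problems)


def accuracy_gain(problems, answer, traces) -> float:
    baseline_acc = accuracy(problems, answer)     # step 2: baseline accuracy
    cot_acc = accuracy(problems, answer, traces)  # step 3: CoT accuracy
    return cot_acc - baseline_acc                 # step 4: accuracy gain Δ


def best_gain(problems, answer, traces_per_regime: dict[str, list[str]]) -> float:
    # The leaderboard reports the best Δ achieved by any regime.
    return max(accuracy_gain(problems, answer, traces)
               for traces in traces_per_regime.values())


if __name__ == "__main__":
    # Toy demo with a stub "model" that only answers correctly when given a trace.
    def stub_model(question: str, trace: Optional[str] = None) -> str:
        return trace.split()[-1].rstrip(".") if trace else "0"

    problems = [Problem("1+1=?", "2"), Problem("2+2=?", "4")]
    traces_per_regime = {"regime-A": ["1 plus 1 is 2.", "2 plus 2 is 4."]}
    print(best_gain(problems, stub_model, traces_per_regime))  # 1.0 (baseline 0.0, CoT 1.0)
```

Taking the maximum over regimes mirrors the reporting rule stated above: a model's leaderboard entry reflects its best CoT regime rather than any single fixed prompt chain.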