Commit 7abc6a7
1 Parent(s): 96d111a
Update benchmark count and fix typo (`inetuning->finetuning`) (#395)
- Update benchmark count and fix typo (`inetuning->finetuning`) (cdeea55b7621c0b1fa7515a40bf2fb50df62d5d7)
Co-authored-by: Alvaro Bartolome <alvarobartt@users.noreply.huggingface.co>
- src/display/about.py +2 -2
src/display/about.py
@@ -28,7 +28,7 @@ If there is no icon, we have not uploaded the information on the model yet, feel
 
 ## How it works
 
-📈 We evaluate models on
+📈 We evaluate models on 7 key benchmarks using the <a href="https://github.com/EleutherAI/lm-evaluation-harness" target="_blank"> Eleuther AI Language Model Evaluation Harness </a>, a unified framework to test generative language models on a large number of different evaluation tasks.
 
 - <a href="https://arxiv.org/abs/1803.05457" target="_blank"> AI2 Reasoning Challenge </a> (25-shot) - a set of grade-school science questions.
 - <a href="https://arxiv.org/abs/1905.07830" target="_blank"> HellaSwag </a> (10-shot) - a test of commonsense inference, which is easy for humans (~95%) but challenging for SOTA models.
@@ -67,7 +67,7 @@ The tasks and few shots parameters are:
 Side note on the baseline scores:
 - for log-likelihood evaluation, we select the random baseline
 - for DROP, we select the best submission score according to [their leaderboard](https://leaderboard.allenai.org/drop/submissions/public) when the paper came out (NAQANet score)
-- for GSM8K, we select the score obtained in the paper after
+- for GSM8K, we select the score obtained in the paper after finetuning a 6B model on the full GSM8K training set for 50 epochs
 
 ## Quantization
 To get more information about quantization, see: