clefourrier (HF staff) and alvarobartt (HF staff) committed
Commit 7abc6a7 • Parent(s): 96d111a

Update benchmark count and fix typo (`inetuning->finetuning`) (#395)


- Update benchmark count and fix typo (`inetuning->finetuning`) (cdeea55b7621c0b1fa7515a40bf2fb50df62d5d7)


Co-authored-by: Alvaro Bartolome <alvarobartt@users.noreply.huggingface.co>

Files changed (1): src/display/about.py (+2 -2)
src/display/about.py CHANGED
@@ -28,7 +28,7 @@ If there is no icon, we have not uploaded the information on the model yet, feel
 
 ## How it works
 
-📈 We evaluate models on 4 key benchmarks using the <a href="https://github.com/EleutherAI/lm-evaluation-harness" target="_blank"> Eleuther AI Language Model Evaluation Harness </a>, a unified framework to test generative language models on a large number of different evaluation tasks.
+📈 We evaluate models on 7 key benchmarks using the <a href="https://github.com/EleutherAI/lm-evaluation-harness" target="_blank"> Eleuther AI Language Model Evaluation Harness </a>, a unified framework to test generative language models on a large number of different evaluation tasks.
 
 - <a href="https://arxiv.org/abs/1803.05457" target="_blank"> AI2 Reasoning Challenge </a> (25-shot) - a set of grade-school science questions.
 - <a href="https://arxiv.org/abs/1905.07830" target="_blank"> HellaSwag </a> (10-shot) - a test of commonsense inference, which is easy for humans (~95%) but challenging for SOTA models.
@@ -67,7 +67,7 @@ The tasks and few shots parameters are:
 Side note on the baseline scores:
 - for log-likelihood evaluation, we select the random baseline
 - for DROP, we select the best submission score according to [their leaderboard](https://leaderboard.allenai.org/drop/submissions/public) when the paper came out (NAQANet score)
-- for GSM8K, we select the score obtained in the paper after inetuning a 6B model on the full GSM8K training set for 50 epochs
+- for GSM8K, we select the score obtained in the paper after finetuning a 6B model on the full GSM8K training set for 50 epochs
 
 ## Quantization
 To get more information about quantization, see:
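
For context on the documentation edited above, here is a minimal sketch of scoring a model on two of the listed benchmarks, at their documented few-shot settings, with the EleutherAI lm-evaluation-harness. The `simple_evaluate` entry point and the `hf` backend follow harness v0.4.x (APIs and task names have shifted across versions), and the small model id is an arbitrary choice for illustration; this is not the leaderboard's actual evaluation pipeline.

```python
# Sketch, assuming lm-eval v0.4.x (pip install lm-eval); not the leaderboard's pipeline.
import lm_eval

# Few-shot counts as documented above: ARC 25-shot, HellaSwag 10-shot.
for task, n_shots in [("arc_challenge", 25), ("hellaswag", 10)]:
    results = lm_eval.simple_evaluate(
        model="hf",                                      # Hugging Face transformers backend
        model_args="pretrained=EleutherAI/pythia-160m",  # illustrative small model
        tasks=[task],
        num_fewshot=n_shots,
    )
    # Per-task metrics (e.g. accuracy / normalized accuracy) keyed by task name.
    print(task, results["results"][task])
```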