bjoernp committed
Commit 8d1fa90 · 1 Parent(s): 1f1a677

Update README.md

Files changed (1)
  1. README.md +4 -4
README.md CHANGED
@@ -59,12 +59,12 @@ The following are the scores from our own evaluation.
  | Metric | Value |
  |-----------------------|-------|
  | ARC (25-shot) | 67.32 |
- | HellaSwag (10-shot) | xx |
- | MMLU (5-shot) | xx |
+ | HellaSwag (10-shot) | 86.25 |
+ | MMLU (5-shot) | 70.72 |
  | TruthfulQA (0-shot) | 54.17 |
  | Winogrande (5-shot) | 80.72 |
- | GSM8k (5-shot) | xx |
- | **Avg.** | **xx** |
+ | GSM8k (5-shot) | 25.09 (unexpectedly low; cause unknown) |
+ | **Avg.** | **64.05** |


  We use [Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness) to run the benchmark tests above, using the same version as the HuggingFace LLM Leaderboard.
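
As a pointer for reproducing these numbers, here is a minimal sketch using the harness's Python API. The model id is a placeholder, and the exact `simple_evaluate` arguments differ between harness versions, so adjust to the version pinned by the leaderboard.

```python
# Minimal sketch (not the exact leaderboard setup): scoring one task
# from the table above with the lm-evaluation-harness Python API.
# The model id is a placeholder, and simple_evaluate()'s signature
# varies between harness versions.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf",                             # Hugging Face model backend
    model_args="pretrained=<model-id>",     # placeholder: the model under test
    tasks=["arc_challenge"],                # ARC, run 25-shot as in the table
    num_fewshot=25,
)
print(results["results"]["arc_challenge"])  # per-task metrics dict
```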