Update README.md
README.md (CHANGED)
@@ -59,12 +59,12 @@ The following are the scores from our own evaluation.
 | Metric                | Value |
 |-----------------------|-------|
 | ARC (25-shot)         | 67.32 |
-| HellaSwag (10-shot)   |
-| MMLU (5-shot)         |
+| HellaSwag (10-shot)   | 86.25 |
+| MMLU (5-shot)         | 70.72 |
 | TruthfulQA (0-shot)   | 54.17 |
 | Winogrande (5-shot)   | 80.72 |
-| GSM8k (5-shot)        |
-| **Avg.**              | **
+| GSM8k (5-shot)        | 25.09 (bad score. no clue why) |
+| **Avg.**              | **64.05** |

 We use [Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness) to run the benchmark tests above, using the same version as the HuggingFace LLM Leaderboard.
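For reference, the new **Avg.** row is the plain mean of the six benchmark scores. A quick sanity check in Python (hypothetical snippet, not part of the commit):

```python
# Verify the Avg. row: plain mean of the six benchmark scores from the table.
scores = {
    "ARC (25-shot)":        67.32,
    "HellaSwag (10-shot)":  86.25,
    "MMLU (5-shot)":        70.72,
    "TruthfulQA (0-shot)":  54.17,
    "Winogrande (5-shot)":  80.72,
    "GSM8k (5-shot)":       25.09,
}
avg = sum(scores.values()) / len(scores)
print(f"{avg:.3f}")  # 64.045, reported as 64.05 in the table
```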
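To reproduce numbers like these with the harness, a run could look like the sketch below. This uses the v0.4-style `lm_eval.simple_evaluate` Python API with task names and few-shot counts following the Open LLM Leaderboard convention; the model name is a placeholder, and exact task names and options may differ in the specific harness version the leaderboard pins.

```python
# Hypothetical sketch: evaluating a model on the six leaderboard tasks
# with lm-evaluation-harness (v0.4-style API). The leaderboard runs each
# task with its own few-shot count, so we loop one task at a time.
import lm_eval

TASKS = [  # (task name, num_fewshot)
    ("arc_challenge", 25),
    ("hellaswag", 10),
    ("mmlu", 5),
    ("truthfulqa_mc2", 0),
    ("winogrande", 5),
    ("gsm8k", 5),
]

for task, shots in TASKS:
    results = lm_eval.simple_evaluate(
        model="hf",  # HuggingFace transformers backend
        model_args="pretrained=your-org/your-model,dtype=bfloat16",  # placeholder model
        tasks=[task],
        num_fewshot=shots,
        batch_size="auto",
    )
    print(task, results["results"][task])
```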