Update README.md
README.md (CHANGED)
@@ -59,12 +59,12 @@ The following are the scores from our own evaluation.
 | Metric                | Value |
 |-----------------------|-------|
 | ARC (25-shot)         | 67.32 |
-| HellaSwag (10-shot)   |
-| MMLU (5-shot)         |
+| HellaSwag (10-shot)   | 86.25 |
+| MMLU (5-shot)         | 70.72 |
 | TruthfulQA (0-shot)   | 54.17 |
 | Winogrande (5-shot)   | 80.72 |
-| GSM8k (5-shot)        |
-| **Avg.**              | **
+| GSM8k (5-shot)        | 25.09 (bad score. no clue why) |
+| **Avg.**              | **64.05** |

 We use [Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness) to run the benchmark tests above, using the same version as the HuggingFace LLM Leaderboard.
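For reference, the new **Avg.** row is the plain mean of the six benchmark scores. A quick sanity check in Python (hypothetical snippet, not part of the commit):

```python
# Verify the Avg. row: plain mean of the six benchmark scores from the table.
scores = {
    "ARC (25-shot)":        67.32,
    "HellaSwag (10-shot)":  86.25,
    "MMLU (5-shot)":        70.72,
    "TruthfulQA (0-shot)":  54.17,
    "Winogrande (5-shot)":  80.72,
    "GSM8k (5-shot)":       25.09,
}
avg = sum(scores.values()) / len(scores)
print(f"{avg:.3f}")  # 64.045, reported as 64.05 in the table
```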
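To reproduce numbers like these with the harness, a run could look like the sketch below. This uses the v0.4-style `lm_eval.simple_evaluate` Python API with task names and few-shot counts following the Open LLM Leaderboard convention; the model name is a placeholder, and exact task names and options may differ in the specific harness version the leaderboard pins.

```python
# Hypothetical sketch: evaluating a model on the six leaderboard tasks
# with lm-evaluation-harness (v0.4-style API). The leaderboard runs each
# task with its own few-shot count, so we loop one task at a time.
import lm_eval

TASKS = [  # (task name, num_fewshot)
    ("arc_challenge", 25),
    ("hellaswag", 10),
    ("mmlu", 5),
    ("truthfulqa_mc2", 0),
    ("winogrande", 5),
    ("gsm8k", 5),
]

for task, shots in TASKS:
    results = lm_eval.simple_evaluate(
        model="hf",  # HuggingFace transformers backend
        model_args="pretrained=your-org/your-model,dtype=bfloat16",  # placeholder model
        tasks=[task],
        num_fewshot=shots,
        batch_size="auto",
    )
    print(task, results["results"][task])
```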