Adding Evaluation Results

#3
Files changed (1)
  1. README.md +14 -1
README.md CHANGED
@@ -6,4 +6,17 @@ This model was the result of a 50/50 average weight merge between Airoboros-33B-
  After prolonged testing, we concluded that while this merge is highly flexible and capable of many different tasks, it has too much variation in how it answers to be reliable.
  Because of this, the model relies on some luck to get good results, and is therefore not recommended for people seeking a consistent experience, or for people sensitive to anticipation-based addictions.
 
- If you would like an improved version of this model that is more stable, check out my Airochronos-33B merge.
+ If you would like an improved version of this model that is more stable, check out my Airochronos-33B merge.
+ # [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
+ Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_Henk717__chronoboros-33B)
+
+ | Metric               | Value |
+ |----------------------|-------|
+ | Avg.                 | 51.45 |
+ | ARC (25-shot)        | 63.91 |
+ | HellaSwag (10-shot)  | 85.0  |
+ | MMLU (5-shot)        | 59.44 |
+ | TruthfulQA (0-shot)  | 49.83 |
+ | Winogrande (5-shot)  | 80.35 |
+ | GSM8K (5-shot)       | 15.01 |
+ | DROP (3-shot)        | 6.62  |
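
For readers who want more than the summary table added above, the linked details dataset can be pulled with the `datasets` library. This is a minimal sketch, not part of the PR itself; the per-task config names are discovered at runtime rather than hard-coded, since they are not listed in this card.

```python
# Minimal sketch: fetch the detailed leaderboard results linked above.
from datasets import get_dataset_config_names, load_dataset

repo = "open-llm-leaderboard/details_Henk717__chronoboros-33B"

# One config per evaluated task/run; inspect before choosing one.
configs = get_dataset_config_names(repo)
print(configs)

# Loading the first config returns a DatasetDict of its available splits.
details = load_dataset(repo, configs[0])
print(details)
```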
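The card's context above describes the model as a 50/50 average weight merge of Airoboros-33B and Chronos-33B. A merge of that kind can be sketched as an element-wise mean over matching parameter tensors; the snippet below is an illustration only, not the exact recipe used, and the repo IDs are hypothetical placeholders (the source checkpoint names are abbreviated in the card).

```python
# Illustrative sketch of a 50/50 average weight merge, assuming both
# checkpoints share an identical architecture and parameter names.
# Note: two 33B models in fp16 require substantial RAM.
import torch
from transformers import AutoModelForCausalLM

AIROBOROS = "example-org/Airoboros-33B"  # hypothetical repo ID
CHRONOS = "example-org/Chronos-33B"      # hypothetical repo ID

a = AutoModelForCausalLM.from_pretrained(AIROBOROS, torch_dtype=torch.float16)
b = AutoModelForCausalLM.from_pretrained(CHRONOS, torch_dtype=torch.float16)

sd_a = a.state_dict()
sd_b = b.state_dict()

# Element-wise mean of every matching tensor, computed in fp32 for accuracy,
# then cast back to the original dtype.
merged = {
    name: (sd_a[name].float() + sd_b[name].float()).mul_(0.5).to(sd_a[name].dtype)
    for name in sd_a
}

a.load_state_dict(merged)
a.save_pretrained("chronoboros-33B")
```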