Update README.md
README.md
@@ -104,7 +104,7 @@ The dataset is comprised of a mixture of open datasets large-scale datasets avai
 | ARC (25-shot) | 47.0 |
 | HellaSwag (10-shot) | 74.2 |
 | MMLU (5-shot) | 46.3 |
-| TruthfulQA (0-shot) | 46.
+| TruthfulQA (0-shot) | 46.5 |
 | Winogrande (5-shot) | 65.5 |
 | GSM8K (5-shot) | 42.3 |
 
@@ -112,7 +112,7 @@ The dataset is comprised of a mixture of open datasets large-scale datasets avai
 2. BigBench:
 
 - Average: 35.26
-- Details:
+- Details:
 
 | Task | Version | Metric | Value | Stderr |
 |-----------------------------------------------------|---------|-------------------------|-------|--------|
@@ -138,6 +138,46 @@ The dataset is comprised of a mixture of open datasets large-scale datasets avai
 | bigbench_tracking_shuffled_objects_seven_objects | 0 | multiple_choice_grade | 0.1856| 0.0110 |
 | bigbench_tracking_shuffled_objects_three_objects | 0 | multiple_choice_grade | 0.1269| 0.0080 |
 
+3. AGI:
+- Average: 33.23
+- Details:
+| Task |Version| Metric |Value | |Stderr|
+|------------------------------|------:|--------|-----:|---|-----:|
+|agieval_aqua_rat | 0|acc |0.2126|± |0.0257|
+| | |acc_norm|0.1890|± |0.0246|
+|agieval_gaokao_biology | 0|acc |0.2571|± |0.0302|
+| | |acc_norm|0.3143|± |0.0321|
+|agieval_gaokao_chemistry | 0|acc |0.2464|± |0.0300|
+| | |acc_norm|0.2899|± |0.0316|
+|agieval_gaokao_chinese | 0|acc |0.2927|± |0.0291|
+| | |acc_norm|0.3049|± |0.0294|
+|agieval_gaokao_english | 0|acc |0.6176|± |0.0278|
+| | |acc_norm|0.6438|± |0.0274|
+|agieval_gaokao_geography | 0|acc |0.3015|± |0.0326|
+| | |acc_norm|0.3065|± |0.0328|
+|agieval_gaokao_history | 0|acc |0.3106|± |0.0303|
+| | |acc_norm|0.3319|± |0.0308|
+|agieval_gaokao_mathqa | 0|acc |0.2650|± |0.0236|
+| | |acc_norm|0.2707|± |0.0237|
+|agieval_gaokao_physics | 0|acc |0.3450|± |0.0337|
+| | |acc_norm|0.3550|± |0.0339|
+|agieval_logiqa_en | 0|acc |0.2980|± |0.0179|
+| | |acc_norm|0.3195|± |0.0183|
+|agieval_logiqa_zh | 0|acc |0.2842|± |0.0177|
+| | |acc_norm|0.3318|± |0.0185|
+|agieval_lsat_ar | 0|acc |0.2000|± |0.0264|
+| | |acc_norm|0.2043|± |0.0266|
+|agieval_lsat_lr | 0|acc |0.3176|± |0.0206|
+| | |acc_norm|0.3275|± |0.0208|
+|agieval_lsat_rc | 0|acc |0.4312|± |0.0303|
+| | |acc_norm|0.4201|± |0.0301|
+|agieval_sat_en | 0|acc |0.6117|± |0.0340|
+| | |acc_norm|0.6117|± |0.0340|
+|agieval_sat_en_without_passage| 0|acc |0.3398|± |0.0331|
+| | |acc_norm|0.3495|± |0.0333|
+|agieval_sat_math | 0|acc |0.3182|± |0.0315|
+| | |acc_norm|0.2909|± |0.0307|
+
 ### Training Infrastructure
 
 * **Hardware**: `Stable Zephyr 3B` was trained on the Stability AI cluster across 8 nodes with 8 A100 80GBs GPUs for each nodes.
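The AGIEval block added in this change reports `Average: 33.23` without stating how it is computed. A minimal sketch, assuming the average is the unweighted mean of the 17 per-task `acc` values scaled to a percentage. The scores are copied from the AGIEval table in this diff. The dictionary names and the `average_percent` helper are hypothetical, introduced only for illustration.

```python
# Per-task AGIEval scores copied from the table added in this change.
# Assumption (not stated in the README): "Average: 33.23" is the plain,
# unweighted mean of the `acc` column, expressed as a percentage.
AGIEVAL_ACC = {
    "agieval_aqua_rat": 0.2126,
    "agieval_gaokao_biology": 0.2571,
    "agieval_gaokao_chemistry": 0.2464,
    "agieval_gaokao_chinese": 0.2927,
    "agieval_gaokao_english": 0.6176,
    "agieval_gaokao_geography": 0.3015,
    "agieval_gaokao_history": 0.3106,
    "agieval_gaokao_mathqa": 0.2650,
    "agieval_gaokao_physics": 0.3450,
    "agieval_logiqa_en": 0.2980,
    "agieval_logiqa_zh": 0.2842,
    "agieval_lsat_ar": 0.2000,
    "agieval_lsat_lr": 0.3176,
    "agieval_lsat_rc": 0.4312,
    "agieval_sat_en": 0.6117,
    "agieval_sat_en_without_passage": 0.3398,
    "agieval_sat_math": 0.3182,
}

# The `acc_norm` rows from the same table, for comparison.
AGIEVAL_ACC_NORM = {
    "agieval_aqua_rat": 0.1890,
    "agieval_gaokao_biology": 0.3143,
    "agieval_gaokao_chemistry": 0.2899,
    "agieval_gaokao_chinese": 0.3049,
    "agieval_gaokao_english": 0.6438,
    "agieval_gaokao_geography": 0.3065,
    "agieval_gaokao_history": 0.3319,
    "agieval_gaokao_mathqa": 0.2707,
    "agieval_gaokao_physics": 0.3550,
    "agieval_logiqa_en": 0.3195,
    "agieval_logiqa_zh": 0.3318,
    "agieval_lsat_ar": 0.2043,
    "agieval_lsat_lr": 0.3275,
    "agieval_lsat_rc": 0.4201,
    "agieval_sat_en": 0.6117,
    "agieval_sat_en_without_passage": 0.3495,
    "agieval_sat_math": 0.2909,
}


def average_percent(scores: dict[str, float]) -> float:
    """Return the unweighted mean of per-task scores, scaled to a percentage."""
    return 100 * sum(scores.values()) / len(scores)


if __name__ == "__main__":
    print(f"AGIEval average (acc):      {average_percent(AGIEVAL_ACC):.2f}")
    print(f"AGIEval average (acc_norm): {average_percent(AGIEVAL_ACC_NORM):.2f}")
```

Under this assumption the `acc` mean rounds to 33.23, matching the reported figure, while averaging the `acc_norm` rows instead gives about 34.48. The reported average therefore appears to use the plain `acc` metric.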