Update README.md
@@ -194,54 +194,6 @@ The model training took roughly two months.
# Evaluation

## Benchmarks

We evaluate our model on all benchmarks of the new version of the leaderboard using the `lm-evaluation-harness` package, and then normalize the evaluation results with HuggingFace score normalization.
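As a rough illustration of the normalization idea (the exact per-task baselines and bounds are defined by the leaderboard, so treat this as a simplified sketch, not the leaderboard's implementation): each raw score is rescaled so the random-guessing baseline maps to 0 and a perfect score maps to 100.

```python
def normalize(raw_score: float, random_baseline: float, max_score: float = 1.0) -> float:
    """Rescale a raw accuracy so the random baseline maps to 0 and a
    perfect score maps to 100; scores below the baseline clip to 0."""
    if raw_score <= random_baseline:
        return 0.0
    return (raw_score - random_baseline) / (max_score - random_baseline) * 100.0

# e.g. a 4-way multiple-choice task has a random baseline of 0.25:
print(normalize(0.625, random_baseline=0.25))  # → 50.0
```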
| `model name` | `IFEval` | `BBH` | `MATH LvL5` | `GPQA` | `MUSR` | `MMLU-PRO` | `Average` |
|:--------------------------|:------:|:-----:|:---------:|:-----:|:-----:|:--------:|:-------:|
| ***Pure SSM models*** | | | | | | | |
| `FalconMamba-7B` | 33.36 | 19.88 | 3.63 | 8.05 | 10.86 | 14.47 | **15.04** |
| `TRI-ML/mamba-7b-rw`<sup>*</sup> | 22.46 | 6.71 | 0.45 | 1.12 | 5.51 | 1.69 | 6.25 |
| ***Hybrid SSM-attention models*** | | | | | | | |
| `recurrentgemma-9b` | 30.76 | 14.80 | 4.83 | 4.70 | 6.60 | 17.88 | 13.20 |
| `Zyphra/Zamba-7B-v1`<sup>*</sup> | 24.06 | 21.12 | 3.32 | 3.03 | 7.74 | 16.02 | 12.55 |
| ***Transformer models*** | | | | | | | |
| `Falcon2-11B` | 32.61 | 21.94 | 2.34 | 2.80 | 7.53 | 15.44 | 13.78 |
| `Meta-Llama-3-8B` | 14.55 | 24.50 | 3.25 | 7.38 | 6.24 | 24.55 | 13.41 |
| `Meta-Llama-3.1-8B` | 12.70 | 25.29 | 4.61 | 6.15 | 8.98 | 24.95 | 13.78 |
| `Mistral-7B-v0.1` | 23.86 | 22.02 | 2.49 | 5.59 | 10.68 | 22.36 | 14.50 |
| `Mistral-Nemo-Base-2407 (12B)` | 16.83 | 29.37 | 4.98 | 5.82 | 6.52 | 27.46 | 15.08 |
| `gemma-7B` | 26.59 | 21.12 | 6.42 | 4.92 | 10.98 | 21.64 | **15.28** |
| ***RWKV models*** | | | | | | | |
| `RWKV-v6-Finch-7B`<sup>*</sup> | 27.65 | 9.04 | 1.11 | 2.81 | 2.25 | 5.85 | 8.12 |
| `RWKV-v6-Finch-14B`<sup>*</sup> | 29.81 | 12.89 | 1.13 | 5.01 | 3.16 | 11.30 | 10.55 |
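The `Average` column appears to be the plain arithmetic mean of the six normalized benchmark scores; checking the `FalconMamba-7B` row against the table above:

```python
# Mean of the six normalized benchmark scores for FalconMamba-7B
# (IFEval, BBH, MATH LvL5, GPQA, MUSR, MMLU-PRO):
scores = [33.36, 19.88, 3.63, 8.05, 10.86, 14.47]
average = sum(scores) / len(scores)
print(round(average, 2))  # → 15.04, matching the Average column
```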
We also evaluate our model on the benchmarks of the first version of the leaderboard using `lighteval`.
| `model name` | `ARC` | `HellaSwag` | `MMLU` | `Winogrande` | `TruthfulQA` | `GSM8K` | `Average` |
|:-----------------------------|:------:|:---------:|:-----:|:----------:|:----------:|:-----:|:----------------:|
| ***Pure SSM models*** | | | | | | | |
| `FalconMamba-7B`<sup>*</sup> | 62.03 | 80.82 | 62.11 | 73.64 | 53.42 | 52.54 | **64.09** |
| `TRI-ML/mamba-7b-rw`<sup>*</sup> | 51.25 | 80.85 | 33.41 | 71.11 | 32.08 | 4.70 | 45.52 |
| ***Hybrid SSM-attention models*** | | | | | | | |
| `recurrentgemma-9b`<sup>**</sup> | 52.00 | 80.40 | 60.50 | 73.60 | 38.60 | 42.60 | 57.95 |
| `Zyphra/Zamba-7B-v1`<sup>*</sup> | 56.14 | 82.23 | 58.11 | 79.87 | 52.88 | 30.78 | 60.00 |
| ***Transformer models*** | | | | | | | |
| `Falcon2-11B` | 59.73 | 82.91 | 58.37 | 78.30 | 52.56 | 53.83 | **64.28** |
| `Meta-Llama-3-8B` | 60.24 | 82.23 | 66.70 | 78.45 | 42.93 | 45.19 | 62.62 |
| `Meta-Llama-3.1-8B` | 58.53 | 82.13 | 66.43 | 74.35 | 44.29 | 47.92 | 62.28 |
| `Mistral-7B-v0.1` | 59.98 | 83.31 | 64.16 | 78.37 | 42.15 | 37.83 | 60.97 |
| `Mistral-Nemo-Base-2407 (12B)`<sup>*</sup> | 57.94 | 82.82 | 64.43 | 73.72 | 49.14 | 55.27 | 63.89 |
| `gemma-7B` | 61.09 | 82.20 | 64.56 | 79.01 | 44.79 | 50.87 | 63.75 |
| ***RWKV models*** | | | | | | | |
| `RWKV-v6-Finch-7B`<sup>*</sup> | 43.86 | 75.19 | 41.69 | 68.27 | 42.19 | 19.64 | 48.47 |
| `RWKV-v6-Finch-14B`<sup>*</sup> | 47.44 | 78.86 | 52.33 | 71.27 | 45.45 | 38.06 | 55.57 |
Most evaluation results were taken directly from the two leaderboards. For models marked with one star (<sup>*</sup>) we ran the evaluations internally, while for models marked with two stars (<sup>**</sup>) the results were taken from the corresponding paper or model card.
## Throughput

This model achieves throughput and performance comparable to other transformer-based models that use optimized kernels such as Flash Attention 2. Make sure to install the optimized Mamba kernels with the following commands: