ybelkada committed
Commit e36f8ee
1 Parent(s): 8ea5bea

Update README.md

Files changed (1)
  1. README.md +0 -48
README.md CHANGED
@@ -194,54 +194,6 @@ The model training took roughly two months.
 
  # Evaluation
 
- ## Benchmarks
-
- We evaluate our model on all benchmarks of the new version of the leaderboard using the `lm-evaluation-harness` package, and then normalize the evaluation results with Hugging Face's score normalization.
-
-
- | `model name`                      |`IFEval`| `BBH` |`MATH LvL5`| `GPQA`| `MUSR`|`MMLU-PRO`|`Average`|
- |:----------------------------------|:------:|:-----:|:---------:|:-----:|:-----:|:--------:|:-------:|
- | ***Pure SSM models***             |        |       |           |       |       |          |         |
- | `FalconMamba-7B`                  | 33.36  | 19.88 | 3.63      | 8.05  | 10.86 | 14.47    |**15.04**|
- | `TRI-ML/mamba-7b-rw`<sup>*</sup>  | 22.46  | 6.71  | 0.45      | 1.12  | 5.51  | 1.69     | 6.25    |
- | ***Hybrid SSM-attention models*** |        |       |           |       |       |          |         |
- | `recurrentgemma-9b`               | 30.76  | 14.80 | 4.83      | 4.70  | 6.60  | 17.88    | 13.20   |
- | `Zyphra/Zamba-7B-v1`<sup>*</sup>  | 24.06  | 21.12 | 3.32      | 3.03  | 7.74  | 16.02    | 12.55   |
- | ***Transformer models***          |        |       |           |       |       |          |         |
- | `Falcon2-11B`                     | 32.61  | 21.94 | 2.34      | 2.80  | 7.53  | 15.44    | 13.78   |
- | `Meta-Llama-3-8B`                 | 14.55  | 24.50 | 3.25      | 7.38  | 6.24  | 24.55    | 13.41   |
- | `Meta-Llama-3.1-8B`               | 12.70  | 25.29 | 4.61      | 6.15  | 8.98  | 24.95    | 13.78   |
- | `Mistral-7B-v0.1`                 | 23.86  | 22.02 | 2.49      | 5.59  | 10.68 | 22.36    | 14.50   |
- | `Mistral-Nemo-Base-2407 (12B)`    | 16.83  | 29.37 | 4.98      | 5.82  | 6.52  | 27.46    | 15.08   |
- | `gemma-7B`                        | 26.59  | 21.12 | 6.42      | 4.92  | 10.98 | 21.64    |**15.28**|
- | ***RWKV models***                 |        |       |           |       |       |          |         |
- | `RWKV-v6-Finch-7B`<sup>*</sup>    | 27.65  | 9.04  | 1.11      | 2.81  | 2.25  | 5.85     | 8.12    |
- | `RWKV-v6-Finch-14B`<sup>*</sup>   | 29.81  | 12.89 | 1.13      | 5.01  | 3.16  | 11.3     | 10.55   |
-
- Also, we evaluate our model on the benchmarks of the first version of the leaderboard using `lighteval`.
-
-
- | `model name`                               | `ARC` |`HellaSwag`| `MMLU`|`Winogrande`|`TruthfulQA`|`GSM8K`|`Average`|
- |:-------------------------------------------|:-----:|:---------:|:-----:|:----------:|:----------:|:-----:|:-------:|
- | ***Pure SSM models***                      |       |           |       |            |            |       |         |
- | `FalconMamba-7B`<sup>*</sup>               | 62.03 | 80.82     | 62.11 | 73.64      | 53.42      | 52.54 |**64.09**|
- | `TRI-ML/mamba-7b-rw`<sup>*</sup>           | 51.25 | 80.85     | 33.41 | 71.11      | 32.08      | 4.70  | 45.52   |
- | ***Hybrid SSM-attention models***          |       |           |       |            |            |       |         |
- | `recurrentgemma-9b`<sup>**</sup>           | 52.00 | 80.40     | 60.50 | 73.60      | 38.60      | 42.60 | 57.95   |
- | `Zyphra/Zamba-7B-v1`<sup>*</sup>           | 56.14 | 82.23     | 58.11 | 79.87      | 52.88      | 30.78 | 60.00   |
- | ***Transformer models***                   |       |           |       |            |            |       |         |
- | `Falcon2-11B`                              | 59.73 | 82.91     | 58.37 | 78.30      | 52.56      | 53.83 |**64.28**|
- | `Meta-Llama-3-8B`                          | 60.24 | 82.23     | 66.70 | 78.45      | 42.93      | 45.19 | 62.62   |
- | `Meta-Llama-3.1-8B`                        | 58.53 | 82.13     | 66.43 | 74.35      | 44.29      | 47.92 | 62.28   |
- | `Mistral-7B-v0.1`                          | 59.98 | 83.31     | 64.16 | 78.37      | 42.15      | 37.83 | 60.97   |
- | `Mistral-Nemo-Base-2407 (12B)`<sup>*</sup> | 57.94 | 82.82     | 64.43 | 73.72      | 49.14      | 55.27 | 63.89   |
- | `gemma-7B`                                 | 61.09 | 82.20     | 64.56 | 79.01      | 44.79      | 50.87 | 63.75   |
- | ***RWKV models***                          |       |           |       |            |            |       |         |
- | `RWKV-v6-Finch-7B`<sup>*</sup>             | 43.86 | 75.19     | 41.69 | 68.27      | 42.19      | 19.64 | 48.47   |
- | `RWKV-v6-Finch-14B`<sup>*</sup>            | 47.44 | 78.86     | 52.33 | 71.27      | 45.45      | 38.06 | 55.57   |
-
- Most of the evaluation results above were taken directly from the two leaderboards. For the models marked with a *star* we evaluated the tasks internally, while for the models marked with two *stars* the results were taken from the corresponding paper or model card.
-
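
For reference, below is a minimal sketch of how such a leaderboard-style run could be launched with the `lm-evaluation-harness` package named above, assuming a version recent enough to ship the `leaderboard` task group. The model id, dtype, and batch size are illustrative assumptions rather than the authors' exact configuration; the Hugging Face score normalization is then applied to the raw scores.

```bash
# Illustrative sketch only: evaluate a Hugging Face checkpoint on the
# Open LLM Leaderboard v2 task group with lm-evaluation-harness.
pip install lm_eval

lm_eval --model hf \
    --model_args pretrained=tiiuae/falcon-mamba-7b,dtype=bfloat16 \
    --tasks leaderboard \
    --batch_size 8 \
    --output_path results/
```
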
  ## Throughput
 
  This model can achieve throughput and performance comparable to other transformer-based models that use optimized kernels such as Flash Attention 2. Make sure to install the optimized Mamba kernels with the following commands:
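
The commands themselves sit below the lines shown in this hunk; as a hedged sketch, the optimized kernels are typically pulled in via the `causal-conv1d` and `mamba-ssm` packages (the exact version pins are given in the full README):

```bash
# Sketch only: typical installation of the optimized Mamba kernels
# (requires a CUDA toolchain and a matching PyTorch install).
pip install causal-conv1d mamba-ssm
```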
 