Update README.md
README.md
CHANGED
@@ -135,8 +135,25 @@ The model was trained AdamW optimizer, WSD (warmup-stable-decay) learning rate s

 # Evaluation

-
-
-Ilyas
+## Benchmarks

+We evaluate our model on all benchmarks of the leaderboard's version 2 using the `lm-evaluation-harness` package, and on the benchmarks of version 1 using `lighteval`.
+
+| model_name                   | IFEval | BBH   | MATH Lvl 5 | GPQA  | MUSR  | MMLU-PRO | **Average L2** | ARC   | HellaSwag | MMLU  | Winogrande | TruthfulQA | GSM8K | **Average L1** |
+|------------------------------|--------|-------|------------|-------|-------|----------|----------------|-------|-----------|-------|------------|------------|-------|----------------|
+| `meta-llama/Meta-Llama-3-8B` | 14.55  | 24.50 | 3.25       | 7.38  | 6.24  | 24.55    | 13.41          | 60.24 | 82.23     | 66.70 | 78.45      | 42.93      | 45.19 | 62.62          |
+| `tiiuae/falcon2-11B`         | 32.61  | 21.94 | 2.34       | 2.80  | 7.53  | 15.44    | 13.78          | 59.73 | 82.91     | 58.37 | 78.30      | 52.56      | 53.83 | **64.28**      |
+| `mistralai/Mistral-7B-v0.1`  | 23.86  | 22.02 | 2.49       | 5.59  | 10.68 | 22.36    | 14.50          | 59.98 | 83.31     | 64.16 | 78.37      | 42.15      | 37.83 | 60.97          |
+| `Zyphra/Zamba-7B-v1`         | -      | -     | -          | -     | -     | -        | -              | 46.48 | 80.24     | 57.72 | 76.40      | -          | -     | -              |
+| Ours                         | 32.16  | 21.07 | 4.08       | 10.18 | 6.97  | 13.43    | **14.65**      | 61.69 | 80.63     | 61.05 | 74.03      | 53.60      | 51.86 | 63.81          |
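The two average columns in the table above can be cross-checked by hand. The sketch below assumes that "Average L2" and "Average L1" are plain unweighted means of the six version 2 and six version 1 scores respectively; the README does not state this, but the reported numbers are consistent with it. The scores are copied from the table, and the names `SCORES` and `recompute_averages` are illustrative only. The Zamba row is left out because it has no complete set of scores.

```python
# Scores copied from the benchmark table: (six v2 scores, reported v2 average,
# six v1 scores, reported v1 average). Assumption: averages are unweighted means.
SCORES = {
    "meta-llama/Meta-Llama-3-8B": (
        [14.55, 24.50, 3.25, 7.38, 6.24, 24.55], 13.41,
        [60.24, 82.23, 66.70, 78.45, 42.93, 45.19], 62.62,
    ),
    "tiiuae/falcon2-11B": (
        [32.61, 21.94, 2.34, 2.80, 7.53, 15.44], 13.78,
        [59.73, 82.91, 58.37, 78.30, 52.56, 53.83], 64.28,
    ),
    "mistralai/Mistral-7B-v0.1": (
        [23.86, 22.02, 2.49, 5.59, 10.68, 22.36], 14.50,
        [59.98, 83.31, 64.16, 78.37, 42.15, 37.83], 60.97,
    ),
    "Ours": (
        [32.16, 21.07, 4.08, 10.18, 6.97, 13.43], 14.65,
        [61.69, 80.63, 61.05, 74.03, 53.60, 51.86], 63.81,
    ),
}


def mean2(values):
    """Unweighted mean rounded to two decimals, matching the table's precision."""
    return round(sum(values) / len(values), 2)


def recompute_averages(scores=SCORES):
    """Return {model: (recomputed v2 average, recomputed v1 average)}."""
    return {
        model: (mean2(v2), mean2(v1))
        for model, (v2, _reported_v2, v1, _reported_v1) in scores.items()
    }


for model, (avg_v2, avg_v1) in recompute_averages().items():
    _, reported_v2, _, reported_v1 = SCORES[model]
    # Flag any row whose recomputed mean differs from the reported value.
    status = "ok" if (abs(avg_v2 - reported_v2) < 1e-9
                      and abs(avg_v1 - reported_v1) < 1e-9) else "MISMATCH"
    print(f"{model}: v2={avg_v2:.2f} v1={avg_v1:.2f} [{status}]")
```

Under that assumption, every recomputed mean matches the reported average to two decimals.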
+
+## Throughput
+
+This model can achieve throughput and performance comparable to transformer-based models that use optimized kernels such as Flash Attention 2. Make sure to install the optimized Mamba kernels with the following command:
+
+```bash
+pip install "causal-conv1d>=1.4.0" mamba-ssm
+```
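After running the install command, it can help to confirm that both kernel packages are visible to the Python environment that will load the model. The sketch below is a minimal check, not part of the README: it assumes the packages are importable as `causal_conv1d` and `mamba_ssm`, and the helper name `missing_kernels` is hypothetical. It only checks that the modules can be found; it does not verify the installed version or that the CUDA extensions load correctly.

```python
import importlib.util

# Assumed import names for the pip packages "causal-conv1d" and "mamba-ssm".
KERNEL_MODULES = ("causal_conv1d", "mamba_ssm")


def missing_kernels(modules=KERNEL_MODULES):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = missing_kernels()
if missing:
    print("Optimized kernels not found:", ", ".join(missing))
else:
    print("Optimized Mamba kernels appear to be installed.")
```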
+
+Refer to our technical report for more details about performance evaluation.
