Linsong-C committed on
Commit
0198d34
·
verified ·
1 Parent(s): ca1cc08

Update README.md

Files changed (1)
  1. README.md +133 -169
README.md CHANGED
@@ -8,9 +8,9 @@ license: apache-2.0
8
  # Model Card for Bamba 9B
9
  We introduce Bamba-9B, a decoder-only language model that is based on the [Mamba-2](https://github.com/state-spaces/mamba) architecture and designed to handle a wide range of text generation tasks. It is trained from scratch using a two-stage training approach. In the first stage, the model is trained on 2 trillion tokens from the Dolma v1.7 dataset. In the second stage, it undergoes additional training on 200 billion tokens, leveraging a carefully curated blend of high-quality data to further refine its performance and enhance output quality.
10
 
11
- | Model | Params | # Layers | Hidden Dim. | Attention Heads | GQA | KV Heads | Context Length | Tied Embeddings |
12
- |-------------------|--------------|----------|-------------|-----------------|-----|----------|----------------|------------------|
13
- | Bamba | 9B (9.78B) | 32 | 4096 | 32 | Yes | 8 | 4096 | False |
14
 
15
 
16
  The current release includes the following models:
@@ -69,168 +69,134 @@ contributed [HF-version of Mamba2-Hybrid]() (TODO: add link once live).
69
  ### Base pretrained models
70
 
71
  <table>
72
- <tr>
73
- <td><strong>Category</strong>
74
- </td>
75
- <td><strong>Benchmark</strong>
76
- </td>
77
- <td><strong>Setting</strong></td>
78
- <td><strong>Metric</strong></td>
79
- <td><strong>Bamba 9B (2.2T)</strong>
80
- </td>
81
- </tr>
82
- <tr>
83
- <td rowspan="8" >General
84
- </td>
85
- <td>MMLU
86
- </td>
87
- <td>5-shot</td>
88
- <td>Accuracy</td>
89
- <td>60.77
90
- </td>
91
- </tr>
92
- <tr>
93
- <td>ARC-C
94
- </td>
95
- <td>25-shot</td>
96
- <td>Accuracy normalized</td>
97
- <td>63.23
98
- </td>
99
- </tr>
100
- <tr>
101
- <td>GSM8K
102
- </td>
103
- <td>5-shot</td>
104
- <td>exact match</td>
105
- <td>36.77
106
- </td>
107
- </tr>
108
- <tr>
109
- <td>Hellaswag
110
- </td>
111
- <td>10-shot</td>
112
- <td>Accuracy normalized</td>
113
- <td>81.8
114
- </td>
115
- </tr>
116
- <tr>
117
- <td>OpenbookQA
118
- </td>
119
- <td>5-shot</td>
120
- <td>Accuracy normalized</td>
121
- <td>47.6
122
- </td>
123
- </tr>
124
- <tr>
125
- <td>Piqa
126
- </td>
127
- <td>5-shot</td>
128
- <td>Accuracy normalized</td>
129
- <td>82.26
130
- </td>
131
- </tr>
132
- <tr>
133
- <td>TruthfulQA
134
- </td>
135
- <td>0-shot</td>
136
- <td>Accuracy</td>
137
- <td>49.21
138
- </td>
139
- </tr>
140
- <tr>
141
- <td>Winogrande
142
- </td>
143
- <td>5-shot</td>
144
- <td>Accuracy</td>
145
- <td>76.87
146
- </td>
147
- </tr>
148
- <tr>
149
- <td rowspan="6" >HF LLM- V2
150
- </td>
151
- <td>MMLU-PRO
152
- </td>
153
- <td>5-shot</td>
154
- <td>Accuracy</td>
155
- <td>17.53
156
- </td>
157
- </tr>
158
- <tr>
159
- <td>BBH
160
- </td>
161
- <td>3-shot</td>
162
- <td>Accuracy normalized</td>
163
- <td>17.4
164
- </td>
165
- </tr>
166
- <tr>
167
- <td>GPQA
168
- </td>
169
- <td>0-shot</td>
170
- <td>Accuracy normalized</td>
171
- <td>4.14
172
- </td>
173
- </tr>
174
- <tr>
175
- <td>IFEval
176
- </td>
177
- <td>0-shot</td>
178
- <td>inst_level_strict_acc + prompt_level_strict_acc</td>
179
- <td>15.16
180
- </td>
181
- </tr>
182
- <tr>
183
- <td>MATH Lvl 5
184
- </td>
185
- <td>4-shot</td>
186
- <td>Exact match</td>
187
- <td>1.66
188
- </td>
189
- </tr>
190
- <tr>
191
- <td>MuSR
192
- </td>
193
- <td>0-shot</td>
194
- <td>Accuracy normalized</td>
195
- <td>9.59
196
- </td>
197
- </tr>
198
- <tr>
199
- <td rowspan="4" >Safety Tasks
200
- </td>
201
- <td>PopQA
202
- </td>
203
- <td>5-shot, generation</td>
204
- <td>Accuracy</td>
205
- <td>20.5
206
- </td>
207
- </tr>
208
- <tr>
209
- <td>Toxigen
210
- </td>
211
- <td>5-shot, logits</td>
212
- <td>Accuracy</td>
213
- <td>57.4
214
- </td>
215
- </tr>
216
- <tr>
217
- <td>BBQ
218
- </td>
219
- <td>5-shot, generation</td>
220
- <td>Accuracy</td>
221
- <td>44.2
222
- </td>
223
- </tr>
224
- <tr>
225
- <td>Crows-pairs_english
226
- </td>
227
- <td>5-shot, generation</td>
228
- <td>pct_stereotype (lower is better)</td>
229
- <td>70.78
230
- </td>
231
- </tr>
232
  </table>
233
 
234
 
235
  ## Fine-tuning
236
 
@@ -247,15 +213,13 @@ python -m fms_mo.run_quant \
247
  --output_dir <"path_to_save_new_model">
248
  ```
249
  Model size comparison before and after FP8:
250
- ||original|quantized |
251
- |:----:|----:|----:|
252
- |memory (total)|39.12 GB|10.83 GB|
253
- |memory (break-down)|`torch.float32` 39.12 GB|`torch.bfloat16` 2.10 GB<br>`torch.float8_e4m3fn` 8.73 GB|
254
 
255
  More details about `fms-model-optimizer` can be found [here](https://github.com/foundation-model-stack/fms-model-optimizer/tree/main/examples/FP8_QUANT#quickstart).
256
 
257
- ## Evaluation
258
-
259
 
260
  ## Llama.cpp
261
  There is preliminary work to enable running Bamba architecture models using [llama.cpp](https://github.com/ggerganov/llama.cpp). This is work-in-progress, so should only be used as a guide for the adventurous!
 
8
  # Model Card for Bamba 9B
9
  We introduce Bamba-9B, a decoder-only language model that is based on the [Mamba-2](https://github.com/state-spaces/mamba) architecture and designed to handle a wide range of text generation tasks. It is trained from scratch using a two-stage training approach. In the first stage, the model is trained on 2 trillion tokens from the Dolma v1.7 dataset. In the second stage, it undergoes additional training on 200 billion tokens, leveraging a carefully curated blend of high-quality data to further refine its performance and enhance output quality.
10
 
11
+ | Model | Params | # Layers | Hidden Dim. | Attention Heads | GQA | KV Heads | Context Length | Tied Embeddings |
12
+ | ----- | ---------- | -------- | ----------- | --------------- | ---- | -------- | -------------- | --------------- |
13
+ | Bamba | 9B (9.78B) | 32 | 4096 | 32 | Yes | 8 | 4096 | False |
14
 
15
 
16
  The current release includes the following models:
 
69
  ### Base pretrained models
70
 
71
  <table>
72
+ <tr>
73
+ <td><strong>Category</strong>
74
+ </td>
75
+ <td><strong>Benchmark</strong>
76
+ </td>
77
+ <td><strong>Bamba 9B (2.2T)</strong>
78
+ </td>
79
+ </tr>
80
+ <tr>
81
+ <td rowspan="8" >General
82
+ </td>
83
+ <td>MMLU (5-shot)
84
+ </td>
85
+ <td>60.77
86
+ </td>
87
+ </tr>
88
+ <tr>
89
+ <td>ARC-C (25-shot)
90
+ </td>
91
+ <td>63.23
92
+ </td>
93
+ </tr>
94
+ <tr>
95
+ <td>GSM8K (5-shot)
96
+ </td>
97
+ <td>36.77
98
+ </td>
99
+ </tr>
100
+ <tr>
101
+ <td>Hellaswag (10-shot)
102
+ </td>
103
+ <td>81.8
104
+ </td>
105
+ </tr>
106
+ <tr>
107
+ <td>OpenbookQA (5-shot)
108
+ </td>
109
+ <td>47.6
110
+ </td>
111
+ </tr>
112
+ <tr>
113
+ <td>Piqa (5-shot)
114
+ </td>
115
+ <td>82.26
116
+ </td>
117
+ </tr>
118
+ <tr>
119
+ <td>TruthfulQA (0-shot)
120
+ </td>
121
+ <td>49.21
122
+ </td>
123
+ </tr>
124
+ <tr>
125
+ <td>Winogrande (5-shot)
126
+ </td>
127
+ <td>76.87
128
+ </td>
129
+ </tr>
130
+ <tr>
131
+ <td rowspan="6" >HF OpenLLM- V2*
132
+ </td>
133
+ <td>MMLU-PRO (5-shot)
134
+ </td>
135
+ <td>17.53
136
+ </td>
137
+ </tr>
138
+ <tr>
139
+ <td>BBH (3-shot)
140
+ </td>
141
+ <td>17.4
142
+ </td>
143
+ </tr>
144
+ <tr>
145
+ <td>GPQA (0-shot)
146
+ </td>
147
+ <td>4.14
148
+ </td>
149
+ </tr>
150
+ <tr>
151
+ <td>IFEval (0-shot)
152
+ </td>
153
+ <td>15.16
154
+ </td>
155
+ </tr>
156
+ <tr>
157
+ <td>MATH Lvl 5 (4-shot)
158
+ </td>
159
+ <td>1.66
160
+ </td>
161
+ </tr>
162
+ <tr>
163
+ <td>MuSR (0-shot)
164
+ </td>
165
+ <td>9.59
166
+ </td>
167
+ </tr>
168
+ <tr>
169
+ <td rowspan="4" >Safety Tasks
170
+ </td>
171
+ <td>PopQA (5-shot)
172
+ </td>
173
+ <td>20.5
174
+ </td>
175
+ </tr>
176
+ <tr>
177
+ <td>Toxigen (5-shot)
178
+ </td>
179
+ <td>57.4
180
+ </td>
181
+ </tr>
182
+ <tr>
183
+ <td>BBQ (5-shot)
184
+ </td>
185
+ <td>44.2
186
+ </td>
187
+ </tr>
188
+ <tr>
189
+ <td>Crows-pairs english (5-shot)
190
+ </td>
191
+ <td>70.78
192
+ </td>
193
+ </tr>
194
  </table>
195
 
196
+ *For the v2 leaderboard results, we perform [normalization](https://huggingface.co/docs/leaderboards/open_llm_leaderboard/normalization) and report the normalized results.
197
+ Further details on our evaluation and normalization, along with run and analysis scripts, can be found [here](https://github.com/foundation-model-stack/bamba/blob/main/evaluation/README.md).
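
The normalization mentioned in the footnote can be illustrated with a minimal sketch. It assumes the scheme described in the linked leaderboard documentation: a raw score is rescaled so that the random-guess baseline maps to 0 and a perfect score maps to 100, and scores below the baseline are clipped to 0. The function name, the 25% baseline, and the raw scores below are hypothetical values chosen for illustration; they are not Bamba results and this is not the project's evaluation script.

```python
def normalize_within_range(raw_score: float,
                           lower_bound: float,
                           higher_bound: float = 100.0) -> float:
    """Rescale a raw score (in percent) so lower_bound -> 0 and higher_bound -> 100.

    Scores below the baseline are clipped to 0 (assumed behaviour, per the
    leaderboard documentation linked in the footnote).
    """
    clipped = max(raw_score - lower_bound, 0.0)
    return clipped / (higher_bound - lower_bound) * 100.0


# Hypothetical example: a 4-option multiple-choice task has a 25% random baseline.
baseline = 25.0
for raw in (20.0, 30.0, 100.0):
    print(f"raw={raw:5.1f} -> normalized={normalize_within_range(raw, baseline):.2f}")
```

Under this scheme a raw accuracy only slightly above chance yields a small normalized score, which is one reason normalized v2 numbers can look low next to raw accuracies.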
198
+
199
+
200
 
201
  ## Fine-tuning
202
 
 
213
  --output_dir <"path_to_save_new_model">
214
  ```
215
  Model size comparison before and after FP8:
216
+ | | original | quantized |
217
+ | :-----------------: | -----------------------: | -----------------------------------------------------------: |
218
+ | memory (total) | 39.12 GB | 10.83 GB |
219
+ | memory (break-down) | `torch.float32` 39.12 GB | `torch.bfloat16` 2.10 GB<br>`torch.float8_e4m3fn` 8.73 GB |
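
As a sanity check on the table above, the sizes are consistent with simple bytes-per-parameter arithmetic. The sketch below assumes decimal gigabytes (1 GB = 1e9 bytes) and uses the 9.78B parameter count from the model table; the helper names are hypothetical and this is not part of `fms-model-optimizer`.

```python
# Bytes per parameter for each dtype that appears in the table.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float8_e4m3fn": 1}
GB = 1e9  # assumption: the table reports decimal gigabytes


def size_gb(num_params: float, dtype: str) -> float:
    """Memory needed to store num_params values of the given dtype, in GB."""
    return num_params * BYTES_PER_PARAM[dtype] / GB


def params_from_gb(size: float, dtype: str) -> float:
    """Inverse of size_gb: how many parameters fit in `size` GB at this dtype."""
    return size * GB / BYTES_PER_PARAM[dtype]


total_params = 9.78e9  # from the model table: 9B (9.78B)

# Original checkpoint: every parameter stored as float32.
fp32_gb = size_gb(total_params, "float32")

# Quantized checkpoint: split reported in the break-down row.
bf16_params = params_from_gb(2.10, "bfloat16")
fp8_params = params_from_gb(8.73, "float8_e4m3fn")
quantized_gb = size_gb(bf16_params, "bfloat16") + size_gb(fp8_params, "float8_e4m3fn")

print(f"float32 total:      {fp32_gb:.2f} GB")
print(f"quantized total:    {quantized_gb:.2f} GB")
print(f"params kept bf16:   {bf16_params / 1e9:.2f}B")
print(f"params cast to FP8: {fp8_params / 1e9:.2f}B")
```

Read this way, the break-down implies that roughly 8.73B parameters are stored in FP8 and about 1.05B remain in bfloat16, which together account for the full 9.78B.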
220
 
221
  More details about `fms-model-optimizer` can be found [here](https://github.com/foundation-model-stack/fms-model-optimizer/tree/main/examples/FP8_QUANT#quickstart).
222
 
223
 
224
  ## Llama.cpp
225
  There is preliminary work to enable running Bamba architecture models using [llama.cpp](https://github.com/ggerganov/llama.cpp). This is work-in-progress, so should only be used as a guide for the adventurous!