---
license: apache-2.0
library_name: transformers
tags:
- bamba
---

## Model Details
<p align="center">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/64b6c638ac6d20bae0b93219/GOzs8o4G1apceun92ZC4d.png" alt="Bamba" width="400" height="400">
</p>

# Model Card for Bamba 9B
We introduce Bamba-9B, a decoder-only language model based on the [Mamba-2](https://github.com/state-spaces/mamba) architecture and designed to handle a wide range of text generation tasks. It is trained from scratch using a two-stage approach. In the first stage, the model is trained on 2 trillion tokens from the Dolma v1.7 dataset. In the second stage, it is trained on an additional 200 billion tokens drawn from a carefully curated blend of high-quality data to further refine its performance and enhance output quality.

| Model | Params     | # Layers | Hidden Dim. | Attention Heads | GQA  | KV Heads | Context Length | Tied Embeddings |
| ----- | ---------- | -------- | ----------- | --------------- | ---- | -------- | -------------- | --------------- |
| Bamba | 9B (9.78B) | 32       | 4096        | 32              | Yes  | 8        | 4096           | False           |

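As a rough illustration of what the GQA and KV Heads columns imply, the sketch below derives the per-head dimension and the KV-cache footprint of a single attention layer from the table values. It assumes that the head dimension equals the hidden dimension divided by the number of attention heads and that the cache is stored in 2-byte (bfloat16) values; neither is stated in this card, and the helper name `kv_cache_bytes_per_token` is hypothetical.

```python
# Values taken from the model table above.
HIDDEN_DIM = 4096
ATTENTION_HEADS = 32
KV_HEADS = 8


def kv_cache_bytes_per_token(kv_heads: int, head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for one token in one attention layer."""
    # Factor 2: one key tensor and one value tensor per KV head.
    return 2 * kv_heads * head_dim * bytes_per_value


# Assumption: head_dim = hidden_dim / attention_heads (not stated in the card).
head_dim = HIDDEN_DIM // ATTENTION_HEADS
queries_per_kv_head = ATTENTION_HEADS // KV_HEADS

gqa_bytes = kv_cache_bytes_per_token(KV_HEADS, head_dim)
mha_bytes = kv_cache_bytes_per_token(ATTENTION_HEADS, head_dim)

print(f"head_dim={head_dim}, query heads per KV head={queries_per_kv_head}")
print(f"per-layer KV cache per token: GQA {gqa_bytes} B vs. full multi-head {mha_bytes} B")
```

Under these assumptions, sharing each KV head across four query heads shrinks the per-layer cache by the same factor of four. The total cache also depends on how many layers use attention, which this card does not state.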

The current release includes the following models:
| **Stage**            | **Bamba 9B**                                                         | **Quantized**                                                           | **Note**                                                          |
|----------------------|----------------------------------------------------------------------|-------------------------------------------------------------------------|-------------------------------------------------------------------|
| **Base Model**       | [ibm-fms/Bamba-9B](https://huggingface.co/ibm-fms/Bamba-9B)          | [ibm-fms/Bamba-9B-fp8](https://huggingface.co/ibm-fms/Bamba-9B-fp8)     | Stage 2 pretraining                                               |
| **Base Model**       | [ibm-fms/Bamba-9B-2T](https://huggingface.co/ibm-fms/Bamba-9B-2T)    | [ibm-fms/Bamba-9B-fp8](https://huggingface.co/ibm-fms/Bamba-9B-fp8)     | Stage 1 pretraining                                               |
| **Base Model**       | [ibm-fms/Bamba-9B-1.8T](https://huggingface.co/ibm-fms/Bamba-9B-1.8T)| [ibm-fms/Bamba-9B-fp8](https://huggingface.co/ibm-fms/Bamba-9B-fp8)     | Intermediate checkpoints during Stage 1, more to come             |
| **SFT**              | coming soon                                                          | coming soon                                                             | to be released in the next drop                                   | 
| **DPO**              | coming soon                                                          | coming soon                                                             | to be released in the next drop                                   |

## Installation

Besides [PyTorch](https://pytorch.org/), you need a few [extra dependencies](https://github.com/state-spaces/mamba?tab=readme-ov-file#installation) for
Mamba models.

We found some of these dependencies to be picky about PyTorch versions when installed with pip, so
if you hit dependency issues in your environment, the best approach is to build all Mamba
dependencies from source:
```bash
git clone https://github.com/Dao-AILab/causal-conv1d.git
cd causal-conv1d && pip install . && cd ..
git clone https://github.com/state-spaces/mamba.git
cd mamba && pip install . && cd ..
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention && pip install . && cd ..
```
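
After building, a quick way to confirm that the packages are visible to your Python environment is to look them up without importing them. This is a minimal sketch: the import names `causal_conv1d`, `mamba_ssm`, and `flash_attn` are assumed to correspond to the three repositories above, and `missing_modules` is a hypothetical helper.

```python
import importlib.util

# Assumed import names for the three packages built above.
MAMBA_DEPENDENCIES = ["causal_conv1d", "mamba_ssm", "flash_attn"]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(MAMBA_DEPENDENCIES)
if missing:
    print("Not importable:", ", ".join(missing))
else:
    print("All Mamba dependencies are importable.")
```

This only checks that the modules can be located; a package built against a mismatched PyTorch version may still fail when it is actually imported.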

If you use our HF versions of the model, install the latest `transformers`, which includes the newly merged implementation of our Bamba models:
```bash
pip install git+https://github.com/huggingface/transformers.git
```

## Inference
You can utilize our newly contributed HF integration to run inference on our Bamba models:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("ibm-fms/Bamba-9B")
tokenizer = AutoTokenizer.from_pretrained("ibm-fms/Bamba-9B")

message = ["Mamba is a snake with following properties  "]
inputs = tokenizer(message, return_tensors='pt', return_token_type_ids=False)
response = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.batch_decode(response, skip_special_tokens=True)[0])

```


## Training

We trained our Bamba model with FSDP using our training repo [here](https://github.com/foundation-model-stack/fms-fsdp).
Note that this training effort started before FSDP2 and long before we contributed
`Mamba2-Hybrid` to HF, so we used FSDP1 with the [official Mamba implementation](https://github.com/state-spaces/mamba).
Users trying to reproduce the training now have many more options with our newly
contributed [HF version of Mamba2-Hybrid](https://github.com/huggingface/transformers/tree/main/src/transformers/models/bamba).


## Benchmark scores

### Base pretrained models

<table>
 <tr>
<td><strong>Category</strong>
</td>
<td><strong>Benchmark</strong>
</td>
<td><strong>Bamba 9B (2.2T)</strong>
</td>
</tr>
<tr>
<td rowspan="8" >General
</td>
<td>MMLU (5-shot)
</td>
<td>60.77
</td>
</tr>
<tr>
<td>ARC-C (25-shot)
</td>
<td>63.23
</td>
</tr>
<tr>
<td>GSM8K (5-shot)
</td>
<td>36.77
</td>
</tr>
<tr>
<td>Hellaswag (10-shot)
</td>
<td>81.8
</td>
</tr>
<tr>
<td>OpenbookQA (5-shot)
</td>
<td>47.6
</td>
</tr>
<tr>
<td>Piqa (5-shot)
</td>
<td>82.26
</td>
</tr>
<tr>
<td>TruthfulQA (0-shot)
</td>
<td>49.21
</td>
</tr>
<tr>
<td>Winogrande (5-shot)
</td>
<td>76.87
</td>
</tr>
<tr>
<td rowspan="6" >HF OpenLLM-V2*
</td>
<td>MMLU-PRO (5-shot)
</td>
<td>17.53
</td>
</tr>
<tr>
<td>BBH (3-shot)
</td>
<td>17.4
</td>
</tr>
<tr>
<td>GPQA (0-shot)
</td>
<td>4.14
</td>
</tr>
<tr>
<td>IFEval (0-shot)
</td>
<td>15.16
</td>
</tr>
<tr>
<td>MATH Lvl 5 (4-shot)
</td>
<td>1.66
</td>
</tr>
<tr>
<td>MuSR (0-shot)
</td>
<td>9.59
</td>
</tr>
<tr>
<td rowspan="4" >Safety Tasks
</td>
<td>PopQA (5-shot)
</td>
<td>20.5
</td>
</tr>
<tr>
<td>Toxigen (5-shot)
</td>
<td>57.4
</td>
</tr>
<tr>
<td>BBQ (5-shot)
</td>
<td>44.2
</td>
</tr>
<tr>
<td>Crows-pairs english (5-shot)
</td>
<td>70.78
</td>
</tr>
</table>

*For the v2 leaderboard results, we perform [normalization](https://huggingface.co/docs/leaderboards/open_llm_leaderboard/normalization) and report the normalized results.
Further details on our evaluation and normalization, along with run and analysis scripts, can be found [here](https://github.com/foundation-model-stack/bamba/blob/main/evaluation/README.md).
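
To give a feel for what normalization does to a score, here is a minimal sketch of a baseline-relative rescaling in the spirit of the linked leaderboard documentation. The exact per-task baselines and the handling of subtasks are defined there and in the evaluation README; the function below and the example raw score are illustrative assumptions, not the scripts used for the numbers in the table.

```python
def normalize_within_range(raw: float, lower_bound: float, upper_bound: float = 1.0) -> float:
    """Rescale a raw score so the random baseline maps to 0 and a perfect score to 100."""
    if raw < lower_bound:
        return 0.0  # scores below the random baseline are clipped to zero
    return (raw - lower_bound) / (upper_bound - lower_bound) * 100


# Hypothetical raw accuracy on a 4-choice task, where random guessing scores 0.25.
print(round(normalize_within_range(0.40, 0.25), 2))
```

Because the random baseline is subtracted, normalized scores on multiple-choice tasks can look much lower than the raw accuracies they come from.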



## Fine-tuning

This [example](https://github.com/foundation-model-stack/bamba/blob/main/tuning/Fine-tuning.md) shows how to fine-tune the Bamba model for a specific task using [SFT Trainer](https://huggingface.co/docs/trl/en/sft_trainer#supervised-fine-tuning-trainer).

                           
## Quantization
We can create an FP8-quantized model using [`fms-model-optimizer`](https://github.com/foundation-model-stack/fms-model-optimizer/), which makes storage and inference even more efficient.
```bash
python -m fms_mo.run_quant \
    --model_name_or_path <"path_to_original_model"> \
    --quant_method fp8 \
    --torch_dtype bfloat16 \
    --output_dir <"path_to_save_new_model">
```
Model size comparison before and after FP8:
|                     |                 original |                                                    quantized |
| :-----------------: | -----------------------: | -----------------------------------------------------------: |
|   memory (total)    |                 39.12 GB |                                                     10.83 GB |
| memory (break-down) | `torch.float32` 39.12 GB | `torch.bfloat16` 2.10 GB<br>`torch.float8_e4m3fn`    8.73 GB |

More details about `fms-model-optimizer` can be found [here](https://github.com/foundation-model-stack/fms-model-optimizer/tree/main/examples/FP8_QUANT#quickstart).
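
The figures in the size comparison can be cross-checked with a little arithmetic. The sketch below uses only the numbers from the table above and assumes 1 GB = 1e9 bytes with 4, 2, and 1 bytes per parameter for `float32`, `bfloat16`, and `float8_e4m3fn` respectively.

```python
# Figures copied from the size comparison table above (in GB).
ORIGINAL_FP32_GB = 39.12
QUANT_BF16_GB = 2.10
QUANT_FP8_GB = 8.73

quantized_total_gb = QUANT_BF16_GB + QUANT_FP8_GB
compression_ratio = ORIGINAL_FP32_GB / quantized_total_gb

# Assumption: 1 GB = 1e9 bytes, so GB divided by bytes-per-parameter gives billions of parameters.
params_from_fp32 = ORIGINAL_FP32_GB / 4                        # float32: 4 bytes each
params_from_quantized = QUANT_BF16_GB / 2 + QUANT_FP8_GB / 1   # bfloat16: 2 bytes, fp8: 1 byte

print(f"quantized total: {quantized_total_gb:.2f} GB")
print(f"size reduction vs. float32: {compression_ratio:.2f}x")
print(f"implied parameters: {params_from_fp32:.2f}B (fp32) vs. {params_from_quantized:.2f}B (quantized)")
```

Both parameter estimates agree with the 9.78B listed in the model table, which suggests that roughly 1.05B parameters are kept in `bfloat16` while the rest are stored in FP8.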


## Llama.cpp
There is preliminary work to enable running Bamba architecture models using [llama.cpp](https://github.com/ggerganov/llama.cpp). This is a work in progress, so it should only be used as a guide for the adventurous!

### Known Limitations

* Currently, inference is only supported on CPUs
* Models quantized with `llama-quantize` exhibit bad performance

### Setup
To enable Bamba support, you'll need to build from source using [Gabe's fork](https://github.com/gabe-l-hart/llama.cpp/tree/BambaArchitecture).

```sh
git clone --branch BambaArchitecture git@github.com:gabe-l-hart/llama.cpp.git
cd llama.cpp
mkdir build
cd build
# NOTE: To build with debug symbols and extra logging, use CMAKE_BUILD_TYPE=Debug
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j
```

### Conversion to GGUF
You can use a pre-converted GGUF file from Hugging Face (e.g. [bamba-9b.gguf](https://huggingface.co/ibm-fms/Bamba-9B/blob/main/bamba-9b.gguf)). If one doesn't exist, you can use the [convert_hf_to_gguf.py](https://github.com/gabe-l-hart/llama.cpp/blob/BambaArchitecture/convert_hf_to_gguf.py) script from Gabe's fork to perform the conversion manually.

```sh
# Install the python dependencies
cd /path/to/llama.cpp
pip install -r requirements/requirements-convert_hf_to_gguf.txt

# Perform the conversion
./convert_hf_to_gguf.py /path/to/bamba-model --outfile /path/to/bamba-model/bamba-model.gguf
```

### Run with llama-cli

```sh
# Run the model with no layers on the GPU (CPU-only)
cd /path/to/llama.cpp
./build/bin/llama-cli -ngl 0 -m /path/to/bamba-model/bamba-model.gguf -p "Tell me a story about a developer and their dog"
```

### Quantization with llama-quantize
You can (optionally) quantize the GGUF model using `llama.cpp`'s built-in quantization tool `llama-quantize`.

```sh
# Run the quantization (see llama-quantize --help for all quant types)
cd /path/to/llama.cpp
./build/bin/llama-quantize /path/to/bamba-model/bamba-model.gguf Q4_K_M
```

## Contributors

* **Data collection and curation**: We acknowledge and thank the AllenAI team for making the high-quality open source dataset Dolma available, as well as the Hugging Face data team for making FineWeb-edu and Cosmopedia available. These are tremendous contributions and enable us to create the model today.  
* **Data preprocessing**: We thank IBM's internal data preprocessing team, specifically Tuan Hoang Trong, Syed Zawad, Jay Gala, and Ryan Gordon for helping tokenize the data at scale. The code for tokenization is available [here](https://github.com/IBM/data-prep-kit).  
* **Model architecture**: The model architecture design was jointly done by Princeton, CMU, IBM, and UIUC and involved the following folks: Tri Dao (Princeton), Albert Gu (CMU), Linsong Chu (IBM), Davis Wertheimer (IBM), Minjia Zhang (UIUC), Mudhakar Srivatsa (IBM), and Raghu Ganti (IBM).  
* **Model training**: Model training was performed primarily by the IBM team using the Mamba2 kernels and layer implementation from Tri Dao and Albert Gu. The following folks from IBM were primarily involved: Linsong Chu, Divya Kumari, Davis Wertheimer, Raghu Ganti, and Dakshi Agrawal.  
* **Model tuning**: Tuning of the model was enabled and verified in [TRL](https://github.com/huggingface/trl) by the IBM team, involving Sukriti Sharma and Anh Uong.  
* **Model inference**: Model inference in `transformers`, `vLLM`, and `llama.cpp` builds on the kernels written by Princeton and CMU. The IBM team is working with the community to enable it in various ecosystems; the team includes Fabian Lim, Antoni viros i Martin, Adnan Hoque, Jamie Yang, Nelson Nimura Gomez, Joshua Rosenkranz, Nick Hill, and Gabe Goodhart.  
* **Quantization**: Quantization is led by the IBM team \- Naigang Wang and Charlie Liu.  
* **Evaluations**: Evaluations are led by a team in IBM with long context evaluations being performed by UIUC, involving the following folks: Yotam Perlitz, Ofir Arviv, Michal Shmueli-Scheuer (IBM), Haoechen Shen, and Minjia Zhang (UIUC).

Finally, we would like to thank our leadership for their support in this effort \- Priya Nagpurkar, David Cox, Sriram Raghavan, Aya Soffer, and Mukesh Khare.

We would also like to thank the community, in particular Pablo Montalvo-Leroux and Vaibhav Srivastav from Hugging Face, who provided valuable feedback on this blog and the PRs into transformers. Further, we would like to thank Tyler Michael Smith from Neural Magic, who is shepherding the integration with vLLM.

A huge shoutout to the Meta PyTorch, AllenAI, and Hugging Face teams for their contributions to the open initiative. FSDP allowed us to train this model smoothly, and the data from Dolma and FineWeb/Cosmopedia made this model possible today!