Create README.md
README.md
ADDED
@@ -0,0 +1,166 @@
1 |
+
---
|
2 |
+
license: other
|
3 |
+
license_name: falcon-mamba-license
|
4 |
+
license_link: https://falconllm.tii.ae/falcon-mamba-7b-terms-and-conditions.html
|
5 |
+
base_model: tiiuae/falcon-mamba-7b-instruct
|
6 |
+
language:
|
7 |
+
- en
|
8 |
+
datasets:
|
9 |
+
- tiiuae/falcon-refinedweb
|
10 |
+
---
|
11 |
+
|
12 |
+
<img src="https://huggingface.co/datasets/tiiuae/documentation-images/resolve/main/falcon_mamba/thumbnail.png" alt="drawing" width="800"/>
|
13 |
+
|
14 |
+
**GGUF quantization of [`falcon-mamba-7b-instruct`](https://huggingface.co/tiiuae/falcon-mamba-7b-instruct) in the format `F16`**
|
15 |
+
|
16 |
+
# Table of Contents
|
17 |
+
|
18 |
+
0. [TL;DR](#tldr)
|
19 |
+
1. [Model Details](#model-details)
|
20 |
+
2. [Usage](#usage)
|
21 |
+
3. [Training Details](#training-details)
|
22 |
+
4. [Evaluation](#evaluation)
|
23 |
+
|
24 |
+
|
25 |
+
# TL;DR
|
26 |
+
|
27 |
+
# Model Details
|
28 |
+
|
29 |
+
## Model Description
|
30 |
+
|
31 |
+
- **Developed by:** [https://www.tii.ae](https://www.tii.ae)
|
32 |
+
- **Model type:** Causal decoder-only
|
33 |
+
- **Architecture:** Mamba
|
34 |
+
- **Language(s) (NLP):** Mainly English
|
35 |
+
- **License:** TII Falcon-Mamba License 2.0
|
36 |
+
|
37 |
+
<br>
|
38 |
+
|
39 |
+
# Usage
|
40 |
+
|
41 |
+
Refer to the documentation of [`llama.cpp`](https://github.com/ggerganov/llama.cpp) to understand how to run this model locally on your machine.
|
42 |
+
|
43 |
+
Download the GGUF weights with the command below:
|
44 |
+
|
45 |
+
```bash
|
46 |
+
huggingface-cli download tiiuae/falcon-mamba-7b-instruct-F16-GGUF --include falcon-mamba-instruct-F16.gguf --local-dir ./
|
47 |
+
```
|
48 |
+
Then you can run it with:
|
49 |
+
```bash
|
50 |
+
./llama-cli -m falcon-mamba-instruct-F16.gguf -p "Hello how are you?"
|
51 |
+
```
|
52 |
+
|
53 |
+
# Training Details
|
54 |
+
|
55 |
+
## Training Data
|
56 |
+
|
57 |
+
Falcon-Mamba was trained on approximately 5,500 GT (gigatokens), mainly from [Refined-Web](https://huggingface.co/datasets/tiiuae/falcon-refinedweb), a large-volume, web-only dataset that has been filtered and deduplicated.
|
58 |
+
Similar to the other models in the [Falcon](https://huggingface.co/tiiuae/falcon-11B) suite, Falcon-Mamba was trained with a multi-stage strategy that increased the context length from 2,048 to 8,192.
|
59 |
+
Moreover, inspired by the concept of Curriculum Learning, we carefully selected data mixtures throughout the training stages, considering both data diversity and complexity.
|
60 |
+
Note that at inference time the context length is not a limiting factor, as the Mamba architecture places no limit on long-range dependencies.
|
61 |
+
In the last training stage, a small portion of high-quality curated data was used to further enhance performance.
|
62 |
+
|
63 |
+
Overall, the data sources included RefinedWeb-English, high-quality technical data, code data, and math data extracted from public sources.
|
64 |
+
In particular, we used samples coming from [Fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) during our last training stage.
|
65 |
+
|
66 |
+
The data was tokenized with the Falcon-[7B](https://huggingface.co/tiiuae/falcon-7B)/[11B](https://huggingface.co/tiiuae/falcon-11B) tokenizer.
|
67 |
+
|
68 |
+
## Training Procedure
|
69 |
+
Falcon-Mamba-7B was trained on 256 H100 80GB GPUs for the majority of the training, using a 3D parallelism strategy (TP=1, PP=1, DP=256) combined with ZeRO.
|
70 |
+
|
71 |
+
### Training Hyperparameters
|
72 |
+
|
73 |
+
| **Hyperparameter** | **Value** | **Comment** |
|
74 |
+
|--------------------|------------|-------------------------------------------|
|
75 |
+
| Precision | `bfloat16` | |
|
76 |
+
| Optimizer | AdamW | |
|
77 |
+
| Max learning rate | 6.4e-4 | Following a WSD (warmup-stable-decay) learning rate schedule |
|
78 |
+
| Weight decay | 1e-1 | |
|
79 |
+
| Batch size | 2048 | |
|
80 |
+
|
81 |
+
|
82 |
+
The model was trained with the AdamW optimizer, a WSD (warmup-stable-decay) learning rate schedule, and a batch size ramp-up from \\(b_{\mathrm{min}}=128\\) to \\(b_{\mathrm{max}}=2048\\) during the first 50 GT of training.
|
83 |
+
In the stable phase we used the maximal learning rate \\(\eta_{\mathrm{max}}=6.4 \times 10^{-4}\\), and then decayed it to the minimal value \\(\eta_{\mathrm{min}}=\frac{\eta_{\mathrm{max}}}{256}\\) with an exponential schedule over 500 GT.
|
84 |
+
Also, we applied *BatchScaling* during the ramp-up: the learning rate \\(\eta\\) is rescaled so that the Adam noise temperature \\(T_{\mathrm{noise}}\equiv\frac{\eta}{\sqrt{b}}\\) is kept constant.
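
As a rough illustration of the schedule described above, the Python sketch below computes the BatchScaling learning rate and the exponential decay. The function names, the choice to anchor the noise temperature at \\(b_{\mathrm{max}}\\), and the exact exponential form (a constant multiplicative decay per GT) are assumptions made for illustration; the internal training code may differ.

```python
import math

ETA_MAX = 6.4e-4          # maximal learning rate (from the text)
ETA_MIN = ETA_MAX / 256   # minimal learning rate (from the text)
B_MAX = 2048              # final batch size (from the text)
DECAY_GT = 500            # length of the decay phase in GT (from the text)


def batch_scaled_lr(batch_size: int) -> float:
    """Learning rate that keeps eta / sqrt(b) equal to ETA_MAX / sqrt(B_MAX).

    Assumption: the constant noise temperature is the one reached at B_MAX.
    """
    return ETA_MAX * math.sqrt(batch_size / B_MAX)


def decayed_lr(gt_into_decay: float) -> float:
    """Exponential interpolation from ETA_MAX to ETA_MIN over DECAY_GT.

    Assumption: "exponential schedule" means a constant multiplicative
    decay per token; the input is clamped to the decay window.
    """
    frac = min(max(gt_into_decay / DECAY_GT, 0.0), 1.0)
    return ETA_MAX * (ETA_MIN / ETA_MAX) ** frac


if __name__ == "__main__":
    for b in (128, 512, 2048):
        print(f"b={b:5d}  lr={batch_scaled_lr(b):.2e}")
    for t in (0, 250, 500):
        print(f"decay GT={t:3d}  lr={decayed_lr(t):.2e}")
```

Under these assumptions, the learning rate at \\(b=128\\) is one quarter of \\(\eta_{\mathrm{max}}\\), since \\(\sqrt{128/2048}=0.25\\), and halfway through the decay phase it is \\(\eta_{\mathrm{max}}/16\\).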
|
85 |
+
|
86 |
+
### Speeds, Sizes, Times
|
87 |
+
|
88 |
+
The model training took roughly two months.
|
89 |
+
|
90 |
+
<br>
|
91 |
+
|
92 |
+
# Evaluation
|
93 |
+
|
94 |
+
## Benchmarks
|
95 |
+
|
96 |
+
We evaluate our model on all benchmarks of the new version of the leaderboard using the `lm-evaluation-harness` package, and then normalize the evaluation results with the Hugging Face score normalization.
|
97 |
+
|
98 |
+
|
99 |
+
| `model name` |`IFEval`| `BBH` |`MATH LvL5`| `GPQA`| `MUSR`|`MMLU-PRO`|`Average`|
|
100 |
+
|:--------------------------|:------:|:-----:|:---------:|:-----:|:-----:|:--------:|:-------:|
|
101 |
+
| ***Pure SSM models*** | | | | | | | |
|
102 |
+
| `FalconMamba-7B` | 33.36 | 19.88 | 3.63 |8.05 |10.86 | 14.47 |**15.04**|
|
103 |
+
| `TRI-ML/mamba-7b-rw`<sup>*</sup>| 22.46 | 6.71 | 0.45 | 1.12 | 5.51 | 1.69 | 6.25 |
|
104 |
+
|***Hybrid SSM-attention models*** | | | | | | | |
|
105 |
+
|`recurrentgemma-9b` | 30.76 | 14.80 | 4.83 | 4.70 | 6.60 | 17.88 | 13.20 |
|
106 |
+
| `Zyphra/Zamba-7B-v1`<sup>*</sup> | 24.06 | 21.12 | 3.32 | 3.03 | 7.74 | 16.02 | 12.55 |
|
107 |
+
|***Transformer models*** | | | | | | | |
|
108 |
+
| `Falcon2-11B` | 32.61 | 21.94 | 2.34 | 2.80 | 7.53 | 15.44 | 13.78 |
|
109 |
+
| `Meta-Llama-3-8B` | 14.55 | 24.50 | 3.25 | 7.38 | 6.24 | 24.55 | 13.41 |
|
110 |
+
| `Meta-Llama-3.1-8B` | 12.70 | 25.29 | 4.61 | 6.15 | 8.98 | 24.95 | 13.78 |
|
111 |
+
| `Mistral-7B-v0.1` | 23.86 | 22.02 | 2.49 | 5.59 | 10.68 | 22.36 | 14.50 |
|
112 |
+
| `Mistral-Nemo-Base-2407 (12B)` | 16.83 | 29.37 | 4.98 | 5.82 | 6.52 | 27.46 | 15.08 |
|
113 |
+
| `gemma-7B` | 26.59 | 21.12 | 6.42 | 4.92 | 10.98 | 21.64 |**15.28**|
|
114 |
+
|
115 |
+
|
116 |
+
Also, we evaluate our model on the benchmarks of the first leaderboard using `lighteval`.
|
117 |
+
|
118 |
+
|
119 |
+
| `model name` |`ARC`|`HellaSwag` |`MMLU` |`Winogrande`|`TruthfulQA`|`GSM8K`|`Average` |
|
120 |
+
|:-----------------------------|:------:|:---------:|:-----:|:----------:|:----------:|:-----:|:----------------:|
|
121 |
+
| ***Pure SSM models*** | | | | | | | |
|
122 |
+
| `FalconMamba-7B`<sup>*</sup> | 62.03 | 80.82 | 62.11 | 73.64 | 53.42 | 52.54 | **64.09** |
|
123 |
+
| `TRI-ML/mamba-7b-rw`<sup>*</sup> | 51.25 | 80.85 | 33.41 | 71.11 | 32.08 | 4.70 | 45.52 |
|
124 |
+
|***Hybrid SSM-attention models***| | | | | | | |
|
125 |
+
| `recurrentgemma-9b`<sup>**</sup> |52.00 | 80.40 | 60.50 | 73.60 | 38.60 | 42.60 | 57.95 |
|
126 |
+
| `Zyphra/Zamba-7B-v1`<sup>*</sup> | 56.14 | 82.23 | 58.11 | 79.87 | 52.88 | 30.78 | 60.00 |
|
127 |
+
|***Transformer models*** | | | | | | | |
|
128 |
+
| `Falcon2-11B` | 59.73 | 82.91 | 58.37 | 78.30 | 52.56 | 53.83 | **64.28** |
|
129 |
+
| `Meta-Llama-3-8B` | 60.24 | 82.23 | 66.70 | 78.45 | 42.93 | 45.19 | 62.62 |
|
130 |
+
| `Meta-Llama-3.1-8B` | 58.53 | 82.13 | 66.43 | 74.35 | 44.29 | 47.92 | 62.28 |
|
131 |
+
| `Mistral-7B-v0.1` | 59.98 | 83.31 | 64.16 | 78.37 | 42.15 | 37.83 | 60.97 |
|
132 |
+
| `gemma-7B` | 61.09 | 82.20 | 64.56 | 79.01 | 44.79 | 50.87 | 63.75 |
|
133 |
+
|
134 |
+
Most evaluation results were taken from the two leaderboards. For the models marked with one star (<sup>*</sup>) we ran the evaluations internally, while for the models marked with two stars (<sup>**</sup>) the results were taken from the corresponding paper or model card.
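
The `Average` column in both tables appears to be the unweighted mean of the six benchmark scores in the row, rounded to two decimals. The short Python check below reproduces it for the `FalconMamba-7B` rows; the helper name `leaderboard_average` is hypothetical and not part of any evaluation package.

```python
def leaderboard_average(scores: list[float]) -> float:
    """Unweighted mean of per-benchmark scores, rounded to two decimals."""
    return round(sum(scores) / len(scores), 2)


# FalconMamba-7B scores copied from the two tables above.
falcon_mamba_new = [33.36, 19.88, 3.63, 8.05, 10.86, 14.47]    # IFEval ... MMLU-PRO
falcon_mamba_old = [62.03, 80.82, 62.11, 73.64, 53.42, 52.54]  # ARC ... GSM8K

if __name__ == "__main__":
    print(leaderboard_average(falcon_mamba_new))
    print(leaderboard_average(falcon_mamba_old))
```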
|
135 |
+
|
136 |
+
# Technical Specifications
|
137 |
+
|
138 |
+
## Model Architecture and Objective
|
139 |
+
|
140 |
+
Falcon-Mamba-7B is a causal decoder-only model trained on a causal language modeling task (i.e., predict the next token).
|
141 |
+
|
142 |
+
The model is based on the Mamba architecture ([Gu et al., 2023](https://arxiv.org/abs/2312.00752)).
|
143 |
+
|
144 |
+
| **Hyperparameter** | **Value** | **Comment** |
|
145 |
+
|--------------------|-----------|----------------------------------------|
|
146 |
+
| Layers | 64 | Number of layers |
|
147 |
+
| `d_model` | 4096 | Hidden dimension |
|
148 |
+
| `d_state` | 16 | The SSM state dimension |
|
149 |
+
| Vocabulary | 65024 | Vocabulary Size |
|
150 |
+
| Sequence length | 8192 | During the last training stages |
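
To give a sense of why inference memory does not grow with sequence length, the Python sketch below estimates the size of the recurrent SSM state from the values in the table. It assumes the Mamba default expansion factor of 2 (so the inner dimension is `2 * d_model`), which is not stated in this card, and it ignores the small convolution state; treat the result as an order-of-magnitude estimate only.

```python
def ssm_state_elements(n_layers: int, d_model: int, d_state: int, expand: int = 2) -> int:
    """Number of elements in the recurrent SSM state across all layers.

    Assumption: each layer keeps a (expand * d_model) x d_state state,
    with expand=2 as in the reference Mamba configuration.
    """
    d_inner = expand * d_model
    return n_layers * d_inner * d_state


if __name__ == "__main__":
    n = ssm_state_elements(n_layers=64, d_model=4096, d_state=16)
    # Two bytes per element if the state were stored in 16-bit precision.
    print(f"{n:,} state elements, about {n * 2 / 2**20:.0f} MiB at 2 bytes each")
```

Under these assumptions the state holds 8,388,608 elements per sequence regardless of how many tokens have been processed.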
|
151 |
+
|
152 |
+
## Compute Infrastructure
|
153 |
+
|
154 |
+
### Hardware
|
155 |
+
|
156 |
+
Falcon-Mamba-7B was trained on AWS SageMaker, using on average 256 H100 80GB GPUs in 32 p5 instances.
|
157 |
+
|
158 |
+
### Software
|
159 |
+
|
160 |
+
Falcon-Mamba-7B was trained on an internal distributed training codebase, Gigatron. It uses a 3D parallelism approach combined with ZeRO and high-performance Triton kernels.
|
161 |
+
|
162 |
+
<br>
|
163 |
+
|
164 |
+
# Citation
|
165 |
+
|
166 |
+
*Paper coming soon* 😊.
|