Update README.md
README.md CHANGED
@@ -154,12 +154,13 @@ print(tokenizer.decode(outputs[0]))
 ## Training Data
 
 Falcon-Mamba has been trained with ~5,500 GT mainly coming from [Refined-Web](https://huggingface.co/datasets/tiiuae/falcon-refinedweb), a large-volume web-only dataset filtered and deduplicated.
-Similar to the others [Falcon](https://huggingface.co/tiiuae/falcon-11B) suite models, Falcon-Mamba has been trained leveraging a multi-stage training strategy to increase the context-length
+Similar to the other [Falcon](https://huggingface.co/tiiuae/falcon-11B) suite models, Falcon-Mamba has been trained leveraging a multi-stage training strategy to increase the context length (from 2,048 to 8,192).
+Moreover, inspired by the concept of curriculum learning, we carefully selected data mixtures across the training stages, considering both data diversity and complexity.
 Note that at inference the context length is not relevant, as the Mamba architecture has no limit on long-range dependencies.
 At the last training stage, a small portion of high-quality curated data was used to further enhance performance.
 
-Overall, the data sources included RefinedWeb-English, high quality technical data, code data and
-In particular, we used samples coming from [Fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu).
+Overall, the data sources included RefinedWeb-English, high-quality technical data, code data, and math data extracted from public sources.
+In particular, we used samples coming from [Fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) during the last training stage.
 
 The data was tokenized with the Falcon-[7B](https://huggingface.co/tiiuae/falcon-7B)/[11B](https://huggingface.co/tiiuae/falcon-11B) tokenizer.
 
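As a quick illustration of the tokenizer note in the hunk above, the sketch below loads the Falcon-7B tokenizer referenced in the text and counts tokens for a sample string. This is not part of the commit; the repo id `tiiuae/falcon-7b` is assumed from the link in the section, and the snippet only shows how token counts (the "GT" figures are giga-tokens) could be reproduced with the standard `transformers` API.

```python
# Minimal sketch (not from the model card): Falcon-Mamba reuses the Falcon-7B/11B
# tokenizer, so per-text token counts can be checked with AutoTokenizer.
from transformers import AutoTokenizer

# Assumption: the tokenizer is loaded from the Falcon-7B repo linked in the section.
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")

text = "Falcon-Mamba has been trained with ~5,500 GT mainly coming from RefinedWeb."
input_ids = tokenizer(text)["input_ids"]

print(len(input_ids))               # number of tokens produced for this string
print(tokenizer.decode(input_ids))  # decoding round-trips back to the original text
```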