reshinthadith committed
Commit 46ef235 • 1 Parent(s): 004f712
Update README.md
README.md CHANGED
@@ -66,7 +66,7 @@ The first pre-training stage relies on 300B tokens sourced from various top prog
 
 ### Training Procedure
 
-The model is pre-trained on the dataset mixes mentioned above in mixed-precision (BF16), optimized with AdamW, and trained using the
+The model is pre-trained on the dataset mixes mentioned above in mixed-precision (BF16), optimized with AdamW, and trained using the StarCoder tokenizer with a vocabulary size of 49k.
 
 * **Software**: We use a fork of gpt-neox ([EleutherAI, 2021](https://github.com/EleutherAI/gpt-neox)) and train under 2D parallelism (Data and Tensor Parallel) with ZeRO-1 ([Rajbhandari et al., 2019](https://arxiv.org/abs/1910.02054v3)) and rely on flash-attention as well as rotary embedding kernels from FlashAttention-2 ([Dao et al., 2023](https://tridao.me/publications/flash2/flash2.pdf)).
 
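For readers of the updated line, the sketch below is a minimal, illustrative recreation of the ingredients it names: the StarCoder tokenizer (≈49k-token vocabulary), AdamW, and BF16 mixed precision. It is not the actual training code, which runs in the gpt-neox fork described in the **Software** bullet; the tokenizer repo id `bigcode/starcoder`, the tiny stand-in model, and all hyperparameters are assumptions made for the example.

```python
# Illustrative only: a tiny stand-in model and assumed hyperparameters, not the
# gpt-neox setup described above (no tensor parallelism, ZeRO-1, or
# FlashAttention-2 kernels here).
import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, GPT2Config, GPT2LMHeadModel

# StarCoder tokenizer (≈49k vocabulary). The repo id is an assumption; the files
# may require accepting the model license on the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder")
print(len(tokenizer))  # ≈49k tokens

# Tiny randomly initialised causal LM as a placeholder for the real model.
config = GPT2Config(vocab_size=len(tokenizer), n_positions=256, n_embd=128, n_layer=2, n_head=2)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = GPT2LMHeadModel(config).to(device)

# AdamW; lr, betas, and weight decay are assumed values, not taken from the model card.
optimizer = AdamW(model.parameters(), lr=1e-4, betas=(0.9, 0.95), weight_decay=0.1)

batch = tokenizer("def add(a, b):\n    return a + b", return_tensors="pt").to(device)
labels = batch["input_ids"].clone()

# BF16 mixed precision: compute in bfloat16 under autocast, keep FP32 master weights.
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

The real run differs in exactly the ways the **Software** bullet lists: it is launched through the gpt-neox fork with data and tensor parallelism, ZeRO-1 sharded optimizer states, and FlashAttention-2 attention and rotary-embedding kernels, none of which this single-device sketch reproduces.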