jacobfulano committed • Commit 3fc528a • Parent(s): eac65e8

Update README.md
README.md
CHANGED
@@ -7,23 +7,25 @@ language:
inference: false
---

- # MosaicBERT: mosaic-bert-base-seqlen-512
+ # MosaicBERT: mosaic-bert-base-seqlen-512 Pretrained Model

MosaicBERT-Base is a new BERT architecture and training recipe optimized for fast pretraining.
MosaicBERT trains faster and achieves higher pretraining and finetuning accuracy when benchmarked against
Hugging Face's [bert-base-uncased](https://huggingface.co/bert-base-uncased).

- __This model was trained with [ALiBi](https://arxiv.org/abs/2108.12409)
+ __This model was trained with [ALiBi](https://arxiv.org/abs/2108.12409) on a sequence length of 512 tokens.__
+
+ ALiBi allows a model trained with a sequence length n to easily extrapolate to sequence lengths >2n during finetuning. For more details, see [Train Short, Test Long: Attention with Linear
+ Biases Enables Input Length Extrapolation (Press et al. 2022)](https://arxiv.org/abs/2108.12409).
+
+ It is part of the **family of MosaicBERT-Base models**:

* [mosaic-bert-base](https://huggingface.co/mosaicml/mosaic-bert-base) (trained on a sequence length of 128 tokens)
* mosaic-bert-base-seqlen-512
- * [mosaic-bert-base-seqlen-1024](https://huggingface.co/mosaicml/mosaic-bert-base-seqlen-
+ * [mosaic-bert-base-seqlen-1024](https://huggingface.co/mosaicml/mosaic-bert-base-seqlen-1024)
* mosaic-bert-base-seqlen-2048 (soon)

- Biases Enables Input Length Extrapolation (Press et al. 2022)](https://arxiv.org/abs/2108.12409)

## Model Date

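The hunk above describes the checkpoint and its ALiBi-based length extrapolation. Below is a minimal usage sketch for loading it, assuming the standard Hugging Face `transformers` API, that the Hub repo ships custom MosaicBERT modeling code (hence `trust_remote_code=True`), and that the standard `bert-base-uncased` tokenizer applies; none of these specifics come from this diff.

```python
# Hypothetical usage sketch: load the MosaicBERT checkpoint for masked-language modeling.
# Assumes `transformers` is installed and that the repo requires trust_remote_code=True;
# the choice of the bert-base-uncased tokenizer is an assumption.
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained(
    "mosaicml/mosaic-bert-base-seqlen-512",
    trust_remote_code=True,
)

# ALiBi adds linear position biases to the attention scores, which is what lets the
# model be finetuned or evaluated on sequences longer than the 512 tokens seen here.
inputs = tokenizer("MosaicBERT is pretrained with [MASK] attention biases.", return_tensors="pt")
outputs = model(**inputs)
```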
@@ -136,7 +138,7 @@ corpora like English Wikipedia and BooksCorpus.
Many of these pretraining optimizations below were informed by our [BERT results for the MLPerf v2.1 speed benchmark](https://www.mosaicml.com/blog/mlperf-nlp-nov2022).

1. **MosaicML Streaming Dataset**: As part of our efficiency pipeline, we converted the C4 dataset to [MosaicML’s StreamingDataset format](https://www.mosaicml.com/blog/mosaicml-streamingdataset) and used this
- for both MosaicBERT-Base and the baseline BERT-Base. For all BERT-Base models, we chose the training duration to be 286,720,000 samples of sequence length 512
+ for both MosaicBERT-Base and the baseline BERT-Base. For all BERT-Base models, we chose the training duration to be 286,720,000 samples of **sequence length 512**; this covers 78.6% of C4.

2. **Higher Masking Ratio for the Masked Language Modeling Objective**: We used the standard Masked Language Modeling (MLM) pretraining objective.
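As a rough illustration of the StreamingDataset conversion mentioned in item 1 of the hunk above, the sketch below writes a few text samples into MDS shards and streams them back. It assumes the `mosaicml-streaming` package (imported as `streaming`); the paths, column schema, and toy samples are placeholders, not the actual C4 pipeline, which lives in the mosaicml/examples repo.

```python
# Illustrative sketch only: convert raw text samples to MosaicML's MDS format and
# stream them back during training. Paths and the toy sample list are assumptions.
from streaming import MDSWriter, StreamingDataset
from torch.utils.data import DataLoader

samples = [{"text": "Example document one."}, {"text": "Example document two."}]

# Write shards to a local directory (a remote object-store URI would also work).
with MDSWriter(out="./c4_mds_sample", columns={"text": "str"}, compression="zstd") as writer:
    for sample in samples:
        writer.write(sample)

# Stream the shards back; shuffling is handled by the dataset itself.
dataset = StreamingDataset(local="./c4_mds_sample", shuffle=True, batch_size=8)
loader = DataLoader(dataset, batch_size=8)
```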
@@ -154,8 +156,8 @@ This small constraint is something of [a magic trick among ML practitioners](htt

5. **Hyperparameters**: For all models, we use Decoupled AdamW with Beta_1=0.9 and Beta_2=0.98, and a weight decay value of 1.0e-5.
The learning rate schedule begins with a warmup to a maximum learning rate of 5.0e-4 followed by a linear decay to zero.
- Warmup lasted for 6% of the full training duration. Global batch size was set to 4096, and microbatch size was 128
- We set the maximum sequence length during pretraining to
+ Warmup lasted for 6% of the full training duration. Global batch size was set to 4096 and microbatch size to **128**; at this global batch size, full pretraining consisted of 70,000 batches.
+ We set the **maximum sequence length during pretraining to 512**, and we used the standard embedding dimension of 768.
For MosaicBERT, we applied 0.1 dropout to the feedforward layers but no dropout to the FlashAttention module, as this was not possible with the OpenAI triton implementation.
Full configuration details for pretraining MosaicBERT-Base can be found in the configuration yamls [in the mosaicml/examples repo here](https://github.com/mosaicml/examples/tree/main/bert/yamls/main).

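To make the hyperparameters in item 5 concrete, here is a hedged sketch of the optimizer and learning-rate schedule: Decoupled AdamW (as implemented in MosaicML's Composer) with a 6% warmup followed by a linear decay to zero over 70,000 batches. The `model` placeholder and the use of a plain PyTorch `LambdaLR` are assumptions for illustration; the authoritative settings are the yamls linked above.

```python
# Sketch of the stated hyperparameters; `model` is a stand-in and the LambdaLR schedule
# is one way to express "warmup then linear decay to zero", not necessarily the exact
# scheduler object used in the MosaicBERT training configs.
import torch
from composer.optim import DecoupledAdamW  # Decoupled AdamW from MosaicML Composer

model = torch.nn.Linear(768, 768)  # placeholder for the actual BERT-Base model

optimizer = DecoupledAdamW(
    model.parameters(),
    lr=5.0e-4,            # maximum learning rate
    betas=(0.9, 0.98),
    weight_decay=1.0e-5,
)

total_batches = 70_000                       # 286,720,000 samples / global batch size 4096
warmup_batches = int(0.06 * total_batches)   # warmup is 6% of the training duration

def warmup_then_linear_decay(step: int) -> float:
    # Linear warmup to the peak LR, then linear decay to zero by the final batch.
    if step < warmup_batches:
        return step / max(1, warmup_batches)
    return max(0.0, (total_batches - step) / max(1, total_batches - warmup_batches))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, warmup_then_linear_decay)
```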