jacobfulano committed • Commit 3fc528a • Parent(s): eac65e8

Update README.md
README.md
CHANGED
@@ -7,23 +7,25 @@ language:
inference: false
---

- # MosaicBERT: mosaic-bert-base-seqlen-512
+ # MosaicBERT: mosaic-bert-base-seqlen-512 Pretrained Model

MosaicBERT-Base is a new BERT architecture and training recipe optimized for fast pretraining.
MosaicBERT trains faster and achieves higher pretraining and finetuning accuracy when benchmarked against
Hugging Face's [bert-base-uncased](https://huggingface.co/bert-base-uncased).

- __This model was trained with [ALiBi](https://arxiv.org/abs/2108.12409)
+ __This model was trained with [ALiBi](https://arxiv.org/abs/2108.12409) on a sequence length of 512 tokens.__
+
+ ALiBi allows a model trained with a sequence length n to easily extrapolate to sequence lengths >2n during finetuning. For more details, see [Train Short, Test Long: Attention with Linear
+ Biases Enables Input Length Extrapolation (Press et al. 2022)](https://arxiv.org/abs/2108.12409).
+
+ It is part of the **family of MosaicBERT-Base models**:

* [mosaic-bert-base](https://huggingface.co/mosaicml/mosaic-bert-base) (trained on a sequence length of 128 tokens)
* mosaic-bert-base-seqlen-512
- * [mosaic-bert-base-seqlen-1024](https://huggingface.co/mosaicml/mosaic-bert-base-seqlen-
+ * [mosaic-bert-base-seqlen-1024](https://huggingface.co/mosaicml/mosaic-bert-base-seqlen-1024)
* mosaic-bert-base-seqlen-2048 (soon)

- Biases Enables Input Length Extrapolation (Press et al. 2022)](https://arxiv.org/abs/2108.12409)

## Model Date

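The hunk above describes the checkpoint and its ALiBi-based length extrapolation. Below is a minimal usage sketch for loading it, assuming the standard Hugging Face `transformers` API, that the Hub repo ships custom MosaicBERT modeling code (hence `trust_remote_code=True`), and that the standard `bert-base-uncased` tokenizer applies; none of these specifics come from this diff.

```python
# Hypothetical usage sketch: load the MosaicBERT checkpoint for masked-language modeling.
# Assumes `transformers` is installed and that the repo requires trust_remote_code=True;
# the choice of the bert-base-uncased tokenizer is an assumption.
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained(
    "mosaicml/mosaic-bert-base-seqlen-512",
    trust_remote_code=True,
)

# ALiBi adds linear position biases to the attention scores, which is what lets the
# model be finetuned or evaluated on sequences longer than the 512 tokens seen here.
inputs = tokenizer("MosaicBERT is pretrained with [MASK] attention biases.", return_tensors="pt")
outputs = model(**inputs)
```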
@@ -136,7 +138,7 @@ corpora like English Wikipedia and BooksCorpus.
Many of these pretraining optimizations below were informed by our [BERT results for the MLPerf v2.1 speed benchmark](https://www.mosaicml.com/blog/mlperf-nlp-nov2022).

1. **MosaicML Streaming Dataset**: As part of our efficiency pipeline, we converted the C4 dataset to [MosaicML’s StreamingDataset format](https://www.mosaicml.com/blog/mosaicml-streamingdataset) and used this
- for both MosaicBERT-Base and the baseline BERT-Base. For all BERT-Base models, we chose the training duration to be 286,720,000 samples of sequence length 512
+ for both MosaicBERT-Base and the baseline BERT-Base. For all BERT-Base models, we chose the training duration to be 286,720,000 samples of **sequence length 512**; this covers 78.6% of C4.

2. **Higher Masking Ratio for the Masked Language Modeling Objective**: We used the standard Masked Language Modeling (MLM) pretraining objective.
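As a rough illustration of the StreamingDataset conversion mentioned in item 1 of the hunk above, the sketch below writes a few text samples into MDS shards and streams them back. It assumes the `mosaicml-streaming` package (imported as `streaming`); the paths, column schema, and toy samples are placeholders, not the actual C4 pipeline, which lives in the mosaicml/examples repo.

```python
# Illustrative sketch only: convert raw text samples to MosaicML's MDS format and
# stream them back during training. Paths and the toy sample list are assumptions.
from streaming import MDSWriter, StreamingDataset
from torch.utils.data import DataLoader

samples = [{"text": "Example document one."}, {"text": "Example document two."}]

# Write shards to a local directory (a remote object-store URI would also work).
with MDSWriter(out="./c4_mds_sample", columns={"text": "str"}, compression="zstd") as writer:
    for sample in samples:
        writer.write(sample)

# Stream the shards back; shuffling is handled by the dataset itself.
dataset = StreamingDataset(local="./c4_mds_sample", shuffle=True, batch_size=8)
loader = DataLoader(dataset, batch_size=8)
```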
@@ -154,8 +156,8 @@ This small constraint is something of [a magic trick among ML practitioners](htt

5. **Hyperparameters**: For all models, we use Decoupled AdamW with Beta_1=0.9 and Beta_2=0.98, and a weight decay value of 1.0e-5.
The learning rate schedule begins with a warmup to a maximum learning rate of 5.0e-4 followed by a linear decay to zero.
- Warmup lasted for 6% of the full training duration. Global batch size was set to 4096, and microbatch size was 128
- We set the maximum sequence length during pretraining to
+ Warmup lasted for 6% of the full training duration. Global batch size was set to 4096 and microbatch size to **128**; at this global batch size, full pretraining consisted of 70,000 batches.
+ We set the **maximum sequence length during pretraining to 512**, and we used the standard embedding dimension of 768.
For MosaicBERT, we applied 0.1 dropout to the feedforward layers but no dropout to the FlashAttention module, as this was not possible with the OpenAI triton implementation.
Full configuration details for pretraining MosaicBERT-Base can be found in the configuration yamls [in the mosaicml/examples repo here](https://github.com/mosaicml/examples/tree/main/bert/yamls/main).

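To make the hyperparameters in item 5 concrete, here is a hedged sketch of the optimizer and learning-rate schedule: Decoupled AdamW (as implemented in MosaicML's Composer) with a 6% warmup followed by a linear decay to zero over 70,000 batches. The `model` placeholder and the use of a plain PyTorch `LambdaLR` are assumptions for illustration; the authoritative settings are the yamls linked above.

```python
# Sketch of the stated hyperparameters; `model` is a stand-in and the LambdaLR schedule
# is one way to express "warmup then linear decay to zero", not necessarily the exact
# scheduler object used in the MosaicBERT training configs.
import torch
from composer.optim import DecoupledAdamW  # Decoupled AdamW from MosaicML Composer

model = torch.nn.Linear(768, 768)  # placeholder for the actual BERT-Base model

optimizer = DecoupledAdamW(
    model.parameters(),
    lr=5.0e-4,            # maximum learning rate
    betas=(0.9, 0.98),
    weight_decay=1.0e-5,
)

total_batches = 70_000                       # 286,720,000 samples / global batch size 4096
warmup_batches = int(0.06 * total_batches)   # warmup is 6% of the training duration

def warmup_then_linear_decay(step: int) -> float:
    # Linear warmup to the peak LR, then linear decay to zero by the final batch.
    if step < warmup_batches:
        return step / max(1, warmup_batches)
    return max(0.0, (total_batches - step) / max(1, total_batches - warmup_batches))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, warmup_then_linear_decay)
```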