jacobfulano committed on
Commit 3fc528a
1 Parent(s): eac65e8

Update README.md

Files changed (1)
  1. README.md +11 -9
README.md CHANGED
@@ -7,23 +7,25 @@ language:
 inference: false
 ---
 
-# MosaicBERT: mosaic-bert-base-seqlen-512 pretrained model
+# MosaicBERT: mosaic-bert-base-seqlen-512 Pretrained Model
 
 MosaicBERT-Base is a new BERT architecture and training recipe optimized for fast pretraining.
 MosaicBERT trains faster and achieves higher pretraining and finetuning accuracy when benchmarked against
 Hugging Face's [bert-base-uncased](https://huggingface.co/bert-base-uncased).
 
-__This model was trained with [ALiBi](https://arxiv.org/abs/2108.12409) and a sequence length of 512 tokens.__
+__This model was trained with [ALiBi](https://arxiv.org/abs/2108.12409) on a sequence length of 512 tokens.__
 
-It is part of the family of MosaicBERT-Base models:
+ALiBi allows a model trained with a sequence length n to easily extrapolate to sequence lengths >2n during finetuning. For more details, see [Train Short, Test Long: Attention with Linear
+Biases Enables Input Length Extrapolation (Press et al. 2022)](https://arxiv.org/abs/2108.12409)
+
+It is part of the **family of MosaicBERT-Base models**:
 
 * [mosaic-bert-base](https://huggingface.co/mosaicml/mosaic-bert-base) (trained on a sequence length of 128 tokens)
 * mosaic-bert-base-seqlen-512
-* [mosaic-bert-base-seqlen-1024](https://huggingface.co/mosaicml/mosaic-bert-base-seqlen-1024)
+* [mosaic-bert-base-seqlen-1024](https://huggingface.co/mosaicml/mosaic-bert-base-seqlen-512)
 * mosaic-bert-base-seqlen-2048 (soon)
 
-* ALiBi allows a model trained with a sequence length n to extrapolate to sequence lengths >2n. For more details, see [Train Short, Test Long: Attention with Linear
-Biases Enables Input Length Extrapolation (Press et al. 2022)](https://arxiv.org/abs/2108.12409)
+
 
 ## Model Date
 
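For the 512-token checkpoint described above, MosaicBERT is typically loaded through `transformers` with `trust_remote_code=True`, since the architecture ships custom modeling code. The snippet below is a minimal, hedged sketch of that flow; the fill-mask pipeline and the use of the bert-base-uncased tokenizer are assumptions not spelled out in this diff.

```python
# Hedged sketch: load the 512-token MosaicBERT checkpoint for masked-LM inference.
# trust_remote_code=True is needed because MosaicBERT uses custom modeling code.
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed: standard BERT uncased vocab
model = AutoModelForMaskedLM.from_pretrained(
    "mosaicml/mosaic-bert-base-seqlen-512", trust_remote_code=True
)

fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill_mask("MosaicBERT was pretrained on sequences of up to 512 [MASK]."))
```

Because of ALiBi, the same checkpoint can in principle be finetuned at sequence lengths beyond 512 by raising the maximum sequence length at finetuning time, though the exact configuration for that is not part of this diff.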
 
@@ -136,7 +138,7 @@ corpora like English Wikipedia and BooksCorpus.
 Many of these pretraining optimizations below were informed by our [BERT results for the MLPerf v2.1 speed benchmark](https://www.mosaicml.com/blog/mlperf-nlp-nov2022).
 
 1. **MosaicML Streaming Dataset**: As part of our efficiency pipeline, we converted the C4 dataset to [MosaicML’s StreamingDataset format](https://www.mosaicml.com/blog/mosaicml-streamingdataset) and used this
-for both MosaicBERT-Base and the baseline BERT-Base. For all BERT-Base models, we chose the training duration to be 286,720,000 samples of sequence length 512; this covers 78.6% of C4.
+for both MosaicBERT-Base and the baseline BERT-Base. For all BERT-Base models, we chose the training duration to be 286,720,000 samples of **sequence length 512**; this covers 78.6% of C4.
 
 
 2. **Higher Masking Ratio for the Masked Language Modeling Objective**: We used the standard Masked Language Modeling (MLM) pretraining objective.
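The C4 conversion itself is not shown in this hunk; as a rough sketch of the StreamingDataset idea (assuming the `mosaicml-streaming` package, with illustrative paths and a single `text` column rather than the exact C4 setup), documents are written once to MDS shards and then streamed back at training time:

```python
# Hedged sketch of the StreamingDataset workflow: write text samples into MDS
# shards, then stream them back for training. Paths and columns are illustrative,
# not the exact ones used for the C4 conversion described above.
from streaming import MDSWriter, StreamingDataset

samples = [{"text": "A first C4-style document."},
           {"text": "Another short document."}]

# One-time conversion: serialize samples into compressed MDS shards.
with MDSWriter(out="./c4_mds/train", columns={"text": "str"}, compression="zstd") as writer:
    for sample in samples:
        writer.write(sample)

# Training-time read path; in a real run, remote= would point at object storage
# and shuffle/batch_size would be configured for the dataloader.
dataset = StreamingDataset(local="./c4_mds/train", shuffle=False)
for sample in dataset:
    print(sample["text"])
```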
@@ -154,8 +156,8 @@ This small constraint is something of [a magic trick among ML practitioners](htt
 
 5. **Hyperparameters**: For all models, we use Decoupled AdamW with Beta_1=0.9 and Beta_2=0.98, and a weight decay value of 1.0e-5.
 The learning rate schedule begins with a warmup to a maximum learning rate of 5.0e-4 followed by a linear decay to zero.
-Warmup lasted for 6% of the full training duration. Global batch size was set to 4096, and microbatch size was 128; since global batch size was 4096, full pretraining consisted of 70,000 batches.
-We set the maximum sequence length during pretraining to 128, and we used the standard embedding dimension of 768.
+Warmup lasted for 6% of the full training duration. Global batch size was set to 4096, and microbatch size was **128**; since global batch size was 4096, full pretraining consisted of 70,000 batches.
+We set the **maximum sequence length during pretraining to 512**, and we used the standard embedding dimension of 768.
 For MosaicBERT, we applied 0.1 dropout to the feedforward layers but no dropout to the FlashAttention module, as this was not possible with the OpenAI triton implementation.
 Full configuration details for pretraining MosaicBERT-Base can be found in the configuration yamls [in the mosaicml/examples repo here](https://github.com/mosaicml/examples/tree/main/bert/yamls/main).
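The optimizer and schedule named in item 5 map onto classes in MosaicML's Composer library. Below is a minimal sketch of those settings with a stand-in module; the full BERT model, dataloader, and Trainer wiring live in the linked yamls and are not reproduced here.

```python
# Hedged sketch: the Decoupled AdamW settings and warmup-then-linear-decay schedule
# from item 5, expressed with Composer's optimizer and scheduler classes.
import torch
from composer.optim import DecoupledAdamW
from composer.optim.scheduler import LinearWithWarmupScheduler

model = torch.nn.Linear(768, 768)  # stand-in for the actual (Mosaic)BERT module

optimizer = DecoupledAdamW(
    model.parameters(),
    lr=5.0e-4,             # peak learning rate reached at the end of warmup
    betas=(0.9, 0.98),
    weight_decay=1.0e-5,
)

# Warm up over the first 6% of training, then decay linearly to zero (alpha_f=0.0).
scheduler = LinearWithWarmupScheduler(t_warmup="0.06dur", alpha_f=0.0)

# Both objects would then be passed to composer.Trainer(...) together with the
# dataloader built from the StreamingDataset shards.
```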
 
 