jacobfulano committed · Commit 4695bbf · 1 parent: 2885f1f
Update README.md

README.md CHANGED
@@ -74,30 +74,28 @@ corpora like English Wikipedia and BooksCorpus.
Many of these pretraining optimizations below were informed by our [BERT results for the MLPerf v2.1 speed benchmark](https://www.mosaicml.com/blog/mlperf-nlp-nov2022).

1. **MosaicML Streaming Dataset**: As part of our efficiency pipeline, we converted the C4 dataset to [MosaicML’s StreamingDataset format](https://www.mosaicml.com/blog/mosaicml-streamingdataset) and used this for both MosaicBERT-Base and the baseline BERT-Base. For all BERT-Base models, we chose the training duration to be 286,720,000 samples of sequence length 128; this covers 78.6% of C4 (see the conversion sketch after this list).

2. **Higher Masking Ratio for the Masked Language Modeling Objective**: We used the standard Masked Language Modeling (MLM) pretraining objective. While the original BERT paper also included a Next Sentence Prediction (NSP) task in the pretraining objective, subsequent papers have shown this to be unnecessary [Liu et al. 2019](https://arxiv.org/abs/1907.11692). For Hugging Face BERT-Base, we used the standard 15% masking ratio. However, we found that a 30% masking ratio led to slight accuracy improvements in both pretraining MLM and downstream GLUE performance. We therefore included this simple change as part of our MosaicBERT training recipe. Recent studies have also found that this change can lead to downstream improvements [Wettig et al. 2022](https://arxiv.org/abs/2202.08005) (see the masking sketch after this list).

3. **Bfloat16 Precision**: We use [bf16 (bfloat16) mixed precision training](https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus) for all the models, where a matrix multiplication layer uses bf16 for the multiplication and 32-bit IEEE floating point for gradient accumulation. We found this to be more stable than using float16 mixed precision (see the autocast sketch after this list).

4. **Vocab Size as a Multiple of 64**: We increased the vocab size to be a multiple of 8 as well as 64 (i.e. from 30,522 to 30,528). This small constraint is something of [a magic trick among ML practitioners](https://twitter.com/karpathy/status/1621578354024677377), and leads to a throughput speedup (see the padding sketch at the end of this section).

5. **Hyperparameters**: For all models, we use Decoupled AdamW with Beta_1 = 0.9 and Beta_2 = 0.98, and a weight decay value of 1.0e-5. The learning rate schedule begins with a warmup to a maximum learning rate of 5.0e-4, followed by a linear decay to zero. Warmup lasted for 6% of the full training duration. The global batch size was 4096 and the microbatch size was 128; at this batch size, full pretraining consisted of 70,000 batches. We set the maximum sequence length during pretraining to 128, and we used the standard embedding dimension of 768. For MosaicBERT, we applied 0.1 dropout to the feedforward layers but no dropout to the FlashAttention module, as this was not possible with the OpenAI Triton implementation (see the optimizer sketch at the end of this section).
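
A minimal sketch of the step 1 conversion, assuming the Hugging Face `datasets` package and `mosaicml-streaming`'s `MDSWriter`/`StreamingDataset`; the `allenai/c4` dataset name, the local paths, and the small sample cap are illustrative stand-ins, not the actual conversion script in the mosaicml/examples repo:

```python
# Hedged sketch: convert raw C4 text into MosaicML's MDS shard format, then stream it back.
from datasets import load_dataset                    # pip install datasets
from streaming import MDSWriter, StreamingDataset    # pip install mosaicml-streaming

# 1) Write raw text samples into compressed MDS shards on local disk.
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)
columns = {"text": "str"}                            # one raw-text field per sample
with MDSWriter(out="./c4-mds/train", columns=columns, compression="zstd") as writer:
    for i, sample in enumerate(c4):
        writer.write({"text": sample["text"]})
        if i >= 100_000:                             # demo-sized subset; the real run converts all of C4
            break

# 2) Stream the shards back; in practice `remote=` points at object storage
#    so each training rank downloads only the shards it needs.
train_data = StreamingDataset(local="./c4-mds/train", shuffle=True, batch_size=256)
print(train_data[0]["text"][:80])
```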
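
To make the step 2 masking-ratio change concrete, here is a hedged sketch using Hugging Face's `DataCollatorForLanguageModeling`; the MosaicBERT dataloader itself is defined in the mosaicml/examples repo, so this only shows where the 30% knob goes:

```python
# Hedged sketch: MLM masking with a 30% ratio instead of the 15% BERT default.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.30,  # MosaicBERT recipe; the Hugging Face BERT-Base baseline uses 0.15
)

batch = collator([tokenizer("Pretraining with a higher masking ratio.")])
print(batch["input_ids"][0])  # ~30% of non-special tokens are masked or corrupted
print(batch["labels"][0])     # -100 everywhere except the selected positions
```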
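
In Composer, step 3 is just a precision setting (`amp_bf16`); the plain-PyTorch sketch below shows the equivalent idea with `torch.autocast` (a CUDA device is assumed):

```python
# Hedged sketch: bf16 autocast keeps parameters and gradients in fp32 while running
# matrix multiplications in bfloat16.
import torch
import torch.nn as nn

model = nn.Linear(768, 768).cuda()          # stand-in for a transformer layer
optimizer = torch.optim.AdamW(model.parameters(), lr=5.0e-4)
x = torch.randn(16, 768, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(x).pow(2).mean()           # the matmul runs in bf16
loss.backward()                             # gradients land in fp32, matching the fp32 weights
optimizer.step()
optimizer.zero_grad()

# Because bf16 keeps fp32's exponent range, no loss scaling (GradScaler) is needed,
# which is one reason it tends to be more stable than fp16 mixed precision.
```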
Full configuration details for pretraining MosaicBERT-Base can be found in the configuration yamls [in the mosaicml/examples repo here](https://github.com/mosaicml/examples/tree/main/bert/yamls/main).
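
Complementing those yamls, here is a hedged sketch of the step 4 vocab padding; the rounding arithmetic is illustrative, not the actual MosaicBERT model code:

```python
# Hedged sketch: round the tokenizer vocab size up to the next multiple of 64
# before building the model, so the embedding and LM-head matmuls use
# tensor-core-friendly dimensions.
from transformers import AutoTokenizer, BertConfig, BertForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
raw_vocab = len(tokenizer)                    # 30,522 for bert-base-uncased
padded_vocab = 64 * ((raw_vocab + 63) // 64)  # 30,522 -> 30,528
print(raw_vocab, "->", padded_vocab)

config = BertConfig(vocab_size=padded_vocab)  # all other fields keep BERT-Base defaults
model = BertForMaskedLM(config)               # the extra ids are simply never emitted by the tokenizer
```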
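
Finally, a hedged sketch of the step 5 optimizer and learning rate schedule in plain PyTorch. The actual recipe uses Composer's `DecoupledAdamW` (which fully decouples weight decay from the learning rate, so the 1.0e-5 value may not transfer one-for-one to `torch.optim.AdamW`), and the linked yamls remain the source of truth:

```python
# Hedged sketch: warm up to 5.0e-4 over 6% of training, then decay linearly to zero,
# stepping once per global batch of 4096 samples.
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import LambdaLR

model = nn.Linear(768, 768)                   # stand-in for MosaicBERT-Base
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=5.0e-4,                                # maximum learning rate
    betas=(0.9, 0.98),
    weight_decay=1.0e-5,
)

total_batches = 70_000                        # 286,720,000 samples / 4096 samples per batch
warmup_batches = int(0.06 * total_batches)    # 6% warmup = 4,200 batches

def warmup_then_linear_decay(step: int) -> float:
    """LR multiplier: ramp 0 -> 1 over the warmup, then decay linearly back to 0."""
    if step < warmup_batches:
        return step / max(1, warmup_batches)
    return max(0.0, (total_batches - step) / max(1, total_batches - warmup_batches))

scheduler = LambdaLR(optimizer, lr_lambda=warmup_then_linear_decay)

# In the training loop, each global batch of 4096 is accumulated from microbatches
# of 128 sequences (max length 128), followed by optimizer.step() and scheduler.step().
```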