jacobfulano committed
Commit ce11e47
Parent(s): fdbb682
Update README.md
README.md
CHANGED
@@ -145,7 +145,7 @@ The learning rate schedule begins with a warmup to a maximum learning rate of 5.
 Warmup lasted for 6% of the full training duration. Global batch size was set to 4096, and microbatch size was 128; at this global batch size, full pretraining consisted of 70,000 batches.
 We set the maximum sequence length during pretraining to 128, and we used the standard embedding dimension of 768.
 For MosaicBERT, we applied 0.1 dropout to the feedforward layers but no dropout to the FlashAttention module, as this was not possible with the OpenAI Triton implementation.
-Full configuration details for pretraining MosaicBERT-Base can be found in the configuration yamls [in the mosaicml/examples repo](https://github.com/mosaicml/examples/
+Full configuration details for pretraining MosaicBERT-Base can be found in the configuration yamls [in the mosaicml/examples repo](https://github.com/mosaicml/examples/blob/main/examples/benchmarks/bert/yamls/main/mosaic-bert-base-uncased.yaml).
 
 
 ## Evaluation results
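For orientation, below is a minimal sketch of how the hyperparameters described in this hunk might be expressed in a Composer-style configuration YAML. The field names (`max_seq_len`, `global_train_batch_size`, `device_train_microbatch_size`, `t_warmup`, `max_duration`) are assumptions modeled on the mosaicml/examples BERT benchmark yamls, not the contents of the linked mosaic-bert-base-uncased.yaml.

```yaml
# Illustrative sketch only: field names are assumptions in the style of
# Composer / mosaicml-examples yamls, not copied from the linked file.

max_seq_len: 128                    # maximum sequence length during pretraining

model:
  name: mosaic_bert
  pretrained_model_name: bert-base-uncased   # standard 768-dim embeddings

# 4096 global / 128 microbatch = 32 gradient-accumulation steps per optimizer step
global_train_batch_size: 4096
device_train_microbatch_size: 128

scheduler:
  name: linear_decay_with_warmup
  t_warmup: 0.06dur                 # warmup for 6% of the full training duration

max_duration: 70000ba               # 70,000 batches at global batch size 4096
```

As a sanity check on the numbers: 4096 / 128 = 32 microbatches are accumulated per optimizer step, and 70,000 batches at 4096 samples each is roughly 287M training samples.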