jacobfulano committed
Commit • 64bd935
1 Parent(s): 5acd316
Update README.md
Update hyperlinks to mosaicml/examples repo
README.md
CHANGED
@@ -20,7 +20,7 @@ March 2023
 ## Documentation
 
 * [Blog post](https://www.mosaicml.com/blog/mosaicbert)
-* [Github (mosaicml/examples/bert
+* [Github (mosaicml/examples/tree/main/examples/benchmarks/bert)](https://github.com/mosaicml/examples/tree/main/examples/benchmarks/bert)
 
 ## How to use
 
@@ -84,14 +84,14 @@ reduces the number of read/write operations between the GPU HBM (high bandwidth
 this embedding allows subsequent layers to keep track of the order of tokens in a sequence. ALiBi eliminates position embeddings and
 instead conveys this information using a bias matrix in the attention operation. It modifies the attention mechanism such that nearby
 tokens strongly attend to one another [[Press et al. 2021]](https://arxiv.org/abs/2108.12409). In addition to improving the performance of the final model, ALiBi helps the
-model to handle sequences longer than it saw during training. Details on our ALiBi implementation can be found [in the mosaicml/examples repo here](https://github.com/mosaicml/examples/
+model to handle sequences longer than it saw during training. Details on our ALiBi implementation can be found [in the mosaicml/examples repo here](https://github.com/mosaicml/examples/tree/main/examples/benchmarks/bert/src/bert_layers.py#L425).
 
 3. **Unpadding**: Standard NLP practice is to combine text sequences of different lengths into a batch, and pad the sequences with empty
 tokens so that all sequence lengths are the same. During training, however, this can lead to many superfluous operations on those
 padding tokens. In MosaicBERT, we take a different approach: we concatenate all the examples in a minibatch into a single sequence
 of batch size 1. Results from NVIDIA and others have shown that this approach leads to speed improvements during training, since
 operations are not performed on padding tokens (see for example [Zeng et al. 2022](https://arxiv.org/pdf/2208.08124.pdf)).
-Details on our “unpadding” implementation can be found [in the mosaicml/examples repo here](https://github.com/mosaicml/examples/
+Details on our “unpadding” implementation can be found [in the mosaicml/examples repo here](https://github.com/mosaicml/examples/tree/main/examples/benchmarks/bert/src/bert_padding.py).
 
 4. **Low Precision LayerNorm**: this small tweak forces LayerNorm modules to run in float16 or bfloat16 precision instead of float32, improving utilization.
 Our implementation can be found [in the mosaicml/examples repo here](https://docs.mosaicml.com/en/v0.12.1/method_cards/low_precision_layernorm.html).
@@ -150,7 +150,7 @@ Full configuration details for pretraining MosaicBERT-Base can be found in the c
 
 ## Evaluation results
 
-When fine-tuned on downstream tasks (following the [finetuning details here](https://github.com/mosaicml/examples/
+When fine-tuned on downstream tasks (following the [finetuning details here](https://github.com/mosaicml/examples/tree/main/examples/benchmarks/bert/yamls/finetuning/glue/mosaic-bert-base-uncased.yaml)), the MosaicBERT model achieves the following GLUE results:
 
 | Task | MNLI-(m/mm) | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE | Average |
 |:----:|:-----------:|:----:|:----:|:-----:|:----:|:-----:|:----:|:----:|:-------:|
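
The ALiBi passage in the diff above replaces position embeddings with a distance-dependent bias added to the attention scores before the softmax. The snippet below is a minimal sketch of that idea, not the MosaicBERT code linked in the diff; the helper `alibi_bias` and its slope schedule are illustrative, following Press et al. 2021 (the schedule is exact for power-of-two head counts):

```python
import math
import torch

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    """Build a (num_heads, seq_len, seq_len) additive attention bias whose
    magnitude grows with token distance, with a different slope per head."""
    # Geometric slope schedule from Press et al. 2021.
    start = 2 ** (-8.0 / num_heads)
    slopes = torch.tensor([start ** (i + 1) for i in range(num_heads)])
    positions = torch.arange(seq_len)
    # Symmetric |i - j| distances, as suits bidirectional (encoder) attention.
    distances = (positions[None, :] - positions[:, None]).abs()
    # More distant key -> more negative bias -> weaker attention after softmax.
    return -slopes[:, None, None] * distances[None, :, :]

# Usage: add the bias to the scaled attention scores before the softmax.
batch, heads, seq, head_dim = 2, 8, 128, 64
scores = torch.randn(batch, heads, seq, seq) / math.sqrt(head_dim)  # stand-in for Q @ K^T
attn = torch.softmax(scores + alibi_bias(heads, seq), dim=-1)
print(attn.shape)  # torch.Size([2, 8, 128, 128])
```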
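The unpadding passage concatenates every example in a minibatch into a single batch-size-1 sequence so no compute is spent on pad tokens. A minimal sketch of the bookkeeping involved, assuming a pad token id of 0; the real logic lives in the linked `bert_padding.py`, and `unpad` / `cu_seqlens` are illustrative names:

```python
import torch

def unpad(input_ids: torch.Tensor, attention_mask: torch.Tensor):
    """input_ids, attention_mask: (batch, seq_len). Returns the non-pad tokens
    flattened to shape (total_tokens,) plus cumulative sequence lengths."""
    seqlens = attention_mask.sum(dim=1)            # real tokens per example
    flat_ids = input_ids[attention_mask.bool()]    # drop padding positions
    # Offsets marking where each example starts/ends in the flat sequence,
    # so attention can still be restricted to within-example tokens.
    cu_seqlens = torch.cat([torch.zeros(1, dtype=torch.long), seqlens.cumsum(0)])
    return flat_ids, cu_seqlens

ids = torch.tensor([[101, 7592, 102, 0, 0],
                    [101, 2129, 2024, 2017, 102]])
mask = (ids != 0).long()
flat, cu = unpad(ids, mask)
print(flat.shape, cu)  # torch.Size([8]) tensor([0, 3, 8])
```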
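The Low Precision LayerNorm item casts LayerNorm math to float16/bfloat16 rather than float32. In MosaicBERT this is applied through the Composer method card linked in the diff; the standalone module below only mimics the effect and assumes bfloat16 is the target precision on GPU:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowPrecisionLayerNorm(nn.LayerNorm):
    """LayerNorm that downcasts inputs and parameters before normalizing."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Keep float32 on CPU so the example runs anywhere; use bfloat16 on GPU.
        dtype = torch.bfloat16 if x.is_cuda else torch.float32
        return F.layer_norm(
            x.to(dtype),
            self.normalized_shape,
            self.weight.to(dtype),
            self.bias.to(dtype),
            self.eps,
        )

ln = LowPrecisionLayerNorm(768)
out = ln(torch.randn(4, 128, 768))
print(out.dtype)  # float32 on CPU; bfloat16 on a CUDA device
```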