jacobfulano committed
Commit 64bd935
Parent: 5acd316

Update README.md


Update hyperlinks to mosaicml/examples repo

Files changed (1)
  1. README.md +4 -4
README.md CHANGED
@@ -20,7 +20,7 @@ March 2023
 ## Documentation
 
 * [Blog post](https://www.mosaicml.com/blog/mosaicbert)
-* [Github (mosaicml/examples/bert repo)](https://github.com/mosaicml/examples/tree/aab5ef7315715509cff9e08e862d41b3cbac83ad/examples/benchmarks/bert)
+* [Github (mosaicml/examples/tree/main/examples/benchmarks/bert)](https://github.com/mosaicml/examples/tree/main/examples/benchmarks/bert)
 
 ## How to use
 
@@ -84,14 +84,14 @@ reduces the number of read/write operations between the GPU HBM (high bandwidth
 this embedding allows subsequent layers to keep track of the order of tokens in a sequence. ALiBi eliminates position embeddings and
 instead conveys this information using a bias matrix in the attention operation. It modifies the attention mechanism such that nearby
 tokens strongly attend to one another [[Press et al. 2021]](https://arxiv.org/abs/2108.12409). In addition to improving the performance of the final model, ALiBi helps the
-model to handle sequences longer than it saw during training. Details on our ALiBi implementation can be found [in the mosaicml/examples repo here](https://github.com/mosaicml/examples/blob/d14a7c94a0f805f56a7c865802082bf6d8ac8903/examples/bert/src/bert_layers.py#L425).
+model to handle sequences longer than it saw during training. Details on our ALiBi implementation can be found [in the mosaicml/examples repo here](https://github.com/mosaicml/examples/tree/main/examples/benchmarks/bert/src/bert_layers.py#L425).
 
 3. **Unpadding**: Standard NLP practice is to combine text sequences of different lengths into a batch, and pad the sequences with empty
 tokens so that all sequence lengths are the same. During training, however, this can lead to many superfluous operations on those
 padding tokens. In MosaicBERT, we take a different approach: we concatenate all the examples in a minibatch into a single sequence
 of batch size 1. Results from NVIDIA and others have shown that this approach leads to speed improvements during training, since
 operations are not performed on padding tokens (see for example [Zeng et al. 2022](https://arxiv.org/pdf/2208.08124.pdf)).
-Details on our “unpadding” implementation can be found [in the mosaicml/examples repo here](https://github.com/mosaicml/examples/blob/main/examples/bert/src/bert_padding.py).
+Details on our “unpadding” implementation can be found [in the mosaicml/examples repo here](https://github.com/mosaicml/examples/tree/main/examples/benchmarks/bert/src/bert_padding.py).
 
 4. **Low Precision LayerNorm**: this small tweak forces LayerNorm modules to run in float16 or bfloat16 precision instead of float32, improving utilization.
 Our implementation can be found [in the mosaicml/examples repo here](https://docs.mosaicml.com/en/v0.12.1/method_cards/low_precision_layernorm.html).
@@ -150,7 +150,7 @@ Full configuration details for pretraining MosaicBERT-Base can be found in the c
 
 ## Evaluation results
 
-When fine-tuned on downstream tasks (following the [finetuning details here](https://github.com/mosaicml/examples/blob/main/examples/bert/yamls/finetuning/glue/mosaic-bert-base-uncased.yaml)), the MosaicBERT model achieves the following GLUE results:
+When fine-tuned on downstream tasks (following the [finetuning details here](https://github.com/mosaicml/examples/tree/main/examples/benchmarks/bert/yamls/finetuning/glue/mosaic-bert-base-uncased.yaml)), the MosaicBERT model achieves the following GLUE results:
 
 | Task | MNLI-(m/mm) | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE | Average |
 |:----:|:-----------:|:----:|:----:|:-----:|:----:|:-----:|:----:|:----:|:-------:|
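
The ALiBi paragraph shown in the second hunk above can be made concrete with a small sketch of the bias it describes. The snippet below is only an illustration of the idea from Press et al. 2021 (symmetric token distances and geometric per-head slopes); it is not the `bert_layers.py` implementation that the updated link points to, and the function name `alibi_bias` is ours.

```python
# Minimal sketch of an ALiBi-style attention bias (after Press et al. 2021).
# Illustrative only; not the implementation in bert_layers.py linked above.
import math
import torch

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    """Return a (num_heads, seq_len, seq_len) bias that penalizes distant tokens."""
    # Geometric per-head slopes, e.g. 1/2, 1/4, ..., 1/256 for 8 heads.
    slopes = torch.tensor([2 ** (-8 * (h + 1) / num_heads) for h in range(num_heads)])
    # Relative distance between query position i and key position j.
    pos = torch.arange(seq_len)
    distance = (pos[None, :] - pos[:, None]).abs()          # (seq_len, seq_len)
    # Larger distance -> more negative bias -> weaker attention after softmax,
    # so nearby tokens attend to one another more strongly.
    return -slopes[:, None, None] * distance[None, :, :]

# The bias is simply added to the raw attention scores before the softmax;
# no position embeddings are needed.
batch, heads, seq_len, head_dim = 2, 8, 16, 32
q = torch.randn(batch, heads, seq_len, head_dim)
k = torch.randn(batch, heads, seq_len, head_dim)
scores = q @ k.transpose(-1, -2) / math.sqrt(head_dim)
scores = scores + alibi_bias(heads, seq_len)
attn = scores.softmax(dim=-1)
```

Because the bias depends only on token distance, the same formula extends to sequence lengths longer than those seen during training, which is the extrapolation property the README describes.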
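
Likewise, the "unpadding" paragraph can be illustrated with a minimal packing/unpacking sketch: drop the padding tokens and concatenate every real token in the batch into one flat tensor. The helpers below (`unpad`, `repad`) are hypothetical names used only to show the bookkeeping; the repo's actual utilities live in the `bert_padding.py` file linked in the diff.

```python
# Illustrative sketch of unpadding: pack all real tokens in a batch into a
# single (total_tokens, hidden) tensor so no compute is spent on padding.
import torch

def unpad(hidden_states: torch.Tensor, attention_mask: torch.Tensor):
    """hidden_states: (batch, seq_len, hidden); attention_mask: (batch, seq_len) of 0/1."""
    batch, seq_len, hidden = hidden_states.shape
    mask = attention_mask.bool().flatten()                   # (batch * seq_len,)
    indices = torch.nonzero(mask, as_tuple=False).flatten()  # positions of real tokens
    packed = hidden_states.reshape(batch * seq_len, hidden)[indices]
    return packed, indices

def repad(packed: torch.Tensor, indices: torch.Tensor, batch: int, seq_len: int):
    """Scatter the packed tokens back into a padded (batch, seq_len, hidden) tensor."""
    hidden = packed.shape[-1]
    out = torch.zeros(batch * seq_len, hidden, dtype=packed.dtype, device=packed.device)
    out[indices] = packed
    return out.reshape(batch, seq_len, hidden)

# Example: two sequences of lengths 3 and 5, padded to length 5.
x = torch.randn(2, 5, 4)
mask = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
packed, idx = unpad(x, mask)        # packed.shape == (8, 4): only the 8 real tokens
restored = repad(packed, idx, 2, 5)
```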
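
Finally, the low-precision LayerNorm tweak amounts to running each LayerNorm in bfloat16 (or float16) and casting the result back. The sketch below is a rough approximation of the behavior the linked method card describes, not Composer's actual module surgery; the class and the `apply_low_precision_layernorm` helper are hypothetical.

```python
# Rough sketch of low-precision LayerNorm: compute the normalization in bfloat16
# regardless of the input dtype, then cast back for downstream modules.
import torch
import torch.nn.functional as F

class LowPrecisionLayerNorm(torch.nn.LayerNorm):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight.to(torch.bfloat16) if self.weight is not None else None
        b = self.bias.to(torch.bfloat16) if self.bias is not None else None
        out = F.layer_norm(x.to(torch.bfloat16), self.normalized_shape, w, b, self.eps)
        return out.to(x.dtype)  # cast back so the rest of the model is unaffected

def apply_low_precision_layernorm(model: torch.nn.Module) -> torch.nn.Module:
    """Recursively replace every nn.LayerNorm with the low-precision variant."""
    for name, module in model.named_children():
        if isinstance(module, torch.nn.LayerNorm) and not isinstance(module, LowPrecisionLayerNorm):
            ln = LowPrecisionLayerNorm(module.normalized_shape, eps=module.eps,
                                       elementwise_affine=module.elementwise_affine)
            ln.load_state_dict(module.state_dict())
            setattr(model, name, ln)
        else:
            apply_low_precision_layernorm(module)
    return model
```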