jacobfulano committed
Commit 64bd935
Parent: 5acd316

Update README.md


Update hyperlinks to mosaicml/examples repo

Files changed (1)
  1. README.md +4 -4
README.md CHANGED
@@ -20,7 +20,7 @@ March 2023
 ## Documentation
 
 * [Blog post](https://www.mosaicml.com/blog/mosaicbert)
-* [Github (mosaicml/examples/bert repo)](https://github.com/mosaicml/examples/tree/aab5ef7315715509cff9e08e862d41b3cbac83ad/examples/benchmarks/bert)
+* [Github (mosaicml/examples/tree/main/examples/benchmarks/bert)](https://github.com/mosaicml/examples/tree/main/examples/benchmarks/bert)
 
 ## How to use
 
@@ -84,14 +84,14 @@ reduces the number of read/write operations between the GPU HBM (high bandwidth
 this embedding allows subsequent layers to keep track of the order of tokens in a sequence. ALiBi eliminates position embeddings and
 instead conveys this information using a bias matrix in the attention operation. It modifies the attention mechanism such that nearby
 tokens strongly attend to one another [[Press et al. 2021]](https://arxiv.org/abs/2108.12409). In addition to improving the performance of the final model, ALiBi helps the
-model to handle sequences longer than it saw during training. Details on our ALiBi implementation can be found [in the mosaicml/examples repo here](https://github.com/mosaicml/examples/blob/d14a7c94a0f805f56a7c865802082bf6d8ac8903/examples/bert/src/bert_layers.py#L425).
+model to handle sequences longer than it saw during training. Details on our ALiBi implementation can be found [in the mosaicml/examples repo here](https://github.com/mosaicml/examples/tree/main/examples/benchmarks/bert/src/bert_layers.py#L425).
 
 3. **Unpadding**: Standard NLP practice is to combine text sequences of different lengths into a batch, and pad the sequences with empty
 tokens so that all sequence lengths are the same. During training, however, this can lead to many superfluous operations on those
 padding tokens. In MosaicBERT, we take a different approach: we concatenate all the examples in a minibatch into a single sequence
 of batch size 1. Results from NVIDIA and others have shown that this approach leads to speed improvements during training, since
 operations are not performed on padding tokens (see for example [Zeng et al. 2022](https://arxiv.org/pdf/2208.08124.pdf)).
-Details on our “unpadding” implementation can be found [in the mosaicml/examples repo here](https://github.com/mosaicml/examples/blob/main/examples/bert/src/bert_padding.py).
+Details on our “unpadding” implementation can be found [in the mosaicml/examples repo here](https://github.com/mosaicml/examples/tree/main/examples/benchmarks/bert/src/bert_padding.py).
 
 4. **Low Precision LayerNorm**: this small tweak forces LayerNorm modules to run in float16 or bfloat16 precision instead of float32, improving utilization.
 Our implementation can be found [in the mosaicml/examples repo here](https://docs.mosaicml.com/en/v0.12.1/method_cards/low_precision_layernorm.html).
@@ -150,7 +150,7 @@ Full configuration details for pretraining MosaicBERT-Base can be found in the c
 
 ## Evaluation results
 
-When fine-tuned on downstream tasks (following the [finetuning details here](https://github.com/mosaicml/examples/blob/main/examples/bert/yamls/finetuning/glue/mosaic-bert-base-uncased.yaml)), the MosaicBERT model achieves the following GLUE results:
+When fine-tuned on downstream tasks (following the [finetuning details here](https://github.com/mosaicml/examples/tree/main/examples/benchmarks/bert/yamls/finetuning/glue/mosaic-bert-base-uncased.yaml)), the MosaicBERT model achieves the following GLUE results:
 
 | Task | MNLI-(m/mm) | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE | Average |
 |:----:|:-----------:|:----:|:----:|:-----:|:----:|:-----:|:----:|:----:|:-------:|
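
The ALiBi paragraph shown in the second hunk above can be made concrete with a small sketch of the bias it describes. The snippet below is only an illustration of the idea from Press et al. 2021 (symmetric token distances and geometric per-head slopes); it is not the `bert_layers.py` implementation that the updated link points to, and the function name `alibi_bias` is ours.

```python
# Minimal sketch of an ALiBi-style attention bias (after Press et al. 2021).
# Illustrative only; not the implementation in bert_layers.py linked above.
import math
import torch

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    """Return a (num_heads, seq_len, seq_len) bias that penalizes distant tokens."""
    # Geometric per-head slopes, e.g. 1/2, 1/4, ..., 1/256 for 8 heads.
    slopes = torch.tensor([2 ** (-8 * (h + 1) / num_heads) for h in range(num_heads)])
    # Relative distance between query position i and key position j.
    pos = torch.arange(seq_len)
    distance = (pos[None, :] - pos[:, None]).abs()          # (seq_len, seq_len)
    # Larger distance -> more negative bias -> weaker attention after softmax,
    # so nearby tokens attend to one another more strongly.
    return -slopes[:, None, None] * distance[None, :, :]

# The bias is simply added to the raw attention scores before the softmax;
# no position embeddings are needed.
batch, heads, seq_len, head_dim = 2, 8, 16, 32
q = torch.randn(batch, heads, seq_len, head_dim)
k = torch.randn(batch, heads, seq_len, head_dim)
scores = q @ k.transpose(-1, -2) / math.sqrt(head_dim)
scores = scores + alibi_bias(heads, seq_len)
attn = scores.softmax(dim=-1)
```

Because the bias depends only on token distance, the same formula extends to sequence lengths longer than those seen during training, which is the extrapolation property the README describes.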
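
Likewise, the "unpadding" paragraph can be illustrated with a minimal packing/unpacking sketch: drop the padding tokens and concatenate every real token in the batch into one flat tensor. The helpers below (`unpad`, `repad`) are hypothetical names used only to show the bookkeeping; the repo's actual utilities live in the `bert_padding.py` file linked in the diff.

```python
# Illustrative sketch of unpadding: pack all real tokens in a batch into a
# single (total_tokens, hidden) tensor so no compute is spent on padding.
import torch

def unpad(hidden_states: torch.Tensor, attention_mask: torch.Tensor):
    """hidden_states: (batch, seq_len, hidden); attention_mask: (batch, seq_len) of 0/1."""
    batch, seq_len, hidden = hidden_states.shape
    mask = attention_mask.bool().flatten()                   # (batch * seq_len,)
    indices = torch.nonzero(mask, as_tuple=False).flatten()  # positions of real tokens
    packed = hidden_states.reshape(batch * seq_len, hidden)[indices]
    return packed, indices

def repad(packed: torch.Tensor, indices: torch.Tensor, batch: int, seq_len: int):
    """Scatter the packed tokens back into a padded (batch, seq_len, hidden) tensor."""
    hidden = packed.shape[-1]
    out = torch.zeros(batch * seq_len, hidden, dtype=packed.dtype, device=packed.device)
    out[indices] = packed
    return out.reshape(batch, seq_len, hidden)

# Example: two sequences of lengths 3 and 5, padded to length 5.
x = torch.randn(2, 5, 4)
mask = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
packed, idx = unpad(x, mask)        # packed.shape == (8, 4): only the 8 real tokens
restored = repad(packed, idx, 2, 5)
```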
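
Finally, the low-precision LayerNorm tweak amounts to running each LayerNorm in bfloat16 (or float16) and casting the result back. The sketch below is a rough approximation of the behavior the linked method card describes, not Composer's actual module surgery; the class and the `apply_low_precision_layernorm` helper are hypothetical.

```python
# Rough sketch of low-precision LayerNorm: compute the normalization in bfloat16
# regardless of the input dtype, then cast back for downstream modules.
import torch
import torch.nn.functional as F

class LowPrecisionLayerNorm(torch.nn.LayerNorm):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight.to(torch.bfloat16) if self.weight is not None else None
        b = self.bias.to(torch.bfloat16) if self.bias is not None else None
        out = F.layer_norm(x.to(torch.bfloat16), self.normalized_shape, w, b, self.eps)
        return out.to(x.dtype)  # cast back so the rest of the model is unaffected

def apply_low_precision_layernorm(model: torch.nn.Module) -> torch.nn.Module:
    """Recursively replace every nn.LayerNorm with the low-precision variant."""
    for name, module in model.named_children():
        if isinstance(module, torch.nn.LayerNorm) and not isinstance(module, LowPrecisionLayerNorm):
            ln = LowPrecisionLayerNorm(module.normalized_shape, eps=module.eps,
                                       elementwise_affine=module.elementwise_affine)
            ln.load_state_dict(module.state_dict())
            setattr(model, name, ln)
        else:
            apply_low_precision_layernorm(module)
    return model
```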