jacobfulano committed
Commit c721a25 · Parent: 29c1999
Update README.md

README.md CHANGED

@@ -15,12 +15,40 @@ In order to build MosaicBERT, we adopted architectural choices from the recent t
These include [FlashAttention (Dao et al. 2022)](https://arxiv.org/pdf/2205.14135.pdf), [ALiBi (Press et al. 2021)](https://arxiv.org/abs/2108.12409), training in an unpadded manner, low precision LayerNorm, and [Gated Linear Units (Shazeer 2020)](https://arxiv.org/abs/2002.05202).

### Modifications to the Attention Mechanism

1. **FlashAttention**: Attention layers are core components of the transformer architecture. The recently proposed FlashAttention layer reduces the number of read/write operations between the GPU HBM (high bandwidth memory, i.e. long-term memory) and the GPU SRAM (i.e. short-term memory) [[Dao et al. 2022]](https://arxiv.org/pdf/2205.14135.pdf). We used the FlashAttention module built by [Hazy Research](https://github.com/HazyResearch/flash-attention) with [OpenAI’s Triton library](https://github.com/openai/triton).

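   The repo uses the Triton-based FlashAttention module linked above; purely as an illustration of what a fused attention kernel computes, here is a sketch using PyTorch's built-in `scaled_dot_product_attention` (PyTorch 2.0+), which can dispatch to a FlashAttention-style kernel on supported GPUs. The shapes and dtypes below are illustrative assumptions, not MosaicBERT's configuration.

   ```python
   import torch
   import torch.nn.functional as F

   # Illustrative stand-in for a fused attention kernel; MosaicBERT itself uses the
   # Triton module from HazyResearch/flash-attention rather than this PyTorch op.
   device = "cuda" if torch.cuda.is_available() else "cpu"
   dtype = torch.bfloat16 if device == "cuda" else torch.float32

   batch, heads, seq_len, head_dim = 8, 12, 128, 64
   q = torch.randn(batch, heads, seq_len, head_dim, device=device, dtype=dtype)
   k = torch.randn_like(q)
   v = torch.randn_like(q)

   # A single fused kernel computes softmax(q @ k^T / sqrt(head_dim)) @ v without
   # materializing the full (seq_len x seq_len) attention matrix in HBM.
   out = F.scaled_dot_product_attention(q, k, v, dropout_p=0.0, is_causal=False)
   print(out.shape)  # torch.Size([8, 12, 128, 64])
   ```
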
2. **Attention with Linear Biases (ALiBi)**: In most BERT models, the positions of tokens in a sequence are encoded with a position embedding layer; this embedding allows subsequent layers to keep track of the order of tokens in a sequence. ALiBi eliminates position embeddings and instead conveys this information using a bias matrix in the attention operation. It modifies the attention mechanism such that nearby tokens attend to one another more strongly than distant ones [[Press et al. 2021]](https://arxiv.org/abs/2108.12409). In addition to improving the performance of the final model, ALiBi helps the model handle sequences longer than those it saw during training. Details on our ALiBi implementation can be found [in the mosaicml/examples repo here](https://github.com/mosaicml/examples/blob/d14a7c94a0f805f56a7c865802082bf6d8ac8903/examples/bert/src/bert_layers.py#L425).

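   As a sketch of the idea (not the implementation linked above), the ALiBi bias is a fixed, distance-dependent penalty added to the attention logits; the head count, sequence length, and power-of-two slope schedule below are illustrative assumptions.

   ```python
   import torch

   def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
       """Build the (n_heads, seq_len, seq_len) bias added to attention logits."""
       # Per-head slopes 2^(-8/n), 2^(-16/n), ..., as in Press et al. 2021
       # (simplified here for head counts that are powers of two).
       slopes = torch.tensor([2.0 ** (-8.0 * (i + 1) / n_heads) for i in range(n_heads)])
       # Symmetric token distances, suitable for a bidirectional encoder like BERT.
       pos = torch.arange(seq_len)
       distance = (pos[None, :] - pos[:, None]).abs()        # (seq_len, seq_len)
       return -slopes[:, None, None] * distance[None, :, :]  # (n_heads, seq_len, seq_len)

   # Attention becomes softmax(q @ k^T / sqrt(d) + bias) @ v, with no position embeddings.
   scores = torch.randn(12, 128, 128)  # (n_heads, seq_len, seq_len) raw logits
   probs = (scores + alibi_bias(n_heads=12, seq_len=128)).softmax(dim=-1)
   ```
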
3. **Unpadding**: Standard NLP practice is to combine text sequences of different lengths into a batch, and pad the sequences with empty tokens so that all sequence lengths are the same. During training, however, this can lead to many superfluous operations on those padding tokens. In MosaicBERT, we take a different approach: we concatenate all the examples in a minibatch into a single sequence of batch size 1. Results from NVIDIA and others have shown that this approach leads to speed improvements during training, since operations are not performed on padding tokens (see for example [Zeng et al. 2022](https://arxiv.org/pdf/2208.08124.pdf)). Details on our “unpadding” implementation can be found [in the mosaicml/examples repo here](https://github.com/mosaicml/examples/blob/main/examples/bert/src/bert_padding.py).

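   A toy sketch of the unpadding idea (illustrative only; the real logic lives in `bert_padding.py` linked above): real tokens from a padded batch are packed into a single sequence, and cumulative sequence lengths record where each example begins and ends.

   ```python
   import torch

   # Toy batch: 3 sequences padded to length 6 (0 = padding token id, an assumption here).
   input_ids = torch.tensor([
       [101, 2009, 2003,  102,    0,   0],
       [101, 7592,  102,    0,    0,   0],
       [101, 2023, 2003, 1037,  102,   0],
   ])
   attention_mask = input_ids != 0

   # "Unpad": keep only real tokens, concatenated into one sequence of batch size 1.
   packed_ids = input_ids[attention_mask]  # shape: (total_real_tokens,)

   # Cumulative sequence lengths mark example boundaries so the attention kernel
   # never mixes tokens from different examples.
   seq_lens = attention_mask.sum(dim=1)    # tensor([4, 3, 5])
   cu_seqlens = torch.cat([torch.zeros(1, dtype=torch.long), seq_lens.cumsum(dim=0)])

   print(packed_ids.shape)  # torch.Size([12]) instead of 3 * 6 = 18 padded positions
   print(cu_seqlens)        # tensor([ 0,  4,  7, 12])
   ```
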
4. **Low Precision LayerNorm**: This small tweak forces LayerNorm modules to run in float16 or bfloat16 precision instead of float32, improving hardware utilization. Our implementation can be found [in the MosaicML Composer docs here](https://docs.mosaicml.com/en/v0.12.1/method_cards/low_precision_layernorm.html).

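   In spirit (a hand-rolled sketch, not Composer's implementation), low precision LayerNorm simply casts the inputs and parameters to the low-precision dtype before normalizing; `bfloat16` is an assumed training dtype here.

   ```python
   import torch
   import torch.nn.functional as F

   class LPLayerNorm(torch.nn.LayerNorm):
       """LayerNorm that runs in bfloat16 instead of float32."""

       def forward(self, x: torch.Tensor) -> torch.Tensor:
           dtype = torch.bfloat16  # assumed low-precision training dtype
           return F.layer_norm(
               x.to(dtype),
               self.normalized_shape,
               self.weight.to(dtype) if self.weight is not None else None,
               self.bias.to(dtype) if self.bias is not None else None,
               self.eps,
           )

   ln = LPLayerNorm(768)
   out = ln(torch.randn(8, 128, 768))
   print(out.dtype)  # torch.bfloat16
   ```
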
### Modifications to the Feedforward Layers

5. **Gated Linear Units (GLU)**: We used Gated Linear Units for the feedforward sublayer of a transformer. GLUs were first proposed in 2016 [[Dauphin et al. 2016]](https://arxiv.org/abs/1612.08083), and incorporate an extra learnable matrix that “gates” the outputs of the feedforward layer. More recent work has shown that GLUs can improve model quality in transformers [[Shazeer, 2020](https://arxiv.org/abs/2002.05202), [Narang et al. 2021](https://arxiv.org/pdf/2102.11972.pdf)]. We used the GeLU (Gaussian Error Linear Unit) activation function with GLU, which is sometimes referred to as GeGLU. The GeLU activation function is a smooth, fully differentiable approximation to ReLU; we found that this led to a nominal improvement over ReLU. More details on our implementation of GLU can be found here. The extra gating matrix in a GLU adds parameters to the model; we chose to accept these additional parameters in our BERT-Base configuration because they lead to a Pareto improvement across all training timescales (which is not true of all larger models, such as BERT-Large). While BERT-Base has 110 million parameters, MosaicBERT-Base has 137 million parameters. Note that MosaicBERT-Base trains faster than BERT-Base despite having more parameters.

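   A minimal GeGLU feedforward sketch (layer names and hidden sizes are illustrative, not the exact MosaicBERT configuration):

   ```python
   import torch
   import torch.nn as nn

   class GeGLUFeedForward(nn.Module):
       """Transformer feedforward sublayer with a GeLU-gated linear unit (GeGLU)."""

       def __init__(self, d_model: int = 768, d_ff: int = 3072):
           super().__init__()
           self.wi = nn.Linear(d_model, d_ff)    # standard up-projection
           self.gate = nn.Linear(d_model, d_ff)  # extra learnable gating matrix
           self.wo = nn.Linear(d_ff, d_model)    # down-projection back to d_model
           self.act = nn.GELU()

       def forward(self, x: torch.Tensor) -> torch.Tensor:
           # GeGLU: GeLU(x @ W_gate) elementwise-gates the linear projection x @ W_i.
           return self.wo(self.act(self.gate(x)) * self.wi(x))

   ffn = GeGLUFeedForward()
   out = ffn(torch.randn(8, 128, 768))
   print(out.shape)  # torch.Size([8, 128, 768])
   ```
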
# How to use