jacobfulano committed
Commit: 785d4c8
Parent(s): ce11e47

Update README
- [x] mention mosaicbert.github.io
- [x] change citation
- [x] change code snippet to make things clearer
- [x] explain how to use alibi
README.md
CHANGED
@@ -9,46 +9,57 @@ inference: false

# MosaicBERT-Base model

-MosaicBERT-Base is a
+MosaicBERT-Base is a custom BERT architecture and training recipe optimized for fast pretraining.
MosaicBERT trains faster and achieves higher pretraining and finetuning accuracy when benchmarked against
Hugging Face's [bert-base-uncased](https://huggingface.co/bert-base-uncased).

+This study motivated many of the architecture choices around MosaicML's [MPT-7B](https://huggingface.co/mosaicml/mpt-7b) and [MPT-30B](https://huggingface.co/mosaicml/mpt-30b) models.
+
## Model Date

March 2023

## Documentation

-* [
+* [Project Page (mosaicbert.github.io)](https://mosaicbert.github.io)
* [Github (mosaicml/examples/tree/main/examples/benchmarks/bert)](https://github.com/mosaicml/examples/tree/main/examples/benchmarks/bert)
+* [Paper (NeurIPS 2023)](https://openreview.net/forum?id=5zipcfLC2Z)
+* Colab Tutorials:
+  * [MosaicBERT Tutorial Part 1: Load Pretrained Weights and Experiment with Sequence Length Extrapolation Using ALiBi](https://colab.research.google.com/drive/1r0A3QEbu4Nzs2Jl6LaiNoW5EumIVqrGc?usp=sharing)
+* [Blog Post (March 2023)](https://www.mosaicml.com/blog/mosaicbert)

## How to use

```python
-
-
-
+import torch
+import transformers
+from transformers import AutoModelForMaskedLM, BertTokenizer, pipeline
+from transformers import BertTokenizer, BertConfig

-
+tokenizer = BertTokenizer.from_pretrained('bert-base-uncased') # MosaicBERT uses the standard BERT tokenizer

-
-
-tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
-```
+config = transformers.BertConfig.from_pretrained('mosaicml/mosaic-bert-base') # the config needs to be passed in
+mosaicbert = AutoModelForMaskedLM.from_pretrained('mosaicml/mosaic-bert-base',config=config,trust_remote_code=True)

-To use this model directly for masked language modeling
+# To use this model directly for masked language modeling
+mosaicbert_classifier = pipeline('fill-mask', model=mosaicbert, tokenizer=tokenizer,device="cpu")
+mosaicbert_classifier("I [MASK] to the store yesterday.")
+```

-
-from transformers import AutoModelForMaskedLM, BertTokenizer, pipeline
+Note that the tokenizer for this model is simply the Hugging Face `bert-base-uncased` tokenizer.

-
-
+In order to take advantage of ALiBi by extrapolating to longer sequence lengths, simply change the `alibi_starting_size` flag in the
+config file and reload the model.

-
+```python
+config = transformers.BertConfig.from_pretrained('mosaicml/mosaic-bert-base')
+config.alibi_starting_size = 1024 # maximum sequence length updated to 1024

-
+mosaicbert = AutoModelForMaskedLM.from_pretrained('mosaicml/mosaic-bert-base',config=config,trust_remote_code=True)
```

+This simply presets the non-learned linear bias matrix in every attention block to 1024 tokens (note that this particular model was trained with a sequence length of 128 tokens).
+
**To continue MLM pretraining**, follow the [MLM pre-training section of the mosaicml/examples/benchmarks/bert repo](https://github.com/mosaicml/examples/tree/main/examples/benchmarks/bert#pre-training).

**To fine-tune this model for classification**, follow the [Single-task fine-tuning section of the mosaicml/examples/benchmarks/bert repo](https://github.com/mosaicml/examples/tree/main/examples/benchmarks/bert#fine-tuning).
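Not part of the diff: a minimal sketch of what the ALiBi change in the hunk above enables, namely running the same fill-mask pipeline on an input longer than the 128-token pretraining length. The filler text is invented for illustration, and the sketch assumes the model's remote code extrapolates once `alibi_starting_size` is raised.

```python
# Hedged sketch (not from the commit): exercise sequence-length extrapolation
# after presetting the ALiBi bias matrix to 1024 positions.
import transformers
from transformers import AutoModelForMaskedLM, BertTokenizer, pipeline

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

config = transformers.BertConfig.from_pretrained('mosaicml/mosaic-bert-base')
config.alibi_starting_size = 1024  # preset the linear attention bias to 1024 positions

mosaicbert = AutoModelForMaskedLM.from_pretrained(
    'mosaicml/mosaic-bert-base', config=config, trust_remote_code=True
)
mosaicbert_classifier = pipeline('fill-mask', model=mosaicbert, tokenizer=tokenizer, device="cpu")

# Roughly 300 tokens of made-up filler context, well past the 128-token pretraining
# length, followed by the masked sentence used in the model card.
long_context = "The quick brown fox jumps over the lazy dog. " * 30
mosaicbert_classifier(long_context + "I [MASK] to the store yesterday.")
```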
@@ -58,7 +69,7 @@ classifier("I [MASK] to the store yesterday.")
This model requires that `trust_remote_code=True` be passed to the `from_pretrained` method. This is because we train using [FlashAttention (Dao et al. 2022)](https://arxiv.org/pdf/2205.14135.pdf), which is not part of the `transformers` library and depends on [Triton](https://github.com/openai/triton) and some custom PyTorch code. Since this involves executing arbitrary code, you should consider passing a git `revision` argument that specifies the exact commit of the code, for example:

```python
-
+mosaicbert = AutoModelForMaskedLM.from_pretrained(
    'mosaicml/mosaic-bert-base',
    trust_remote_code=True,
    revision='24512df',
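Not part of the diff: the pinned-revision call above is cut off at the hunk boundary. A hedged sketch of the complete call, where only the repository id, `trust_remote_code`, and `revision` values come from the hunk and the rest is assumed:

```python
from transformers import AutoModelForMaskedLM

# Hedged sketch: pin the remote modeling code to an exact commit when loading.
# The import and the closing parenthesis are assumed, not part of the hunk.
mosaicbert = AutoModelForMaskedLM.from_pretrained(
    'mosaicml/mosaic-bert-base',
    trust_remote_code=True,
    revision='24512df',
)
```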
@@ -182,12 +193,11 @@ This model is intended to be finetuned on downstream tasks.
Please cite this model using the following format:

```
-@
-
-
-
-
-
-urldate = {2023-03-28} % change this date
+@article{portes2023MosaicBERT,
+  title={MosaicBERT: A Bidirectional Encoder Optimized for Fast Pretraining},
+  author={Jacob Portes and Alexander R Trott and Sam Havens and Daniel King and Abhinav Venigalla and
+          Moin Nadeem and Nikhil Sardana and Daya Khudia and Jonathan Frankle},
+  journal={NeurIPS https://openreview.net/pdf?id=5zipcfLC2Z},
+  year={2023},
}
```