jacobfulano committed
Commit: 785d4c8
Parent(s): ce11e47

Update README
- [x] mention mosaicbert.github.io
- [x] change citation
- [x] change code snippet to make things clearer
- [x] explain how to use alibi
README.md
CHANGED
@@ -9,46 +9,57 @@ inference: false

# MosaicBERT-Base model

-MosaicBERT-Base is a
+MosaicBERT-Base is a custom BERT architecture and training recipe optimized for fast pretraining.
MosaicBERT trains faster and achieves higher pretraining and finetuning accuracy when benchmarked against
Hugging Face's [bert-base-uncased](https://huggingface.co/bert-base-uncased).

+This study motivated many of the architecture choices around MosaicML's [MPT-7B](https://huggingface.co/mosaicml/mpt-7b) and [MPT-30B](https://huggingface.co/mosaicml/mpt-30b) models.
+
## Model Date

March 2023

## Documentation

-* [
+* [Project Page (mosaicbert.github.io)](https://mosaicbert.github.io)
* [Github (mosaicml/examples/tree/main/examples/benchmarks/bert)](https://github.com/mosaicml/examples/tree/main/examples/benchmarks/bert)
+* [Paper (NeurIPS 2023)](https://openreview.net/forum?id=5zipcfLC2Z)
+* Colab Tutorials:
+  * [MosaicBERT Tutorial Part 1: Load Pretrained Weights and Experiment with Sequence Length Extrapolation Using ALiBi](https://colab.research.google.com/drive/1r0A3QEbu4Nzs2Jl6LaiNoW5EumIVqrGc?usp=sharing)
+* [Blog Post (March 2023)](https://www.mosaicml.com/blog/mosaicbert)

## How to use

```python
-
-
-
+import torch
+import transformers
+from transformers import AutoModelForMaskedLM, BertTokenizer, pipeline
+from transformers import BertTokenizer, BertConfig

-
+tokenizer = BertTokenizer.from_pretrained('bert-base-uncased') # MosaicBERT uses the standard BERT tokenizer

-
-
-tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
-```
+config = transformers.BertConfig.from_pretrained('mosaicml/mosaic-bert-base') # the config needs to be passed in
+mosaicbert = AutoModelForMaskedLM.from_pretrained('mosaicml/mosaic-bert-base',config=config,trust_remote_code=True)

-To use this model directly for masked language modeling
+# To use this model directly for masked language modeling
+mosaicbert_classifier = pipeline('fill-mask', model=mosaicbert, tokenizer=tokenizer,device="cpu")
+mosaicbert_classifier("I [MASK] to the store yesterday.")
+```

-
-from transformers import AutoModelForMaskedLM, BertTokenizer, pipeline
+Note that the tokenizer for this model is simply the Hugging Face `bert-base-uncased` tokenizer.

-
-
+In order to take advantage of ALiBi by extrapolating to longer sequence lengths, simply change the `alibi_starting_size` flag in the
+config file and reload the model.

-
+```python
+config = transformers.BertConfig.from_pretrained('mosaicml/mosaic-bert-base')
+config.alibi_starting_size = 1024 # maximum sequence length updated to 1024

-
+mosaicbert = AutoModelForMaskedLM.from_pretrained('mosaicml/mosaic-bert-base',config=config,trust_remote_code=True)
```

+This simply presets the non-learned linear bias matrix in every attention block to 1024 tokens (note that this particular model was trained with a sequence length of 128 tokens).
+
**To continue MLM pretraining**, follow the [MLM pre-training section of the mosaicml/examples/benchmarks/bert repo](https://github.com/mosaicml/examples/tree/main/examples/benchmarks/bert#pre-training).

**To fine-tune this model for classification**, follow the [Single-task fine-tuning section of the mosaicml/examples/benchmarks/bert repo](https://github.com/mosaicml/examples/tree/main/examples/benchmarks/bert#fine-tuning).
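Not part of the diff: a minimal sketch of what the ALiBi change in the hunk above enables, namely running the same fill-mask pipeline on an input longer than the 128-token pretraining length. The filler text is invented for illustration, and the sketch assumes the model's remote code extrapolates once `alibi_starting_size` is raised.

```python
# Hedged sketch (not from the commit): exercise sequence-length extrapolation
# after presetting the ALiBi bias matrix to 1024 positions.
import transformers
from transformers import AutoModelForMaskedLM, BertTokenizer, pipeline

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

config = transformers.BertConfig.from_pretrained('mosaicml/mosaic-bert-base')
config.alibi_starting_size = 1024  # preset the linear attention bias to 1024 positions

mosaicbert = AutoModelForMaskedLM.from_pretrained(
    'mosaicml/mosaic-bert-base', config=config, trust_remote_code=True
)
mosaicbert_classifier = pipeline('fill-mask', model=mosaicbert, tokenizer=tokenizer, device="cpu")

# Roughly 300 tokens of made-up filler context, well past the 128-token pretraining
# length, followed by the masked sentence used in the model card.
long_context = "The quick brown fox jumps over the lazy dog. " * 30
mosaicbert_classifier(long_context + "I [MASK] to the store yesterday.")
```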
@@ -58,7 +69,7 @@ classifier("I [MASK] to the store yesterday.")
This model requires that `trust_remote_code=True` be passed to the `from_pretrained` method. This is because we train using [FlashAttention (Dao et al. 2022)](https://arxiv.org/pdf/2205.14135.pdf), which is not part of the `transformers` library and depends on [Triton](https://github.com/openai/triton) and some custom PyTorch code. Since this involves executing arbitrary code, you should consider passing a git `revision` argument that specifies the exact commit of the code, for example:

```python
-
+mosaicbert = AutoModelForMaskedLM.from_pretrained(
    'mosaicml/mosaic-bert-base',
    trust_remote_code=True,
    revision='24512df',
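Not part of the diff: the pinned-revision call above is cut off at the hunk boundary. A hedged sketch of the complete call, where only the repository id, `trust_remote_code`, and `revision` values come from the hunk and the rest is assumed:

```python
from transformers import AutoModelForMaskedLM

# Hedged sketch: pin the remote modeling code to an exact commit when loading.
# The import and the closing parenthesis are assumed, not part of the hunk.
mosaicbert = AutoModelForMaskedLM.from_pretrained(
    'mosaicml/mosaic-bert-base',
    trust_remote_code=True,
    revision='24512df',
)
```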
@@ -182,12 +193,11 @@ This model is intended to be finetuned on downstream tasks.
Please cite this model using the following format:

```
-@
-
-
-
-
-
-urldate = {2023-03-28} % change this date
+@article{portes2023MosaicBERT,
+  title={MosaicBERT: A Bidirectional Encoder Optimized for Fast Pretraining},
+  author={Jacob Portes and Alexander R Trott and Sam Havens and Daniel King and Abhinav Venigalla and
+          Moin Nadeem and Nikhil Sardana and Daya Khudia and Jonathan Frankle},
+  journal={NeurIPS https://openreview.net/pdf?id=5zipcfLC2Z},
+  year={2023},
}
```