Monarch Mixer-BERT

An 80M checkpoint of M2-BERT, pretrained with sequence length 32768. This is a BERT-style model that has not been fine-tuned. We recommend fine-tuning it for specific use cases before using it.

Check out the paper Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture and our blog post on retrieval for more on how we trained this model for long sequence.

This model was trained by Jon Saad-Falcon, Dan Fu, and Simran Arora.

Check out our GitHub for instructions on how to download and fine-tune it!

How to use

You can load this model using Hugging Face AutoModel:

from transformers import AutoModelForMaskedLM
model = AutoModelForMaskedLM.from_pretrained(
  "togethercomputer/m2-bert-80M-32k-retrieval",
  trust_remote_code=True
)

You should expect to see a large error message about unused parameters for FlashFFTConv. If you'd like to load the model with FlashFFTConv, you can check out our GitHub.

Acknowledgments

Alycia Lee helped with AutoModel support.

Citation

If you use this model, or otherwise found our work valuable, you can cite us as follows:

@inproceedings{fu2023monarch,
  title={Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture},
  author={Fu, Daniel Y and Arora, Simran and Grogan, Jessica and Johnson, Isys and Eyuboglu, Sabri and Thomas, Armin W and Spector, Benjamin and Poli, Michael and Rudra, Atri and R{\'e}, Christopher},
  booktitle={Advances in Neural Information Processing Systems},
  year={2023}
}