---
language: en
tags:
- multiberts
- multiberts-seed_1
license: mit
datasets:
- wikimedia/wikipedia
- bookcorpus/bookcorpus
base_model:
- google/multiberts-seed_1-step_0k
library_name: transformers
---
# EarlyBERTs
**Random Seed** 1 | **Steps** 10 – 40,000
🐤 **EarlyBERTs** reproduces the [MultiBERTs](http://goo.gle/multiberts) ([Sellam et al., 2022](https://openreview.net/forum?id=K0E_F0gFDgA)) and introduces more granular checkpoints covering the initial, critical phases of learning. In "The Subspace Chronicles" ([Müller-Eberstein et al., 2023](https://mxij.me/x/subspace-chronicles)), we leverage these checkpoints to study early learning dynamics in language models.
This suite builds on MultiBERTs and the underlying BERT architecture, covering seeds 0–4, for which intermediate checkpoints were originally released. For each seed, we provide 31 additional checkpoints at steps 10, 100, 200, ..., 1,000, 2,000, ..., 20,000, and 40,000, each stored as its own model revision (e.g., `revision=step11000`).
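The full set of revision names can be enumerated programmatically. The sketch below assumes the step schedule implied above (10; then 100 to 1,000 in increments of 100; then 2,000 to 20,000 in increments of 1,000; then 40,000), which is the reading consistent with the stated count of 31 checkpoints per seed:

```python
# Enumerate the checkpoint steps: 10, then 100-1,000 in steps of 100,
# then 2,000-20,000 in steps of 1,000, and finally 40,000.
steps = [10] + list(range(100, 1001, 100)) + list(range(2000, 20001, 1000)) + [40000]
revisions = [f"step{step}" for step in steps]

assert len(revisions) == 31  # matches the 31 checkpoints per seed
print(revisions[0], revisions[-1])  # step10 step40000
```

Each entry in `revisions` can be passed directly as the `revision` argument when loading a model, as shown in the Usage section.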
## Model Details
**Model Developers**
[Max Müller-Eberstein](https://mxij.me) as part of the [NLPnorth research unit](https://nlpnorth.github.io) at the [IT University of Copenhagen](https://itu.dk), Denmark.
**Variations**
EarlyBERTs cover seeds 0–4 (in respective repositories) and steps 10–40,000 (in respective model revision branches).
**Input**
Text only.
**Output**
Text and/or embeddings of the input.
Additionally, the CLS classification head is trained on next-sentence prediction, as in [Devlin et al. (2019)](https://aclanthology.org/N19-1423/).
**Model Architecture**
EarlyBERTs are based on the original BERT architecture [(Devlin et al., 2019)](https://aclanthology.org/N19-1423/) and load the respective MultiBERTs seed at step 0 as initialization.
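For reference, MultiBERTs (and hence EarlyBERTs) use the BERT-base configuration. A minimal sketch, assuming the default `transformers` `BertConfig`, which corresponds to BERT-base:

```python
from transformers import BertConfig

# The default BertConfig matches BERT-base:
# 12 layers, hidden size 768, 12 attention heads (~110M parameters).
config = BertConfig()
print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)
```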
**Research Paper**
Subspace Chronicles: How Linguistic Information Emerges, Shifts and Interacts during Language Model Training ([Müller-Eberstein et al., 2023](https://mxij.me/x/subspace-chronicles)).
## Training
**Data**
As neither the original BERT nor the MultiBERTs pre-training data are publicly available, we gather a corresponding corpus from fully public versions of the [English Wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) and [BookCorpus](https://huggingface.co/datasets/bookcorpus/bookcorpus). Scripts to re-create the exact data ordering, sentence pairing, and subword masking can be found in [the project repository](http://mxij.me/x/emnlp-2023-code).
**Hyperparameters**
We replicate the exact training hyperparameters of MultiBERTs and document them in [our research paper](https://mxij.me/x/subspace-chronicles). Code to reproduce our training procedure can be found in [the project repository](http://mxij.me/x/emnlp-2023-code).
## Usage
Loading the intermediate checkpoint for a specific seed and step follows the standard Hugging Face `transformers` API:
```python
from transformers import AutoTokenizer, AutoModel

# Choose a seed (0-4) and a pre-training step; each step is a repository revision.
seed, step = 1, 7000
tokenizer = AutoTokenizer.from_pretrained(f'personads/earlyberts-seed{seed}')
model = AutoModel.from_pretrained(f'personads/earlyberts-seed{seed}', revision=f'step{step}')
```
## Citation
If you find these models useful, please cite our work as well as the original MultiBERTs paper:
```bibtex
@inproceedings{muller-eberstein-etal-2023-subspace,
title = "Subspace Chronicles: How Linguistic Information Emerges, Shifts and Interacts during Language Model Training",
author = {M{\"u}ller-Eberstein, Max and
van der Goot, Rob and
Plank, Barbara and
Titov, Ivan},
editor = "Bouamor, Houda and
Pino, Juan and
Bali, Kalika",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023",
month = dec,
year = "2023",
address = "Singapore",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.findings-emnlp.879",
doi = "10.18653/v1/2023.findings-emnlp.879",
pages = "13190--13208"
}
```
```bibtex
@inproceedings{
sellam2022the,
title={The Multi{BERT}s: {BERT} Reproductions for Robustness Analysis},
author={Thibault Sellam and Steve Yadlowsky and Ian Tenney and Jason Wei and Naomi Saphra and Alexander D'Amour and Tal Linzen and Jasmijn Bastings and Iulia Raluca Turc and Jacob Eisenstein and Dipanjan Das and Ellie Pavlick},
booktitle={International Conference on Learning Representations},
year={2022},
url={https://openreview.net/forum?id=K0E_F0gFDgA}
}
```