---
license: apache-2.0
datasets:
- wikimedia/wikipedia
- bookcorpus
language:
- en
inference: false
---

# nomic-bert-2048: A 2048 Sequence Length Pretrained BERT

`nomic-bert-2048` is a BERT model pretrained on `wikipedia` and `bookcorpus` with a maximum sequence length of 2048 tokens.

We make several modifications to our BERT training procedure inspired by [MosaicBERT](https://www.databricks.com/blog/mosaicbert).
Namely, we:
- Use [Rotary Position Embeddings](https://arxiv.org/pdf/2104.09864.pdf) to allow for context length extrapolation.
- Use SwiGLU activations, which have [been shown](https://arxiv.org/abs/2002.05202) to [improve model performance](https://www.databricks.com/blog/mosaicbert) (a minimal sketch follows this list).
- Set dropout to 0.
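
Below is a minimal, illustrative PyTorch sketch of a SwiGLU feed-forward block of the kind described above. It is not the exact implementation used in `nomic-bert-2048`; the layer names and hidden dimension are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Gated feed-forward block: out = W2(SiLU(W1 x) * W3 x)."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_hidden, bias=False)  # gate projection
        self.w3 = nn.Linear(d_model, d_hidden, bias=False)  # value projection
        self.w2 = nn.Linear(d_hidden, d_model, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SiLU (Swish) gates the value projection element-wise
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

# Example: a BERT-base sized block (hidden size 768, assumed intermediate size 2048)
ffn = SwiGLUFeedForward(d_model=768, d_hidden=2048)
out = ffn(torch.randn(1, 16, 768))  # (batch, seq_len, d_model)
```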

We evaluate the quality of `nomic-bert-2048` on the standard [GLUE](https://gluebenchmark.com/) benchmark. We find it performs comparably to other BERT models but with the advantage of a significantly longer context length.

| Model        | Batch Size | Steps | Seq Len | Avg  | CoLA | SST2 | MRPC | STSB | QQP  | MNLI | QNLI | RTE  |
|--------------|------------|-------|---------|------|------|------|------|------|------|------|------|------|
| NomicBERT    | 4k         | 100k  | 2048    | 0.84 | 0.50 | 0.93 | 0.88 | 0.90 | 0.92 | 0.86 | 0.92 | 0.82 |
| RobertaBase  | 8k         | 500k  | 512     | 0.86 | 0.64 | 0.95 | 0.90 | 0.91 | 0.92 | 0.88 | 0.93 | 0.79 |
| JinaBERTBase | 4k         | 100k  | 512     | 0.83 | 0.51 | 0.95 | 0.88 | 0.90 | 0.81 | 0.86 | 0.92 | 0.79 |
| MosaicBERT   | 4k         | 178k  | 128     | 0.85 | 0.59 | 0.94 | 0.89 | 0.90 | 0.92 | 0.86 | 0.91 | 0.83 |

## Pretraining Data

We use [BookCorpus](https://huggingface.co/datasets/bookcorpus) and a 2023 dump of [Wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia).
We pack and tokenize the sequences to 2048 tokens. If a document is shorter than 2048 tokens, we append another document until it fills 2048 tokens.
If a document is longer than 2048 tokens, we split it across multiple documents. We release the dataset [here](https://huggingface.co/datasets/nomic-ai/nomic-bert-2048-pretraining-data/).
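
As a rough illustration of this packing strategy, the sketch below greedily concatenates tokenized documents into 2048-token chunks. It is a simplified stand-in, not the released preprocessing code; the function and variable names are assumptions.

```python
from transformers import AutoTokenizer

MAX_LEN = 2048
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

def pack_documents(documents, max_len=MAX_LEN):
    """Greedily pack tokenized documents into chunks of at most max_len tokens."""
    packed, buffer = [], []
    for doc in documents:
        buffer.extend(tokenizer(doc, add_special_tokens=False)['input_ids'])
        # A document longer than max_len spills into multiple chunks;
        # shorter documents are appended until a chunk is full.
        while len(buffer) >= max_len:
            packed.append(buffer[:max_len])
            buffer = buffer[max_len:]
    if buffer:  # final, possibly shorter, chunk
        packed.append(buffer)
    return packed

chunks = pack_documents(["first document ...", "second document ..."])
print([len(chunk) for chunk in chunks])
```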

## Usage

```python
from transformers import AutoModelForMaskedLM, AutoConfig, AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')  # `nomic-bert-2048` uses the standard BERT tokenizer

config = AutoConfig.from_pretrained('nomic-ai/nomic-bert-2048', trust_remote_code=True)  # the config needs to be passed in
model = AutoModelForMaskedLM.from_pretrained('nomic-ai/nomic-bert-2048', config=config, trust_remote_code=True)

# To use this model directly for masked language modeling
classifier = pipeline('fill-mask', model=model, tokenizer=tokenizer, device="cpu")

print(classifier("I [MASK] to the store yesterday."))
```
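
Since the model accepts sequences of up to 2048 tokens, longer documents can be scored in a single forward pass. The snippet below is a hedged sketch that reuses `tokenizer` and `model` from the example above and assumes the remote code returns standard masked-language-modeling outputs with a `.logits` field; `long_text` is a placeholder you would replace with your own document.

```python
import torch

long_text = "Replace this placeholder with a document of up to 2048 tokens."
inputs = tokenizer(long_text, truncation=True, max_length=2048, return_tensors='pt')

with torch.no_grad():
    outputs = model(**inputs)

# Masked-language-modeling logits: (batch, seq_len, vocab_size)
print(outputs.logits.shape)
```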