---
license: apache-2.0
datasets:
- wikimedia/wikipedia
- bookcorpus
language:
- en
inference: false
---

# nomic-bert-2048: A 2048 Sequence Length Pretrained BERT

`nomic-bert-2048` is a BERT model pretrained on `wikipedia` and `bookcorpus` with a maximum sequence length of 2048 tokens.

We make several modifications to our BERT training procedure, inspired by [MosaicBERT](https://www.databricks.com/blog/mosaicbert). Namely, we:
- Use [Rotary Position Embeddings](https://arxiv.org/pdf/2104.09864.pdf) to allow for context-length extrapolation.
- Use SwiGLU activations, which have [been shown](https://arxiv.org/abs/2002.05202) to [improve model performance](https://www.databricks.com/blog/mosaicbert) (see the sketch after this list).
- Set dropout to 0.

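For reference, SwiGLU replaces the usual GELU feed-forward block with a gated unit. The PyTorch snippet below is an illustrative sketch of such a block (class and attribute names are hypothetical, not the exact layer used in `nomic-bert-2048`):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Illustrative SwiGLU feed-forward block: SwiGLU(x) = (SiLU(x W_gate) * (x W_up)) W_down."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)    # value projection
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```
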
We evaluate the quality of nomic-bert-2048 on the standard [GLUE](https://gluebenchmark.com/) benchmark. We find it performs comparably to other BERT models, but with the advantage of a significantly longer context length.

| Model        | Batch Size | Steps | Seq Len | Avg  | CoLA | SST-2 | MRPC | STS-B | QQP  | MNLI | QNLI | RTE  |
|--------------|------------|-------|---------|------|------|-------|------|-------|------|------|------|------|
| NomicBERT    | 4k         | 100k  | 2048    | 0.84 | 0.50 | 0.93  | 0.88 | 0.90  | 0.92 | 0.86 | 0.92 | 0.82 |
| RobertaBase  | 8k         | 500k  | 512     | 0.86 | 0.64 | 0.95  | 0.90 | 0.91  | 0.92 | 0.88 | 0.93 | 0.79 |
| JinaBERTBase | 4k         | 100k  | 512     | 0.83 | 0.51 | 0.95  | 0.88 | 0.90  | 0.81 | 0.86 | 0.92 | 0.79 |
| MosaicBERT   | 4k         | 178k  | 128     | 0.85 | 0.59 | 0.94  | 0.89 | 0.90  | 0.92 | 0.86 | 0.91 | 0.83 |

## Pretraining Data

We use [BookCorpus](https://huggingface.co/datasets/bookcorpus) and a 2023 dump of [Wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia).
We tokenize the documents and pack them into sequences of 2048 tokens. If a document is shorter than 2048 tokens, we append another document until the sequence reaches 2048 tokens.
If a document is longer than 2048 tokens, we split it across multiple sequences. We release the packed dataset [here](https://huggingface.co/datasets/nomic-ai/nomic-bert-2048-pretraining-data/).

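The packing procedure can be pictured with the simplified sketch below (a greedy illustration of the description above, not the exact preprocessing code; `tokenized_docs` is assumed to be an iterable of token-id lists):

```python
MAX_LEN = 2048  # maximum sequence length used for pretraining

def pack_documents(tokenized_docs, max_len=MAX_LEN):
    """Greedily concatenate tokenized documents into chunks of exactly `max_len` tokens."""
    packed, buffer = [], []
    for doc in tokenized_docs:
        buffer.extend(doc)
        # Emit full chunks; a document longer than `max_len` spills into several chunks.
        while len(buffer) >= max_len:
            packed.append(buffer[:max_len])
            buffer = buffer[max_len:]
    # Any remainder shorter than `max_len` would be padded or merged with the next shard in practice.
    return packed
```
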
## Usage

```python
from transformers import AutoModelForMaskedLM, AutoConfig, AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')  # `nomic-bert-2048` uses the standard BERT tokenizer

config = AutoConfig.from_pretrained('nomic-ai/nomic-bert-2048', trust_remote_code=True)  # the config needs to be passed in
model = AutoModelForMaskedLM.from_pretrained('nomic-ai/nomic-bert-2048', config=config, trust_remote_code=True)

# To use this model directly for masked language modeling
classifier = pipeline('fill-mask', model=model, tokenizer=tokenizer, device="cpu")

print(classifier("I [MASK] to the store yesterday."))
```
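
Because the model was pretrained with a 2048-token context, you can also run masked-token prediction on inputs well beyond the usual 512-token BERT limit. The snippet below is a minimal sketch that reuses the `model` and `tokenizer` loaded above; it assumes the remote-code model returns standard masked-LM outputs with a `logits` field, and the long input text is purely illustrative.

```python
import torch

# An artificially long input (~1,500 tokens) with a single masked position near the end
long_text = "word " * 1500 + "I went to the [MASK] to buy some milk."
inputs = tokenizer(long_text, return_tensors="pt", truncation=True, max_length=2048)

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] token and decode the highest-scoring prediction for it
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_ids = logits[0, mask_positions].argmax(dim=-1)
print(tokenizer.decode(predicted_ids))
```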