Hitesh1501 committed
Commit bc0eba5
1 Parent(s): 314bbe2

Update README.md

Files changed (1)
  1. README.md +218 -192

README.md CHANGED
@@ -9,7 +9,7 @@ datasets:
  - trivia_qa
  ---
 
- # BERT base model (uncased)
 
  longformer-base-4096 is a BERT-like model started from the RoBERTa checkpoint and pretrained for MLM on long documents. It supports sequences of length up to 4,096.
  It was introduced in
@@ -24,7 +24,7 @@ Transformer-based models are unable to process long sequences due to their self-
  Longformer-Encoder-Decoder (LED), a Longformer variant for supporting long document generative sequence-to-sequence tasks,
  and demonstrate its effectiveness on the arXiv summarization dataset.
 
- "Transformer-based models are unable to pro-
  cess long sequences due to their self-attention
  operation, which scales quadratically with the
  sequence length. To address this limitation,
@@ -50,223 +50,249 @@ Longformer-Encoder-Decoder (LED), a Long-
  former variant for supporting long document
  generative sequence-to-sequence tasks, and
  demonstrate its effectiveness on the arXiv sum-
- marization dataset."
- - Next sentence prediction (NSP): the models concatenates two masked sentences as inputs during pretraining. Sometimes
- they correspond to sentences that were next to each other in the original text, sometimes not. The model then has to
- predict if the two sentences were following each other or not.
-
- This way, the model learns an inner representation of the English language that can then be used to extract features
- useful for downstream tasks: if you have a dataset of labeled sentences, for instance, you can train a standard
- classifier using the features produced by the BERT model as inputs.
-
- ## Model variations
-
- BERT has originally been released in base and large variations, for cased and uncased input text. The uncased models also strips out an accent markers.
- Chinese and multilingual uncased and cased versions followed shortly after.
- Modified preprocessing with whole word masking has replaced subpiece masking in a following work, with the release of two models.
- Other 24 smaller models are released afterward.
-
- The detailed release history can be found on the [google-research/bert readme](https://github.com/google-research/bert/blob/master/README.md) on github.
-
- | Model | #params | Language |
- |------------------------|--------------------------------|-------|
- | [`bert-base-uncased`](https://huggingface.co/bert-base-uncased) | 110M | English |
- | [`bert-large-uncased`](https://huggingface.co/bert-large-uncased) | 340M | English | sub
- | [`bert-base-cased`](https://huggingface.co/bert-base-cased) | 110M | English |
- | [`bert-large-cased`](https://huggingface.co/bert-large-cased) | 340M | English |
- | [`bert-base-chinese`](https://huggingface.co/bert-base-chinese) | 110M | Chinese |
- | [`bert-base-multilingual-cased`](https://huggingface.co/bert-base-multilingual-cased) | 110M | Multiple |
- | [`bert-large-uncased-whole-word-masking`](https://huggingface.co/bert-large-uncased-whole-word-masking) | 340M | English |
- | [`bert-large-cased-whole-word-masking`](https://huggingface.co/bert-large-cased-whole-word-masking) | 340M | English |
-
- ## Intended uses & limitations
-
- You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to
- be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?filter=bert) to look for
- fine-tuned versions of a task that interests you.
-
- Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked)
- to make decisions, such as sequence classification, token classification or question answering. For tasks such as text
- generation you should look at model like GPT2.
-
- ### How to use
-
- You can use this model directly with a pipeline for masked language modeling:
-
- ```python
- >>> from transformers import pipeline
- >>> unmasker = pipeline('fill-mask', model='bert-base-uncased')
- >>> unmasker("Hello I'm a [MASK] model.")
-
- [{'sequence': "[CLS] hello i'm a fashion model. [SEP]",
- 'score': 0.1073106899857521,
- 'token': 4827,
- 'token_str': 'fashion'},
- {'sequence': "[CLS] hello i'm a role model. [SEP]",
- 'score': 0.08774490654468536,
- 'token': 2535,
- 'token_str': 'role'},
- {'sequence': "[CLS] hello i'm a new model. [SEP]",
- 'score': 0.05338378623127937,
- 'token': 2047,
- 'token_str': 'new'},
- {'sequence': "[CLS] hello i'm a super model. [SEP]",
- 'score': 0.04667217284440994,
- 'token': 3565,
- 'token_str': 'super'},
- {'sequence': "[CLS] hello i'm a fine model. [SEP]",
- 'score': 0.027095865458250046,
- 'token': 2986,
- 'token_str': 'fine'}]
- ```
 
- Here is how to use this model to get the features of a given text in PyTorch:
 
- ```python
- from transformers import BertTokenizer, BertModel
- tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
- model = BertModel.from_pretrained("bert-base-uncased")
- text = "Replace me by any text you'd like."
- encoded_input = tokenizer(text, return_tensors='pt')
- output = model(**encoded_input)
- ```
 
- and in TensorFlow:
 
- ```python
- from transformers import BertTokenizer, TFBertModel
- tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
- model = TFBertModel.from_pretrained("bert-base-uncased")
- text = "Replace me by any text you'd like."
- encoded_input = tokenizer(text, return_tensors='tf')
- output = model(encoded_input)
- ```
 
- ### Limitations and bias
-
- Even if the training data used for this model could be characterized as fairly neutral, this model can have biased
- predictions:
-
- ```python
- >>> from transformers import pipeline
- >>> unmasker = pipeline('fill-mask', model='bert-base-uncased')
- >>> unmasker("The man worked as a [MASK].")
-
- [{'sequence': '[CLS] the man worked as a carpenter. [SEP]',
- 'score': 0.09747550636529922,
- 'token': 10533,
- 'token_str': 'carpenter'},
- {'sequence': '[CLS] the man worked as a waiter. [SEP]',
- 'score': 0.0523831807076931,
- 'token': 15610,
- 'token_str': 'waiter'},
- {'sequence': '[CLS] the man worked as a barber. [SEP]',
- 'score': 0.04962705448269844,
- 'token': 13362,
- 'token_str': 'barber'},
- {'sequence': '[CLS] the man worked as a mechanic. [SEP]',
- 'score': 0.03788609802722931,
- 'token': 15893,
- 'token_str': 'mechanic'},
- {'sequence': '[CLS] the man worked as a salesman. [SEP]',
- 'score': 0.037680890411138535,
- 'token': 18968,
- 'token_str': 'salesman'}]
-
- >>> unmasker("The woman worked as a [MASK].")
-
- [{'sequence': '[CLS] the woman worked as a nurse. [SEP]',
- 'score': 0.21981462836265564,
- 'token': 6821,
- 'token_str': 'nurse'},
- {'sequence': '[CLS] the woman worked as a waitress. [SEP]',
- 'score': 0.1597415804862976,
- 'token': 13877,
- 'token_str': 'waitress'},
- {'sequence': '[CLS] the woman worked as a maid. [SEP]',
- 'score': 0.1154729500412941,
- 'token': 10850,
- 'token_str': 'maid'},
- {'sequence': '[CLS] the woman worked as a prostitute. [SEP]',
- 'score': 0.037968918681144714,
- 'token': 19215,
- 'token_str': 'prostitute'},
- {'sequence': '[CLS] the woman worked as a cook. [SEP]',
- 'score': 0.03042375110089779,
- 'token': 5660,
- 'token_str': 'cook'}]
- ```
 
- This bias will also affect all fine-tuned versions of this model.
 
- ## Training data
 
- The BERT model was pretrained on [BookCorpus](https://yknzhu.wixsite.com/mbweb), a dataset consisting of 11,038
- unpublished books and [English Wikipedia](https://en.wikipedia.org/wiki/English_Wikipedia) (excluding lists, tables and
- headers).
 
- ## Training procedure
 
- ### Preprocessing
 
- The texts are lowercased and tokenized using WordPiece and a vocabulary size of 30,000. The inputs of the model are
- then of the form:
 
- ```
- [CLS] Sentence A [SEP] Sentence B [SEP]
- ```
 
- With probability 0.5, sentence A and sentence B correspond to two consecutive sentences in the original corpus, and in
- the other cases, it's another random sentence in the corpus. Note that what is considered a sentence here is a
- consecutive span of text usually longer than a single sentence. The only constrain is that the result with the two
- "sentences" has a combined length of less than 512 tokens.
 
- The details of the masking procedure for each sentence are the following:
- - 15% of the tokens are masked.
- - In 80% of the cases, the masked tokens are replaced by `[MASK]`.
- - In 10% of the cases, the masked tokens are replaced by a random token (different) from the one they replace.
- - In the 10% remaining cases, the masked tokens are left as is.
 
- ### Pretraining
 
- The model was trained on 4 cloud TPUs in Pod configuration (16 TPU chips total) for one million steps with a batch size
- of 256. The sequence length was limited to 128 tokens for 90% of the steps and 512 for the remaining 10%. The optimizer
- used is Adam with a learning rate of 1e-4, \\(\beta_{1} = 0.9\\) and \\(\beta_{2} = 0.999\\), a weight decay of 0.01,
- learning rate warmup for 10,000 steps and linear decay of the learning rate after.
 
- ## Evaluation results
 
- When fine-tuned on downstream tasks, this model achieves the following results:
 
- Glue test results:
 
- | Task | MNLI-(m/mm) | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE | Average |
- |:----:|:-----------:|:----:|:----:|:-----:|:----:|:-----:|:----:|:----:|:-------:|
- | | 84.6/83.4 | 71.2 | 90.5 | 93.5 | 52.1 | 85.8 | 88.9 | 66.4 | 79.6 |
 
  ### BibTeX entry and citation info
 
  ```bibtex
- @article{DBLP:journals/corr/abs-1810-04805,
- author = {Jacob Devlin and
- Ming{-}Wei Chang and
- Kenton Lee and
- Kristina Toutanova},
- title = {{BERT:} Pre-training of Deep Bidirectional Transformers for Language
- Understanding},
  journal = {CoRR},
- volume = {abs/1810.04805},
- year = {2018},
- url = {http://arxiv.org/abs/1810.04805},
  archivePrefix = {arXiv},
- eprint = {1810.04805},
- timestamp = {Tue, 30 Oct 2018 20:39:56 +0100},
- biburl = {https://dblp.org/rec/journals/corr/abs-1810-04805.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
  }
- ```
 
- <a href="https://huggingface.co/exbert/?model=bert-base-uncased">
- <img width="300px" src="https://cdn-media.huggingface.co/exbert/button.png">
- </a>
 
  - trivia_qa
  ---
 
+ # Longformer
 
  longformer-base-4096 is a BERT-like model started from the RoBERTa checkpoint and pretrained for MLM on long documents. It supports sequences of length up to 4,096.
  It was introduced in
 
  Longformer-Encoder-Decoder (LED), a Longformer variant for supporting long document generative sequence-to-sequence tasks,
  and demonstrate its effectiveness on the arXiv summarization dataset.
 
+ - Transformer-based models are unable to pro-
  cess long sequences due to their self-attention
  operation, which scales quadratically with the
  sequence length. To address this limitation,
 
  former variant for supporting long document
  generative sequence-to-sequence tasks, and
  demonstrate its effectiveness on the arXiv sum-
+ marization dataset.
+ - The original Transformer model has a self-attention component with O(n^2) time and memory complexity where n is the input sequence length. To address this challenge, we sparsify the full self-attention matrix according to an “attention pattern” specifying pairs of input locations attending to one another. Unlike the full self-attention, our proposed attention pattern scales linearly with the input sequence, making it efficient for longer sequences. This section discusses the design and implementation of this attention pattern.
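
The "attention pattern" quoted above can be made concrete with a short sketch. This is an illustrative example only, not the model's actual implementation (which lives in the `transformers` Longformer code): it builds a sliding-window mask and counts attended position pairs to show linear rather than quadratic growth with sequence length.

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask where position i may attend to positions within +/- window // 2."""
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window // 2

for n in (512, 1024, 2048, 4096):
    mask = sliding_window_mask(n, window=512)
    # Full self-attention scores n * n pairs; the windowed pattern scores
    # roughly n * (window + 1) pairs, i.e. it grows linearly in n.
    print(n, int(mask.sum()), n * n)
```

Note that this sketch omits the dilated windows and the task-specific global attention that the paper adds on top of the basic sliding window.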
 
 
+ ## Dataset and Task
 
+ For language modelling, to compare to prior work we focus on character-level LM (text8 and enwik8; Mahoney, 2009).
+ For finetuned tasks we use: WikiHop, TriviaQA, HotpotQA, OntoNotes, IMDB, and Hyperpartisan.
 
+ We evaluate on text8 and enwik8; both contain 100M characters from Wikipedia, split into 90M, 5M, and 5M for train, dev, and test.
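
A minimal sketch of that 90M/5M/5M character split (not from the original card; it assumes text8 has been downloaded and unzipped locally, e.g. from http://mattmahoney.net/dc/text8.zip):

```python
# Split a 100M-character corpus into train/dev/test exactly as described above.
with open("text8", "r", encoding="utf-8") as f:  # hypothetical local path
    raw = f.read()

train = raw[:90_000_000]
dev = raw[90_000_000:95_000_000]
test = raw[95_000_000:]
print(len(train), len(dev), len(test))
```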
 
+ ## Tokenizer with Vocabulary size
 
+ To prepare the data for input to Longformer and RoBERTa, we first tokenize the question, answer candidates, and support contexts using RoBERTa's wordpiece tokenizer. The special tokens [q], [/q], [ent], [/ent] were added to the RoBERTa vocabulary and randomly initialized before task finetuning.
 
+ NOTE: A similar strategy was used for all tasks, and the vocabulary size is similar to RoBERTa's vocabulary.
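
A hedged sketch of that step with the `transformers` API (the checkpoint name `allenai/longformer-base-4096` and the example sentence are assumptions, not part of the card):

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
model = AutoModel.from_pretrained("allenai/longformer-base-4096")

# Register the WikiHop-style markers; their embeddings start randomly
# initialized and are learned during task finetuning.
tokenizer.add_special_tokens({"additional_special_tokens": ["[q]", "[/q]", "[ent]", "[/ent]"]})
model.resize_token_embeddings(len(tokenizer))

ids = tokenizer("[q] who wrote hamlet? [/q] [ent] shakespeare [/ent]")["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))
```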
 
+ ### Computational Resources
 
+ Character-level language modelling: we ran the small-model experiments on 4 RTX8000 GPUs for 16 days. For the large model, we ran experiments on 8 RTX8000 GPUs for 13 days.
 
+ For WikiHop: all models were trained on a single RTX8000 GPU, with Longformer-base taking about a day for 5 epochs.
 
+ For TriviaQA: we ran our experiments on 32GB V100 GPUs. The small model takes 1 day to train on 4 GPUs, while the large model takes 1 day on 8 GPUs.
 
+ For HotpotQA: our experiments are done on RTX8000 GPUs and training each epoch takes approximately half a day on 4 GPUs.
 
+ Text classification: experiments were done on a single RTX8000 GPU.
 
+ ### Pretraining Objective
 
+ We pretrain Longformer with masked language modeling (MLM), where the goal is to recover randomly masked tokens in a sequence.
 
+ Biases present in the pretraining data will also affect all fine-tuned versions of this model.
 
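A quick illustration of the MLM objective (not part of the original card; it assumes the `allenai/longformer-base-4096` checkpoint, whose mask token is `<mask>`):

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="allenai/longformer-base-4096")
# The MLM head predicts the token hidden behind <mask>.
print(unmasker("Longformer uses an attention pattern that scales <mask> with the sequence length."))
```
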
+ ## Training Setup
 
+ 1. [For MLM pretraining] We train two model sizes, a base model and a large model. Both models are trained for 65K gradient updates with sequence length 4,096, batch size 64 (2^18 tokens), maximum learning rate of 3e-5, linear warmup of 500 steps, followed by a power 3 polynomial decay. The rest of the hyperparameters are the same as RoBERTa. (A hedged optimizer/scheduler sketch for this setup follows the list.)
 
+ 2. [For character-level language modeling] Hyperparameters for the best performing model are listed in the paper.
 
+ 3. [For the QA models] All models use a similar scheduler with linear warmup and decay.
 
+ 4. [For coreference resolution] The maximum sequence length was 384 for RoBERTa-base, chosen after three trials from [256, 384, 512] using the default hyperparameters in the original implementation. For Longformer-base the sequence length was 4,096. ...
 
+ 5. [For coreference resolution] ... Hyperparameter searches were minimal and consisted of grid searches of RoBERTa LR in [1e-5, 2e-5, 3e-5] and task LR in [1e-4, 2e-4, 3e-4] for both RoBERTa and Longformer for a fair comparison. The best configuration for Longformer-base was RoBERTa lr=1e-5, task lr=1e-4. All other hyperparameters were the same as in the original implementation.
 
+ 6. [For text classification] We used the Adam optimizer with batch sizes of 32 and linear warmup and decay with warmup steps equal to 0.1 of the total training steps. For both IMDB and Hyperpartisan news we did a grid search of LRs [3e-5, 5e-5] and epochs [10, 15, 20] and found the model with LR 3e-5 and 15 epochs to work best.
 
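A hedged sketch of the MLM pretraining schedule in item 1 above (the AdamW optimizer choice and the model class are assumptions; the step count, peak LR, warmup, and power-3 decay come from the list):

```python
import torch
from transformers import AutoModelForMaskedLM, get_polynomial_decay_schedule_with_warmup

model = AutoModelForMaskedLM.from_pretrained("allenai/longformer-base-4096")

# 65K gradient updates, peak LR 3e-5, 500 warmup steps, power-3 polynomial decay.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
scheduler = get_polynomial_decay_schedule_with_warmup(
    optimizer, num_warmup_steps=500, num_training_steps=65_000, power=3.0
)
```
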
+ ## Training procedure
 
+ ### Preprocessing
 
+ For WikiHop: to prepare the data for input to Longformer and RoBERTa, we first tokenize the question, answer candidates, and support contexts using RoBERTa's wordpiece tokenizer. Then we concatenate the question and answer candidates with special tokens as [q] question [/q] [ent] candidate1 [/ent] ... [ent] candidateN [/ent]. The contexts are also concatenated using RoBERTa's document delimiter tokens as separators: </s> context1 </s> ... </s> contextM </s>. The special tokens [q], [/q], [ent], [/ent] were added to the RoBERTa vocabulary and randomly initialized before task finetuning.
 
+ For TriviaQA: similar to WikiHop, we tokenize the question and the document using RoBERTa's tokenizer, then form the input as [s] question [/s] document [/s]. We truncate the document at 4,096 wordpieces to avoid it being very slow.
 
+ For HotpotQA: similar to WikiHop and TriviaQA, to prepare the data for input to Longformer, we concatenate the question and then all the 10 paragraphs in one long context. We particularly use the following input format with special tokens: "[CLS] [q] question [/q] <t> title1 </t> sent1,1 [s] sent1,2 [s] ... <t> title2 </t> sent2,1 [s] sent2,2 [s] ..." where [q], [/q], <t>, </t>, [s], [p] are special tokens representing question start and end, paragraph title start and end, and sentence, respectively. The special tokens were added to the Longformer vocabulary and randomly initialized before task finetuning.
 
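A hedged sketch of the WikiHop input construction described above (the checkpoint name, example question, candidates, and contexts are hypothetical, and the exact whitespace is an approximation of the card's format string):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
tokenizer.add_special_tokens({"additional_special_tokens": ["[q]", "[/q]", "[ent]", "[/ent]"]})

question = "which mountain range contains mount everest?"   # hypothetical example
candidates = ["himalayas", "andes", "alps"]                  # hypothetical example
contexts = ["Mount Everest is Earth's highest mountain.", "It lies in the Mahalangur Himal sub-range."]

# [q] question [/q] [ent] c1 [/ent] ... [ent] cN [/ent] followed by
# </s> context1 </s> ... </s> contextM </s>
query = f"[q] {question} [/q] " + " ".join(f"[ent] {c} [/ent]" for c in candidates)
docs = " </s> " + " </s> ".join(contexts) + " </s>"
encoded = tokenizer(query + docs, truncation=True, max_length=4096, return_tensors="pt")
print(encoded["input_ids"].shape)
```
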
+ ### Experiment
 
+ 1. Character-level language modeling: a) To compare to prior work we focus on character-level LM (text8 and enwik8; Mahoney, 2009). b) Tables 2 and 3 summarize evaluation results on the text8 and enwik8 datasets. We achieve a new state-of-the-art on both text8 and enwik8 using the small models, with BPC of 1.10 and 1.00 on text8 and enwik8 respectively, demonstrating the effectiveness of our model.
 
+ 2. Pretraining: a) We pretrain Longformer with masked language modeling (MLM), where the goal is to recover randomly masked tokens in a sequence. b) Table 5 reports MLM BPC for RoBERTa and various pretrained Longformer configurations.
 
+ 3. WikiHop: Instances in WikiHop consist of: a question, answer candidates (ranging from two candidates to 79 candidates), supporting contexts (ranging from three paragraphs to 63 paragraphs), and the correct answer. The dataset does not provide any intermediate annotation for the multihop reasoning chains, requiring models to instead infer them from the indirect answer supervision.
 
+ 4. TriviaQA: TriviaQA has more than 100K question, answer, document triplets for training. Documents are Wikipedia articles, and answers are named entities mentioned in the article. The span that answers the question is not annotated, but it is found using simple text matching.
 
+ 5. HotpotQA: The HotpotQA dataset involves answering questions from a set of 10 paragraphs from 10 different Wikipedia articles where 2 paragraphs are relevant to the question and the rest are distractors. It includes 2 tasks of answer span extraction and evidence sentence identification. Our model for HotpotQA combines both answer span extraction and evidence extraction in one joint model.
 
+ 6. Coreference model: The coreference model is a straightforward adaptation of the coarse-to-fine BERT based model from Joshi et al. (2019).
 
+ 7. Text classification: For classification, following BERT, we used a simple binary cross entropy loss on top of a first [CLS] token with addition of global attention to [CLS] (see the sketch after this list).
 
+ 8. Evaluation metrics for finetuned tasks: Summary of finetuning results on QA, coreference resolution, and document classification. Results are on the development sets comparing our Longformer-base with RoBERTa-base. TriviaQA and Hyperpartisan metrics are F1, WikiHop and IMDB use accuracy, HotpotQA is joint F1, and OntoNotes is average F1.
 
+ 9. Summarization: a) We evaluate LED on the summarization task using the arXiv summarization dataset (Cohan et al.), which focuses on long document summarization in the scientific domain. b) Table 11 reports summarization results of Longformer-Encoder-Decoder (LED) on the arXiv dataset; metrics from left to right are ROUGE-1, ROUGE-2 and ROUGE-L.
 
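As a hedged illustration of item 7 (classification with global attention on the leading token), the snippet below is not from the original card; the checkpoint name, `num_labels`, and the input text are assumptions:

```python
import torch
from transformers import LongformerForSequenceClassification, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerForSequenceClassification.from_pretrained("allenai/longformer-base-4096", num_labels=2)

inputs = tokenizer("A very long movie review ...", return_tensors="pt")
# Sliding-window (local) attention everywhere, plus global attention on the
# first token, mirroring "global attention to [CLS]" above.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1
logits = model(**inputs, global_attention_mask=global_attention_mask).logits
print(logits)
```
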
+ ## Ablation
 
+ Ablation study for WikiHop on the development set: all results use Longformer-base, fine-tuned for five epochs with identical hyperparameters except where noted. Longformer benefits from longer sequences, global attention, separate projection matrices for global attention, MLM pretraining, and longer training. In addition, when configured as in RoBERTa-base (seqlen: 512, and n^2 attention) Longformer performs slightly worse than RoBERTa-base, confirming that performance gains are not due to additional pretraining. Performance drops slightly when using the RoBERTa model pretrained while only unfreezing the additional position embeddings, showing that Longformer can learn to use long range context in task specific fine-tuning with large training datasets such as WikiHop.
 
  ### BibTeX entry and citation info
 
  ```bibtex
+ @article{DBLP:journals/corr/abs-2004-05150,
+ author = {Iz Beltagy and
+ Matthew E. Peters and
+ Arman Cohan},
+ title = {Longformer: The Long-Document Transformer},
  journal = {CoRR},
+ volume = {abs/2004.05150},
+ year = {2020},
+ url = {http://arxiv.org/abs/2004.05150},
  archivePrefix = {arXiv},
+ eprint = {2004.05150},
+ timestamp = {Wed, 22 Apr 2020 14:29:36 +0200},
+ biburl = {https://dblp.org/rec/journals/corr/abs-2004-05150.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
  }
 
+ ```