---
language: en
tags:
- exbert
license: apache-2.0
datasets:
- bookcorpus
- wikipedia
- trivia_qa
---

# Longformer

longformer-base-4096 is a BERT-like model initialized from the RoBERTa checkpoint and pretrained with masked language modeling (MLM) on long documents. It supports sequences of up to 4,096 tokens.

It was introduced in [this paper](https://arxiv.org/abs/2004.05150) and first released in [this repository](https://github.com/allenai/longformer). Longformer uses a combination of sliding-window (local) attention and global attention. Global attention is user-configured based on the task, allowing the model to learn task-specific representations.

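
The snippet below is a minimal sketch of loading this checkpoint with the Hugging Face `transformers` library and marking one token for global attention. The checkpoint name `allenai/longformer-base-4096` and the choice of putting global attention only on the first token are illustrative assumptions; the right global positions depend on the task.

```python
import torch
from transformers import LongformerModel, LongformerTokenizer

# Load the pretrained checkpoint (assumed to be the published allenai/longformer-base-4096).
tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

text = "Replace this with a long document of up to 4,096 tokens. " * 50
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)

# All tokens use sliding-window (local) attention by default; here we mark only
# the first (<s>) token for global attention as a task-neutral example.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

outputs = model(**inputs, global_attention_mask=global_attention_mask)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, hidden_size)
```
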
## Model description

Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length. The paper also introduces the Longformer-Encoder-Decoder (LED), a Longformer variant supporting long-document generative sequence-to-sequence tasks, and demonstrates its effectiveness on the arXiv summarization dataset.

- Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length. To address this limitation, we introduce the Longformer with an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer. Longformer's attention mechanism is a drop-in replacement for the standard self-attention and combines a local windowed attention with a task-motivated global attention. Following prior work on long-sequence transformers, we evaluate Longformer on character-level language modeling and achieve state-of-the-art results on text8 and enwik8. In contrast to most prior work, we also pretrain Longformer and finetune it on a variety of downstream tasks. Our pretrained Longformer consistently outperforms RoBERTa on long document tasks and sets new state-of-the-art results on WikiHop and TriviaQA. We finally introduce the Longformer-Encoder-Decoder (LED), a Longformer variant for supporting long document generative sequence-to-sequence tasks, and demonstrate its effectiveness on the arXiv summarization dataset.

- The original Transformer model has a self-attention component with O(n^2) time and memory complexity, where n is the input sequence length. To address this challenge, we sparsify the full self-attention matrix according to an "attention pattern" specifying pairs of input locations attending to one another. Unlike the full self-attention, our proposed attention pattern scales linearly with the input sequence, making it efficient for longer sequences.

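
As a rough illustration of this sparse pattern (not the actual chunked implementation, which never materializes a dense attention matrix), the sketch below builds a boolean mask combining a sliding window with a few user-chosen global positions. The window size and global indices are arbitrary example values.

```python
import torch

def longformer_attention_pattern(seq_len: int, window: int, global_idx):
    """Illustrative boolean mask of the sparse attention pattern: each token
    attends to a local window of `window` tokens on each side, and
    globally-attended positions attend to, and are attended by, every token."""
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    for i in range(seq_len):
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        mask[i, lo:hi] = True          # sliding-window (local) attention
    for g in global_idx:
        mask[g, :] = True              # global token attends everywhere
        mask[:, g] = True              # every token attends to the global token
    return mask

pattern = longformer_attention_pattern(seq_len=16, window=2, global_idx=[0])
print(pattern.int())
```
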
## Dataset and Task

To compare to prior work we focus on character-level language modeling (text8 and enwik8; Mahoney, 2009).

For finetuned tasks: WikiHop, TriviaQA, HotpotQA, OntoNotes, IMDB, Hyperpartisan.

We evaluate on text8 and enwik8, both of which contain 100M characters from Wikipedia, split into 90M, 5M, and 5M characters for train, dev, and test.

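
The split described above is a simple contiguous character split; the sketch below reproduces it for text8 (the local file path is an assumption).

```python
def split_char_corpus(path: str = "text8"):
    """Split a 100M-character corpus into 90M/5M/5M train/dev/test slices."""
    with open(path, "r", encoding="utf-8") as f:
        data = f.read()                    # 100M characters of Wikipedia text
    train = data[:90_000_000]              # 90M characters
    dev = data[90_000_000:95_000_000]      # 5M characters
    test = data[95_000_000:]               # 5M characters
    return train, dev, test
```
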
## Tokenizer and Vocabulary Size

To prepare the data for input to Longformer and RoBERTa, we first tokenize the question, answer candidates, and support contexts using RoBERTa's wordpiece tokenizer. The special tokens [q], [/q], [ent], [/ent] were added to the RoBERTa vocabulary and randomly initialized before task finetuning.

NOTE: A similar strategy was used for all tasks, and the vocabulary size is the same as RoBERTa's vocabulary.

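
The sketch below shows one way to register such markers with the Hugging Face `transformers` API; the token strings are the ones quoted above, while the overall call pattern is a generic illustration rather than the released preprocessing code.

```python
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

# Add the WikiHop markers as additional special tokens; their embeddings are
# randomly initialized and then learned during task finetuning.
num_added = tokenizer.add_tokens(["[q]", "[/q]", "[ent]", "[/ent]"], special_tokens=True)
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens; new vocabulary size {len(tokenizer)}")
```
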
### Computational Resources

Character-level language modeling: We ran the small model experiments on 4 RTX8000 GPUs for 16 days. For the large model, we ran experiments on 8 RTX8000 GPUs for 13 days.

For WikiHop: All models were trained on a single RTX8000 GPU, with Longformer-base taking about a day for 5 epochs.

For TriviaQA: We ran our experiments on 32GB V100 GPUs. The small model takes 1 day to train on 4 GPUs, while the large model takes 1 day on 8 GPUs.

For HotpotQA: Our experiments were done on RTX8000 GPUs, and training each epoch takes approximately half a day on 4 GPUs.

Text classification: Experiments were done on a single RTX8000 GPU.

### Pretraining Objective

We pretrain Longformer with masked language modeling (MLM), where the goal is to recover randomly masked tokens in a sequence.

Any biases present in the pretraining data will also affect all fine-tuned versions of this model.

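
For intuition, the sketch below prepares MLM training batches with the standard `transformers` collator. The 15% masking rate follows BERT/RoBERTa and is an assumption here, since the exact rate is not restated in this card.

```python
from transformers import DataCollatorForLanguageModeling, LongformerTokenizerFast

tokenizer = LongformerTokenizerFast.from_pretrained("allenai/longformer-base-4096")

# Standard MLM masking: a fraction of tokens is replaced (mostly by the mask
# token) and the model is trained to recover the original tokens.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

example = tokenizer("A long document to be masked ...", truncation=True, max_length=4096)
batch = collator([example])
print(batch["input_ids"].shape, batch["labels"].shape)
```
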
## Training Setup

1. [For MLM pretraining] We train two model sizes, a base model and a large model. Both models are trained for 65K gradient updates with sequence length 4,096, batch size 64 (2^18 tokens), a maximum learning rate of 3e-5, a linear warmup of 500 steps, followed by a power 3 polynomial decay (a sketch of this schedule follows this list). The rest of the hyperparameters are the same as RoBERTa.

2. Hyperparameters for the best performing model for character-level language modeling are reported in the paper.

3. Hyperparameters of the QA models: all models use a similar scheduler with linear warmup and decay.

4. [For coreference resolution] The maximum sequence length was 384 for RoBERTa-base, chosen after three trials from [256, 384, 512] using the default hyperparameters in the original implementation. For Longformer-base the sequence length was 4,096.

5. [For coreference resolution] Hyperparameter searches were minimal and consisted of grid searches of RoBERTa LR in [1e-5, 2e-5, 3e-5] and task LR in [1e-4, 2e-4, 3e-4] for both RoBERTa and Longformer for a fair comparison. The best configuration for Longformer-base was RoBERTa lr=1e-5, task lr=1e-4. All other hyperparameters were the same as in the original implementation.

6. [For text classification] We used the Adam optimizer with batch sizes of 32 and linear warmup and decay with warmup steps equal to 0.1 of the total training steps. For both IMDB and Hyperpartisan news we did a grid search of LRs [3e-5, 5e-5] and epochs [10, 15, 20] and found the model with LR 3e-5 and 15 epochs to work best.

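
The sketch below reproduces the pretraining schedule from item 1 (peak LR 3e-5, 500 linear warmup steps, power 3 polynomial decay over 65K updates) with the `transformers` scheduler helper; the dummy linear layer stands in for the actual Longformer parameters.

```python
import torch
from transformers import get_polynomial_decay_schedule_with_warmup

# Placeholder parameters; in real pretraining these would be the Longformer weights.
params = torch.nn.Linear(8, 8).parameters()
optimizer = torch.optim.AdamW(params, lr=3e-5)

scheduler = get_polynomial_decay_schedule_with_warmup(
    optimizer,
    num_warmup_steps=500,
    num_training_steps=65_000,
    power=3.0,
)

for step in range(3):
    optimizer.step()       # in a real loop this follows the backward pass
    scheduler.step()
    print(step, scheduler.get_last_lr())
```
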
## Training procedure

### Preprocessing

For WikiHop: To prepare the data for input to Longformer and RoBERTa, we first tokenize the question, answer candidates, and support contexts using RoBERTa's wordpiece tokenizer. Then we concatenate the question and answer candidates with special tokens as `[q] question [/q] [ent] candidate1 [/ent] ... [ent] candidateN [/ent]`. The contexts are also concatenated using RoBERTa's document delimiter tokens as separators: `</s> context1 </s> ... </s> contextM </s>`. The special tokens [q], [/q], [ent], [/ent] were added to the RoBERTa vocabulary and randomly initialized before task finetuning.

For TriviaQA: Similar to WikiHop, we tokenize the question and the document using RoBERTa's tokenizer, then form the input as `[s] question [/s] document [/s]`. We truncate the document at 4,096 wordpieces to avoid it being very slow.

For HotpotQA: Similar to WikiHop and TriviaQA, to prepare the data for input to Longformer, we concatenate the question and then all 10 paragraphs in one long context. We particularly use the following input format with special tokens: `[CLS] [q] question [/q] <t> title1 </t> sent1,1 [s] sent1,2 [s] ... <t> title2 </t> sent2,1 [s] sent2,2 [s] ...`, where `[q]`, `[/q]`, `<t>`, `</t>`, `[s]`, `[p]` are special tokens representing question start and end, paragraph title start and end, and sentence, respectively. The special tokens were added to the Longformer vocabulary and randomly initialized before task finetuning.

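
To make the WikiHop layout above concrete, here is a hypothetical helper that assembles the input string; the function name, arguments, and sample values are illustrative assumptions, not the released preprocessing code.

```python
def build_wikihop_input(question, candidates, contexts):
    """Sketch of the WikiHop input format described above."""
    q_part = f"[q] {question} [/q] " + " ".join(
        f"[ent] {c} [/ent]" for c in candidates
    )
    ctx_part = "</s> " + " </s> ".join(contexts) + " </s>"
    return q_part + " " + ctx_part

print(build_wikihop_input(
    "country_of_origin anfield",
    ["england", "france"],
    ["Anfield is a football stadium ...", "Liverpool F.C. play at Anfield ..."],
))
```
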
### Experiment

1. Character-level language modeling: a) To compare to prior work we focus on character-level LM (text8 and enwik8; Mahoney, 2009).

b) Tables 2 and 3 of the paper summarize evaluation results on the text8 and enwik8 datasets. We achieve a new state-of-the-art on both text8 and enwik8 using the small models, with BPC of 1.10 and 1.00 on text8 and enwik8 respectively, demonstrating the effectiveness of our model.

2. Pretraining: a) We pretrain Longformer with masked language modeling (MLM), where the goal is to recover randomly masked tokens in a sequence.

b) Table 5: MLM BPC for RoBERTa and various pretrained Longformer configurations.

3. WikiHop: Instances in WikiHop consist of: a question, answer candidates (ranging from two candidates to 79 candidates), supporting contexts (ranging from three paragraphs to 63 paragraphs), and the correct answer. The dataset does not provide any intermediate annotation for the multihop reasoning chains, requiring models to instead infer them from the indirect answer supervision.

4. TriviaQA: TriviaQA has more than 100K question, answer, document triplets for training. Documents are Wikipedia articles, and answers are named entities mentioned in the article. The span that answers the question is not annotated, but it is found using simple text matching.

5. HotpotQA: The HotpotQA dataset involves answering questions from a set of 10 paragraphs from 10 different Wikipedia articles, where 2 paragraphs are relevant to the question and the rest are distractors. It includes 2 tasks of answer span extraction and evidence sentence identification. Our model for HotpotQA combines both answer span extraction and evidence extraction in one joint model.

6. Coreference model: The coreference model is a straightforward adaptation of the coarse-to-fine BERT based model from Joshi et al. (2019).

7. Text classification: For classification, following BERT, we used a simple binary cross entropy loss on top of the first [CLS] token, with the addition of global attention to [CLS] (see the sketch after this list).

8. Evaluation metrics for finetuned tasks: The paper summarizes finetuning results on QA, coreference resolution, and document classification. Results are on the development sets, comparing our Longformer-base with RoBERTa-base. TriviaQA and Hyperpartisan metrics are F1, WikiHop and IMDB use accuracy, HotpotQA is joint F1, and OntoNotes is average F1.

9. Summarization: a) We evaluate LED on the summarization task using the arXiv summarization dataset (Cohan et al.), which focuses on long document summarization in the scientific domain.

b) Table 11: Summarization results of Longformer-Encoder-Decoder (LED) on the arXiv dataset. Metrics from left to right are ROUGE-1, ROUGE-2 and ROUGE-L.

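
The sketch below illustrates the classification setup from item 7 with the `transformers` classification head: global attention is placed on the first token and a standard classification loss is used on top of it. The input text and the two-label setup are example assumptions.

```python
import torch
from transformers import LongformerForSequenceClassification, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerForSequenceClassification.from_pretrained(
    "allenai/longformer-base-4096", num_labels=2
)

inputs = tokenizer(
    "A long movie review ...", return_tensors="pt", truncation=True, max_length=4096
)

# Global attention on the first (<s>, i.e. [CLS]-like) token, as in item 7 above;
# all other tokens keep sliding-window local attention.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

logits = model(**inputs, global_attention_mask=global_attention_mask).logits
print(logits)
```
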
## Ablation

Ablation study for WikiHop on the development set. All results use Longformer-base, fine-tuned for five epochs with identical hyperparameters except where noted. Longformer benefits from longer sequences, global attention, separate projection matrices for global attention, MLM pretraining, and longer training. In addition, when configured as in RoBERTa-base (seqlen 512 and n^2 attention), Longformer performs slightly worse than RoBERTa-base, confirming that performance gains are not due to additional pretraining. Performance drops slightly when using the RoBERTa model pretrained while only unfreezing the additional position embeddings, showing that Longformer can learn to use long range context in task-specific fine-tuning with large training datasets such as WikiHop.

### BibTeX entry and citation info

```bibtex
@article{DBLP:journals/corr/abs-2004-05150,
  author    = {Iz Beltagy and
               Matthew E. Peters and
               Arman Cohan},
  title     = {Longformer: The Long-Document Transformer},
  journal   = {CoRR},
  volume    = {abs/2004.05150},
  year      = {2020},
  url       = {http://arxiv.org/abs/2004.05150},
  archivePrefix = {arXiv},
  eprint    = {2004.05150},
  timestamp = {Wed, 22 Apr 2020 14:29:36 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2004-05150.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
```