---
language: en
tags:
- exbert
license: apache-2.0
datasets:
- bookcorpus
- wikipedia
- trivia_qa
---
# Longformer
longformer-base-4096 is a BERT-like model started from the RoBERTa checkpoint and pretrained for MLM on long documents. It supports sequences of up to 4,096 tokens.
It was introduced in
[this paper](https://arxiv.org/abs/2004.05150) and first released in
[this repository](https://github.com/allenai/longformer). Longformer uses a combination of sliding window (local) attention and global attention.
Global attention is user-configured based on the task, allowing the model to learn task-specific representations.
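As a quick orientation, here is a minimal usage sketch with the 🤗 Transformers library, assuming the public allenai/longformer-base-4096 checkpoint. The toy input and the choice to place global attention only on the first token are illustrative, not prescriptions from the paper.

```python
import torch
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

long_text = " ".join(["Hello world."] * 1000)  # stand-in for a long document
inputs = tokenizer(long_text, return_tensors="pt", max_length=4096, truncation=True)

# By default every token uses sliding-window (local) attention; mark
# task-relevant tokens as global. Here only the first (<s>) token is global.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

with torch.no_grad():
    outputs = model(**inputs, global_attention_mask=global_attention_mask)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 768)
```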
## Model description
- Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length. To address this limitation, we introduce the Longformer with an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer. Longformer's attention mechanism is a drop-in replacement for the standard self-attention and combines a local windowed attention with a task-motivated global attention. Following prior work on long-sequence transformers, we evaluate Longformer on character-level language modeling and achieve state-of-the-art results on text8 and enwik8. In contrast to most prior work, we also pretrain Longformer and finetune it on a variety of downstream tasks. Our pretrained Longformer consistently outperforms RoBERTa on long document tasks and sets new state-of-the-art results on WikiHop and TriviaQA. We finally introduce the Longformer-Encoder-Decoder (LED), a Longformer variant for supporting long document generative sequence-to-sequence tasks, and demonstrate its effectiveness on the arXiv summarization dataset.
- The original Transformer model has a self-attention component with O(n^2) time and memory complexity, where n is the input sequence length. To address this challenge, the full self-attention matrix is sparsified according to an "attention pattern" specifying pairs of input locations attending to one another. Unlike full self-attention, the proposed attention pattern scales linearly with the input sequence length, making it efficient for longer sequences. The paper discusses the design and implementation of this attention pattern in detail.
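To make the linear-scaling claim concrete, here is a small illustrative sketch that counts attended position pairs under a fixed-size sliding window versus full self-attention. It is not the model's optimized implementation (which uses specialized banded attention kernels); the window size of 512 matches the attention window of longformer-base-4096, everything else is a toy choice.

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask: position i may attend to j iff |i - j| <= window // 2."""
    idx = torch.arange(seq_len)
    return (idx[None, :] - idx[:, None]).abs() <= window // 2

seq_len, window = 4096, 512
local_pairs = sliding_window_mask(seq_len, window).sum().item()
full_pairs = seq_len * seq_len

# Local attention grows as O(seq_len * window); full attention as O(seq_len^2).
print(local_pairs)  # roughly seq_len * (window + 1) attended pairs
print(full_pairs)   # 16,777,216 attended pairs
```

Task-specific global attention adds a small number of fully attended rows and columns on top of this local pattern, so the overall cost stays linear in the sequence length.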
## Dataset and Task
For character-level language modeling, we follow prior work and evaluate on text8 and enwik8 (Mahoney, 2009). Both contain 100M characters from Wikipedia, split into 90M/5M/5M characters for train/dev/test.

For finetuning, we evaluate on WikiHop, TriviaQA, HotpotQA, OntoNotes, IMDB, and Hyperpartisan news detection.
## Tokenizer and Vocabulary Size
To prepare the data for input to Longformer and RoBERTa, we first tokenize the question, answer candidates, and support contexts using RoBERTa's wordpiece tokenizer. The special tokens [q], [/q], [ent], [/ent] were added to the RoBERTa vocabulary and randomly initialized before task finetuning.

Note: a similar strategy was used for all tasks, and the vocabulary size matches RoBERTa's vocabulary.
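As an illustration of the strategy above, here is a hedged sketch of registering the task markers and resizing the embedding matrix with 🤗 Transformers. The token strings follow the description above; the call pattern is an assumption about how one would reproduce it, not the authors' training code.

```python
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

# Register the task markers; their embeddings start randomly initialized
# and are learned during task finetuning.
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["[q]", "[/q]", "[ent]", "[/ent]"]}
)
model.resize_token_embeddings(len(tokenizer))

print(len(tokenizer))  # RoBERTa-sized vocabulary plus the four added markers
```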
### Computational Resources
- Character-level language modeling: we ran the small model experiments on 4 RTX 8000 GPUs for 16 days. For the large model, we ran experiments on 8 RTX 8000 GPUs for 13 days.
- WikiHop: all models were trained on a single RTX 8000 GPU, with Longformer-base taking about a day for 5 epochs.
- TriviaQA: we ran our experiments on 32GB V100 GPUs. The small model takes 1 day to train on 4 GPUs, while the large model takes 1 day on 8 GPUs.
- HotpotQA: our experiments were done on RTX 8000 GPUs, and training each epoch takes approximately half a day on 4 GPUs.
- Text classification: experiments were done on a single RTX 8000 GPU.
### Pretraining Objective
We pretrain Longformer with masked language modeling (MLM), where the goal is to recover randomly masked tokens in a sequence.
Any biases present in the pretraining data will also affect all fine-tuned versions of this model.
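For reference, a minimal MLM sketch with the pretrained checkpoint; the masked sentence and the single-token decoding are illustrative only.

```python
import torch
from transformers import LongformerForMaskedLM, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerForMaskedLM.from_pretrained("allenai/longformer-base-4096")

text = "Longformer processes documents with thousands of <mask> using sparse attention."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Recover the model's top prediction for the masked position.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
print(tokenizer.decode(logits[0, mask_pos].argmax(dim=-1)))
```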
## Training Setup
1. MLM pretraining: we train two model sizes, a base model and a large model. Both models are trained for 65K gradient updates with sequence length 4,096 and batch size 64 (2^18 tokens), a maximum learning rate of 3e-5, a linear warmup of 500 steps, followed by a power-3 polynomial decay. The rest of the hyperparameters are the same as RoBERTa. (A schedule sketch follows this list.)
2. Character-level language modeling: hyperparameters for the best-performing model are reported in the paper.
3. QA models: all models use a similar scheduler with linear warmup and decay.
4. Coreference resolution: the maximum sequence length was 384 for RoBERTa-base, chosen after three trials from [256, 384, 512] using the default hyperparameters in the original implementation. For Longformer-base the sequence length was 4,096.
5. Coreference resolution: hyperparameter searches were minimal and consisted of grid searches of the RoBERTa LR in [1e-5, 2e-5, 3e-5] and the task LR in [1e-4, 2e-4, 3e-4] for both RoBERTa and Longformer, for a fair comparison. The best configuration for Longformer-base was RoBERTa lr=1e-5, task lr=1e-4. All other hyperparameters were the same as in the original implementation.
6. Text classification: we used the Adam optimizer with batch size 32 and linear warmup and decay, with warmup steps equal to 0.1 of the total training steps. For both IMDB and Hyperpartisan news we did a grid search over LRs [3e-5, 5e-5] and epochs [10, 15, 20] and found LR 3e-5 with 15 epochs to work best.
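As referenced in item 1, here is a hedged sketch of the MLM pretraining schedule (65K updates, 500 warmup steps, peak LR 3e-5, power-3 polynomial decay) expressed with 🤗 Transformers utilities. The optimizer choice and the omitted data pipeline are assumptions, not the authors' released training script.

```python
import torch
from transformers import LongformerForMaskedLM, get_polynomial_decay_schedule_with_warmup

model = LongformerForMaskedLM.from_pretrained("allenai/longformer-base-4096")

# Peak LR 3e-5 with 500 linear warmup steps, then power-3 polynomial decay
# over 65K gradient updates, mirroring the hyperparameters in item 1.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
scheduler = get_polynomial_decay_schedule_with_warmup(
    optimizer,
    num_warmup_steps=500,
    num_training_steps=65_000,
    power=3.0,
)
# In a training loop, call optimizer.step() then scheduler.step() after each update.
```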
## Training procedure
### Preprocessing
"For WikiHop:
To prepare the data for input to Longformer
and RoBERTa, we first tokenize the question,
answer candidates, and support contexts using
RoBERTa’s wordpiece tokenizer.
Then we
concatenate the question and answer candi-
dates with special tokens as [q] question
[/q] [ent] candidate1 [/ent] ...
[ent] candidateN [/ent]. The contexts
are also concatenated using RoBERTa’s doc-
ument delimiter tokens as separators: </s>
context1 </s> ... </s> contextM
</s>.
The special tokens [q], [/q],
[ent], [/ent] were added to the RoBERTa
vocabulary and randomly initialized before task
finetuning.
For TriviaQA: Similar to WikiHop, we tokenize the question
and the document using RoBERTa’s tokenizer,
then form the input as [s] question [/s] document [/s]. We truncate the document at 4,096 wordpiece to avoid it being very slow.
For HotpotQA: Similar to Wikihop and
TriviaQA, to prepare the data for input to Long-
former, we concatenate question and then all the
10 paragraphs in one long context. We particu-
larly use the following input format with special
tokens: “[CLS] [q] question [/q] <t>
title1 </t> sent1,1 [s] sent1,2 [s] ... <t> title2 </t> sent2,1 [s] sent2,2
[s] ...” where [q], [/q], <t>, </t>, [s],
[p] are special tokens representing, question start
and end, paragraph title start and end, and sentence,
respectively. The special tokens were added to the
Longformer vocabulary and randomly initialized
before task finetuning."
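To make the WikiHop format above concrete, here is a small illustrative helper. The function name and the example fields are hypothetical; only the token layout follows the description above, and tokenization is left out.

```python
def build_wikihop_input(question: str, candidates: list[str], supports: list[str]) -> str:
    """Assemble the WikiHop input string described above (layout only, no tokenization)."""
    question_part = f"[q] {question} [/q]"
    candidate_part = " ".join(f"[ent] {c} [/ent]" for c in candidates)
    context_part = " </s> ".join(supports)
    return f"{question_part} {candidate_part} </s> {context_part} </s>"

# Hypothetical toy example, just to show the resulting layout.
print(build_wikihop_input(
    question="country_of_citizenship juan rulfo",
    candidates=["mexico", "spain"],
    supports=["Juan Rulfo was a Mexican writer ...", "Spain is a country in Europe ..."],
))
```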
### Experiment
1. Character-level language modeling: a) to compare to prior work we focus on character-level LM (text8 and enwik8; Mahoney, 2009). b) Tables 2 and 3 of the paper summarize evaluation results on the text8 and enwik8 datasets. We achieve a new state-of-the-art on both using the small models, with BPC of 1.10 and 1.00 on text8 and enwik8 respectively, demonstrating the effectiveness of our model.
2. Pretraining: a) we pretrain Longformer with masked language modeling (MLM), where the goal is to recover randomly masked tokens in a sequence. b) Table 5 of the paper reports MLM BPC for RoBERTa and various pretrained Longformer configurations.
3. WikiHop: instances in WikiHop consist of a question, answer candidates (ranging from two to 79 candidates), supporting contexts (ranging from three to 63 paragraphs), and the correct answer. The dataset does not provide any intermediate annotation for the multi-hop reasoning chains, requiring models to instead infer them from the indirect answer supervision.
4. TriviaQA: TriviaQA has more than 100K question, answer, document triplets for training. Documents are Wikipedia articles, and answers are named entities mentioned in the article. The span that answers the question is not annotated, but is found using simple text matching.
5. HotpotQA: the HotpotQA dataset involves answering questions from a set of 10 paragraphs from 10 different Wikipedia articles, where 2 paragraphs are relevant to the question and the rest are distractors. It includes two tasks: answer span extraction and evidence sentence identification. Our model for HotpotQA combines both answer span extraction and evidence extraction in one joint model.
6. Coreference resolution: the coreference model is a straightforward adaptation of the coarse-to-fine BERT-based model from Joshi et al. (2019).
7. Text classification: for classification, following BERT, we used a simple binary cross entropy loss on top of the first [CLS] token, with the addition of global attention to [CLS]. (A classification sketch follows this list.)
8. Evaluation metrics for finetuned tasks: finetuning results on QA, coreference resolution, and document classification are reported on the development sets, comparing Longformer-base with RoBERTa-base. TriviaQA and Hyperpartisan use F1, WikiHop and IMDB use accuracy, HotpotQA uses joint F1, and OntoNotes uses average F1.
9. Summarization: a) we evaluate LED on the summarization task using the arXiv summarization dataset (Cohan et al.), which focuses on long document summarization in the scientific domain. b) Table 11 of the paper reports summarization results of the Longformer-Encoder-Decoder (LED) on the arXiv dataset; the metrics are ROUGE-1, ROUGE-2, and ROUGE-L.
## Ablation
The ablation study for WikiHop is on the development set. All results use Longformer-base, fine-tuned for five epochs with identical hyperparameters except where noted. Longformer benefits from longer sequences, global attention, separate projection matrices for global attention, MLM pretraining, and longer training. In addition, when configured as in RoBERTa-base (seqlen 512 and n^2 attention), Longformer performs slightly worse than RoBERTa-base, confirming that performance gains are not due to additional pretraining. Performance drops only slightly for the Longformer variant pretrained from RoBERTa with only the additional position embeddings unfrozen, showing that Longformer can learn to use long-range context in task-specific fine-tuning with large training datasets such as WikiHop.
### BibTeX entry and citation info
```bibtex
@article{DBLP:journals/corr/abs-2004-05150,
author = {Iz Beltagy and
Matthew E. Peters and
Arman Cohan},
title = {Longformer: The Long-Document Transformer},
journal = {CoRR},
volume = {abs/2004.05150},
year = {2020},
url = {http://arxiv.org/abs/2004.05150},
archivePrefix = {arXiv},
eprint = {2004.05150},
timestamp = {Wed, 22 Apr 2020 14:29:36 +0200},
biburl = {https://dblp.org/rec/journals/corr/abs-2004-05150.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
```