---
language: en
tags:
- exbert
license: apache-2.0
datasets:
- bookcorpus
- wikipedia
- trivia_qa
---

# Longformer

longformer-base-4096 is a BERT-like model initialized from the RoBERTa checkpoint and pretrained with masked language modeling (MLM) on long documents. It supports sequences of up to 4,096 tokens.

It was introduced in [this paper](https://arxiv.org/abs/2004.05150) and first released in [this repository](https://github.com/allenai/longformer). Longformer uses a combination of sliding-window (local) attention and global attention. Global attention is user-configured based on the task, allowing the model to learn task-specific representations.

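
The snippet below is a minimal sketch of loading this checkpoint with the Hugging Face `transformers` library and marking one token for global attention. The checkpoint name `allenai/longformer-base-4096` and the choice of putting global attention only on the first token are illustrative assumptions; the right global positions depend on the task.

```python
import torch
from transformers import LongformerModel, LongformerTokenizer

# Load the pretrained checkpoint (assumed to be the published allenai/longformer-base-4096).
tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

text = "Replace this with a long document of up to 4,096 tokens. " * 50
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)

# All tokens use sliding-window (local) attention by default; here we mark only
# the first (<s>) token for global attention as a task-neutral example.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

outputs = model(**inputs, global_attention_mask=global_attention_mask)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, hidden_size)
```
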
## Model description

Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length. The paper also introduces the Longformer-Encoder-Decoder (LED), a Longformer variant supporting long-document generative sequence-to-sequence tasks, and demonstrates its effectiveness on the arXiv summarization dataset.

- Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length. To address this limitation, we introduce the Longformer with an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer. Longformer's attention mechanism is a drop-in replacement for the standard self-attention and combines a local windowed attention with a task-motivated global attention. Following prior work on long-sequence transformers, we evaluate Longformer on character-level language modeling and achieve state-of-the-art results on text8 and enwik8. In contrast to most prior work, we also pretrain Longformer and finetune it on a variety of downstream tasks. Our pretrained Longformer consistently outperforms RoBERTa on long document tasks and sets new state-of-the-art results on WikiHop and TriviaQA. We finally introduce the Longformer-Encoder-Decoder (LED), a Longformer variant for supporting long document generative sequence-to-sequence tasks, and demonstrate its effectiveness on the arXiv summarization dataset.

- The original Transformer model has a self-attention component with O(n^2) time and memory complexity, where n is the input sequence length. To address this challenge, we sparsify the full self-attention matrix according to an "attention pattern" specifying pairs of input locations attending to one another. Unlike the full self-attention, our proposed attention pattern scales linearly with the input sequence, making it efficient for longer sequences.

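
As a rough illustration of this sparse pattern (not the actual chunked implementation, which never materializes a dense attention matrix), the sketch below builds a boolean mask combining a sliding window with a few user-chosen global positions. The window size and global indices are arbitrary example values.

```python
import torch

def longformer_attention_pattern(seq_len: int, window: int, global_idx):
    """Illustrative boolean mask of the sparse attention pattern: each token
    attends to a local window of `window` tokens on each side, and
    globally-attended positions attend to, and are attended by, every token."""
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    for i in range(seq_len):
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        mask[i, lo:hi] = True          # sliding-window (local) attention
    for g in global_idx:
        mask[g, :] = True              # global token attends everywhere
        mask[:, g] = True              # every token attends to the global token
    return mask

pattern = longformer_attention_pattern(seq_len=16, window=2, global_idx=[0])
print(pattern.int())
```
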
## Dataset and Task

To compare to prior work we focus on character-level language modeling (text8 and enwik8; Mahoney, 2009).

For finetuned tasks: WikiHop, TriviaQA, HotpotQA, OntoNotes, IMDB, Hyperpartisan.

We evaluate on text8 and enwik8, both of which contain 100M characters from Wikipedia, split into 90M, 5M, and 5M characters for train, dev, and test.

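
The split described above is a simple contiguous character split; the sketch below reproduces it for text8 (the local file path is an assumption).

```python
def split_char_corpus(path: str = "text8"):
    """Split a 100M-character corpus into 90M/5M/5M train/dev/test slices."""
    with open(path, "r", encoding="utf-8") as f:
        data = f.read()                    # 100M characters of Wikipedia text
    train = data[:90_000_000]              # 90M characters
    dev = data[90_000_000:95_000_000]      # 5M characters
    test = data[95_000_000:]               # 5M characters
    return train, dev, test
```
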
## Tokenizer and Vocabulary Size

To prepare the data for input to Longformer and RoBERTa, we first tokenize the question, answer candidates, and support contexts using RoBERTa's wordpiece tokenizer. The special tokens [q], [/q], [ent], [/ent] were added to the RoBERTa vocabulary and randomly initialized before task finetuning.

NOTE: A similar strategy was used for all tasks, and the vocabulary size is the same as RoBERTa's vocabulary.

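
The sketch below shows one way to register such markers with the Hugging Face `transformers` API; the token strings are the ones quoted above, while the overall call pattern is a generic illustration rather than the released preprocessing code.

```python
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

# Add the WikiHop markers as additional special tokens; their embeddings are
# randomly initialized and then learned during task finetuning.
num_added = tokenizer.add_tokens(["[q]", "[/q]", "[ent]", "[/ent]"], special_tokens=True)
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens; new vocabulary size {len(tokenizer)}")
```
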
### Computational Resources

Character-level language modeling: We ran the small model experiments on 4 RTX8000 GPUs for 16 days. For the large model, we ran experiments on 8 RTX8000 GPUs for 13 days.

For WikiHop: All models were trained on a single RTX8000 GPU, with Longformer-base taking about a day for 5 epochs.

For TriviaQA: We ran our experiments on 32GB V100 GPUs. The small model takes 1 day to train on 4 GPUs, while the large model takes 1 day on 8 GPUs.

For HotpotQA: Our experiments were done on RTX8000 GPUs, and training each epoch takes approximately half a day on 4 GPUs.

Text classification: Experiments were done on a single RTX8000 GPU.

### Pretraining Objective

We pretrain Longformer with masked language modeling (MLM), where the goal is to recover randomly masked tokens in a sequence.

Any biases present in the pretraining data will also affect all fine-tuned versions of this model.

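
For intuition, the sketch below prepares MLM training batches with the standard `transformers` collator. The 15% masking rate follows BERT/RoBERTa and is an assumption here, since the exact rate is not restated in this card.

```python
from transformers import DataCollatorForLanguageModeling, LongformerTokenizerFast

tokenizer = LongformerTokenizerFast.from_pretrained("allenai/longformer-base-4096")

# Standard MLM masking: a fraction of tokens is replaced (mostly by the mask
# token) and the model is trained to recover the original tokens.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

example = tokenizer("A long document to be masked ...", truncation=True, max_length=4096)
batch = collator([example])
print(batch["input_ids"].shape, batch["labels"].shape)
```
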
## Training Setup

1. [For MLM pretraining] We train two model sizes, a base model and a large model. Both models are trained for 65K gradient updates with sequence length 4,096, batch size 64 (2^18 tokens), a maximum learning rate of 3e-5, a linear warmup of 500 steps, followed by a power 3 polynomial decay (a sketch of this schedule follows this list). The rest of the hyperparameters are the same as RoBERTa.

2. Hyperparameters for the best performing model for character-level language modeling are reported in the paper.

3. Hyperparameters of the QA models: all models use a similar scheduler with linear warmup and decay.

4. [For coreference resolution] The maximum sequence length was 384 for RoBERTa-base, chosen after three trials from [256, 384, 512] using the default hyperparameters in the original implementation. For Longformer-base the sequence length was 4,096.

5. [For coreference resolution] Hyperparameter searches were minimal and consisted of grid searches of RoBERTa LR in [1e-5, 2e-5, 3e-5] and task LR in [1e-4, 2e-4, 3e-4] for both RoBERTa and Longformer for a fair comparison. The best configuration for Longformer-base was RoBERTa lr=1e-5, task lr=1e-4. All other hyperparameters were the same as in the original implementation.

6. [For text classification] We used the Adam optimizer with batch sizes of 32 and linear warmup and decay with warmup steps equal to 0.1 of the total training steps. For both IMDB and Hyperpartisan news we did a grid search of LRs [3e-5, 5e-5] and epochs [10, 15, 20] and found the model with LR 3e-5 and 15 epochs to work best.

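
The sketch below reproduces the pretraining schedule from item 1 (peak LR 3e-5, 500 linear warmup steps, power 3 polynomial decay over 65K updates) with the `transformers` scheduler helper; the dummy linear layer stands in for the actual Longformer parameters.

```python
import torch
from transformers import get_polynomial_decay_schedule_with_warmup

# Placeholder parameters; in real pretraining these would be the Longformer weights.
params = torch.nn.Linear(8, 8).parameters()
optimizer = torch.optim.AdamW(params, lr=3e-5)

scheduler = get_polynomial_decay_schedule_with_warmup(
    optimizer,
    num_warmup_steps=500,
    num_training_steps=65_000,
    power=3.0,
)

for step in range(3):
    optimizer.step()       # in a real loop this follows the backward pass
    scheduler.step()
    print(step, scheduler.get_last_lr())
```
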
## Training procedure

### Preprocessing

For WikiHop: To prepare the data for input to Longformer and RoBERTa, we first tokenize the question, answer candidates, and support contexts using RoBERTa's wordpiece tokenizer. Then we concatenate the question and answer candidates with special tokens as `[q] question [/q] [ent] candidate1 [/ent] ... [ent] candidateN [/ent]`. The contexts are also concatenated using RoBERTa's document delimiter tokens as separators: `</s> context1 </s> ... </s> contextM </s>`. The special tokens [q], [/q], [ent], [/ent] were added to the RoBERTa vocabulary and randomly initialized before task finetuning.

For TriviaQA: Similar to WikiHop, we tokenize the question and the document using RoBERTa's tokenizer, then form the input as `[s] question [/s] document [/s]`. We truncate the document at 4,096 wordpieces to avoid it being very slow.

For HotpotQA: Similar to WikiHop and TriviaQA, to prepare the data for input to Longformer, we concatenate the question and then all 10 paragraphs in one long context. We particularly use the following input format with special tokens: `[CLS] [q] question [/q] <t> title1 </t> sent1,1 [s] sent1,2 [s] ... <t> title2 </t> sent2,1 [s] sent2,2 [s] ...`, where `[q]`, `[/q]`, `<t>`, `</t>`, `[s]`, `[p]` are special tokens representing question start and end, paragraph title start and end, and sentence, respectively. The special tokens were added to the Longformer vocabulary and randomly initialized before task finetuning.

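
To make the WikiHop layout above concrete, here is a hypothetical helper that assembles the input string; the function name, arguments, and sample values are illustrative assumptions, not the released preprocessing code.

```python
def build_wikihop_input(question, candidates, contexts):
    """Sketch of the WikiHop input format described above."""
    q_part = f"[q] {question} [/q] " + " ".join(
        f"[ent] {c} [/ent]" for c in candidates
    )
    ctx_part = "</s> " + " </s> ".join(contexts) + " </s>"
    return q_part + " " + ctx_part

print(build_wikihop_input(
    "country_of_origin anfield",
    ["england", "france"],
    ["Anfield is a football stadium ...", "Liverpool F.C. play at Anfield ..."],
))
```
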
### Experiment

1. Character-level language modeling: a) To compare to prior work we focus on character-level LM (text8 and enwik8; Mahoney, 2009).

b) Tables 2 and 3 of the paper summarize evaluation results on the text8 and enwik8 datasets. We achieve a new state-of-the-art on both text8 and enwik8 using the small models, with BPC of 1.10 and 1.00 on text8 and enwik8 respectively, demonstrating the effectiveness of our model.

2. Pretraining: a) We pretrain Longformer with masked language modeling (MLM), where the goal is to recover randomly masked tokens in a sequence.

b) Table 5: MLM BPC for RoBERTa and various pretrained Longformer configurations.

3. WikiHop: Instances in WikiHop consist of: a question, answer candidates (ranging from two candidates to 79 candidates), supporting contexts (ranging from three paragraphs to 63 paragraphs), and the correct answer. The dataset does not provide any intermediate annotation for the multihop reasoning chains, requiring models to instead infer them from the indirect answer supervision.

4. TriviaQA: TriviaQA has more than 100K question, answer, document triplets for training. Documents are Wikipedia articles, and answers are named entities mentioned in the article. The span that answers the question is not annotated, but it is found using simple text matching.

5. HotpotQA: The HotpotQA dataset involves answering questions from a set of 10 paragraphs from 10 different Wikipedia articles, where 2 paragraphs are relevant to the question and the rest are distractors. It includes 2 tasks of answer span extraction and evidence sentence identification. Our model for HotpotQA combines both answer span extraction and evidence extraction in one joint model.

6. Coreference model: The coreference model is a straightforward adaptation of the coarse-to-fine BERT based model from Joshi et al. (2019).

7. Text classification: For classification, following BERT, we used a simple binary cross entropy loss on top of the first [CLS] token, with the addition of global attention to [CLS] (see the sketch after this list).

8. Evaluation metrics for finetuned tasks: The paper summarizes finetuning results on QA, coreference resolution, and document classification. Results are on the development sets, comparing our Longformer-base with RoBERTa-base. TriviaQA and Hyperpartisan metrics are F1, WikiHop and IMDB use accuracy, HotpotQA is joint F1, and OntoNotes is average F1.

9. Summarization: a) We evaluate LED on the summarization task using the arXiv summarization dataset (Cohan et al.), which focuses on long document summarization in the scientific domain.

b) Table 11: Summarization results of Longformer-Encoder-Decoder (LED) on the arXiv dataset. Metrics from left to right are ROUGE-1, ROUGE-2 and ROUGE-L.

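
The sketch below illustrates the classification setup from item 7 with the `transformers` classification head: global attention is placed on the first token and a standard classification loss is used on top of it. The input text and the two-label setup are example assumptions.

```python
import torch
from transformers import LongformerForSequenceClassification, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerForSequenceClassification.from_pretrained(
    "allenai/longformer-base-4096", num_labels=2
)

inputs = tokenizer(
    "A long movie review ...", return_tensors="pt", truncation=True, max_length=4096
)

# Global attention on the first (<s>, i.e. [CLS]-like) token, as in item 7 above;
# all other tokens keep sliding-window local attention.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

logits = model(**inputs, global_attention_mask=global_attention_mask).logits
print(logits)
```
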
## Ablation

Ablation study for WikiHop on the development set. All results use Longformer-base, fine-tuned for five epochs with identical hyperparameters except where noted. Longformer benefits from longer sequences, global attention, separate projection matrices for global attention, MLM pretraining, and longer training. In addition, when configured as in RoBERTa-base (seqlen 512 and n^2 attention), Longformer performs slightly worse than RoBERTa-base, confirming that performance gains are not due to additional pretraining. Performance drops slightly when using the RoBERTa model pretrained while only unfreezing the additional position embeddings, showing that Longformer can learn to use long range context in task-specific fine-tuning with large training datasets such as WikiHop.

### BibTeX entry and citation info

```bibtex
@article{DBLP:journals/corr/abs-2004-05150,
  author    = {Iz Beltagy and
               Matthew E. Peters and
               Arman Cohan},
  title     = {Longformer: The Long-Document Transformer},
  journal   = {CoRR},
  volume    = {abs/2004.05150},
  year      = {2020},
  url       = {http://arxiv.org/abs/2004.05150},
  archivePrefix = {arXiv},
  eprint    = {2004.05150},
  timestamp = {Wed, 22 Apr 2020 14:29:36 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2004-05150.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
```