Hitesh1501 committed
Commit bc0eba5
1 Parent(s): 314bbe2

Update README.md

Files changed (1)
  1. README.md +218 -192

README.md CHANGED
@@ -9,7 +9,7 @@ datasets:
  - trivia_qa
  ---
 
- # BERT base model (uncased)
 
  longformer-base-4096 is a BERT-like model started from the RoBERTa checkpoint and pretrained for MLM on long documents. It supports sequences of length up to 4,096.
  It was introduced in
@@ -24,7 +24,7 @@ Transformer-based models are unable to process long sequences due to their self-
  Longformer-Encoder-Decoder (LED), a Longformer variant for supporting long document generative sequence-to-sequence tasks,
  and demonstrate its effectiveness on the arXiv summarization dataset.
 
- "Transformer-based models are unable to pro-
  cess long sequences due to their self-attention
  operation, which scales quadratically with the
  sequence length. To address this limitation,
@@ -50,223 +50,249 @@ Longformer-Encoder-Decoder (LED), a Long-
  former variant for supporting long document
  generative sequence-to-sequence tasks, and
  demonstrate its effectiveness on the arXiv sum-
- marization dataset."
- - Next sentence prediction (NSP): the models concatenates two masked sentences as inputs during pretraining. Sometimes
- they correspond to sentences that were next to each other in the original text, sometimes not. The model then has to
- predict if the two sentences were following each other or not.
-
- This way, the model learns an inner representation of the English language that can then be used to extract features
- useful for downstream tasks: if you have a dataset of labeled sentences, for instance, you can train a standard
- classifier using the features produced by the BERT model as inputs.
-
- ## Model variations
-
- BERT has originally been released in base and large variations, for cased and uncased input text. The uncased models also strips out an accent markers.
- Chinese and multilingual uncased and cased versions followed shortly after.
- Modified preprocessing with whole word masking has replaced subpiece masking in a following work, with the release of two models.
- Other 24 smaller models are released afterward.
-
- The detailed release history can be found on the [google-research/bert readme](https://github.com/google-research/bert/blob/master/README.md) on github.
-
- | Model | #params | Language |
- |------------------------|--------------------------------|-------|
- | [`bert-base-uncased`](https://huggingface.co/bert-base-uncased) | 110M | English |
- | [`bert-large-uncased`](https://huggingface.co/bert-large-uncased) | 340M | English | sub
- | [`bert-base-cased`](https://huggingface.co/bert-base-cased) | 110M | English |
- | [`bert-large-cased`](https://huggingface.co/bert-large-cased) | 340M | English |
- | [`bert-base-chinese`](https://huggingface.co/bert-base-chinese) | 110M | Chinese |
- | [`bert-base-multilingual-cased`](https://huggingface.co/bert-base-multilingual-cased) | 110M | Multiple |
- | [`bert-large-uncased-whole-word-masking`](https://huggingface.co/bert-large-uncased-whole-word-masking) | 340M | English |
- | [`bert-large-cased-whole-word-masking`](https://huggingface.co/bert-large-cased-whole-word-masking) | 340M | English |
-
- ## Intended uses & limitations
-
- You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to
- be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?filter=bert) to look for
- fine-tuned versions of a task that interests you.
-
- Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked)
- to make decisions, such as sequence classification, token classification or question answering. For tasks such as text
- generation you should look at model like GPT2.
-
- ### How to use
-
- You can use this model directly with a pipeline for masked language modeling:
-
- ```python
- >>> from transformers import pipeline
- >>> unmasker = pipeline('fill-mask', model='bert-base-uncased')
- >>> unmasker("Hello I'm a [MASK] model.")
-
- [{'sequence': "[CLS] hello i'm a fashion model. [SEP]",
- 'score': 0.1073106899857521,
- 'token': 4827,
- 'token_str': 'fashion'},
- {'sequence': "[CLS] hello i'm a role model. [SEP]",
- 'score': 0.08774490654468536,
- 'token': 2535,
- 'token_str': 'role'},
- {'sequence': "[CLS] hello i'm a new model. [SEP]",
- 'score': 0.05338378623127937,
- 'token': 2047,
- 'token_str': 'new'},
- {'sequence': "[CLS] hello i'm a super model. [SEP]",
- 'score': 0.04667217284440994,
- 'token': 3565,
- 'token_str': 'super'},
- {'sequence': "[CLS] hello i'm a fine model. [SEP]",
- 'score': 0.027095865458250046,
- 'token': 2986,
- 'token_str': 'fine'}]
- ```
 
- Here is how to use this model to get the features of a given text in PyTorch:
 
- ```python
- from transformers import BertTokenizer, BertModel
- tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
- model = BertModel.from_pretrained("bert-base-uncased")
- text = "Replace me by any text you'd like."
- encoded_input = tokenizer(text, return_tensors='pt')
- output = model(**encoded_input)
- ```
 
- and in TensorFlow:
 
- ```python
- from transformers import BertTokenizer, TFBertModel
- tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
- model = TFBertModel.from_pretrained("bert-base-uncased")
- text = "Replace me by any text you'd like."
- encoded_input = tokenizer(text, return_tensors='tf')
- output = model(encoded_input)
- ```
 
- ### Limitations and bias
-
- Even if the training data used for this model could be characterized as fairly neutral, this model can have biased
- predictions:
-
- ```python
- >>> from transformers import pipeline
- >>> unmasker = pipeline('fill-mask', model='bert-base-uncased')
- >>> unmasker("The man worked as a [MASK].")
-
- [{'sequence': '[CLS] the man worked as a carpenter. [SEP]',
- 'score': 0.09747550636529922,
- 'token': 10533,
- 'token_str': 'carpenter'},
- {'sequence': '[CLS] the man worked as a waiter. [SEP]',
- 'score': 0.0523831807076931,
- 'token': 15610,
- 'token_str': 'waiter'},
- {'sequence': '[CLS] the man worked as a barber. [SEP]',
- 'score': 0.04962705448269844,
- 'token': 13362,
- 'token_str': 'barber'},
- {'sequence': '[CLS] the man worked as a mechanic. [SEP]',
- 'score': 0.03788609802722931,
- 'token': 15893,
- 'token_str': 'mechanic'},
- {'sequence': '[CLS] the man worked as a salesman. [SEP]',
- 'score': 0.037680890411138535,
- 'token': 18968,
- 'token_str': 'salesman'}]
-
- >>> unmasker("The woman worked as a [MASK].")
-
- [{'sequence': '[CLS] the woman worked as a nurse. [SEP]',
- 'score': 0.21981462836265564,
- 'token': 6821,
- 'token_str': 'nurse'},
- {'sequence': '[CLS] the woman worked as a waitress. [SEP]',
- 'score': 0.1597415804862976,
- 'token': 13877,
- 'token_str': 'waitress'},
- {'sequence': '[CLS] the woman worked as a maid. [SEP]',
- 'score': 0.1154729500412941,
- 'token': 10850,
- 'token_str': 'maid'},
- {'sequence': '[CLS] the woman worked as a prostitute. [SEP]',
- 'score': 0.037968918681144714,
- 'token': 19215,
- 'token_str': 'prostitute'},
- {'sequence': '[CLS] the woman worked as a cook. [SEP]',
- 'score': 0.03042375110089779,
- 'token': 5660,
- 'token_str': 'cook'}]
- ```
 
- This bias will also affect all fine-tuned versions of this model.
 
- ## Training data
 
- The BERT model was pretrained on [BookCorpus](https://yknzhu.wixsite.com/mbweb), a dataset consisting of 11,038
- unpublished books and [English Wikipedia](https://en.wikipedia.org/wiki/English_Wikipedia) (excluding lists, tables and
- headers).
 
- ## Training procedure
 
- ### Preprocessing
 
- The texts are lowercased and tokenized using WordPiece and a vocabulary size of 30,000. The inputs of the model are
- then of the form:
 
- ```
- [CLS] Sentence A [SEP] Sentence B [SEP]
- ```
 
- With probability 0.5, sentence A and sentence B correspond to two consecutive sentences in the original corpus, and in
- the other cases, it's another random sentence in the corpus. Note that what is considered a sentence here is a
- consecutive span of text usually longer than a single sentence. The only constrain is that the result with the two
- "sentences" has a combined length of less than 512 tokens.
 
- The details of the masking procedure for each sentence are the following:
- - 15% of the tokens are masked.
- - In 80% of the cases, the masked tokens are replaced by `[MASK]`.
- - In 10% of the cases, the masked tokens are replaced by a random token (different) from the one they replace.
- - In the 10% remaining cases, the masked tokens are left as is.
 
- ### Pretraining
 
- The model was trained on 4 cloud TPUs in Pod configuration (16 TPU chips total) for one million steps with a batch size
- of 256. The sequence length was limited to 128 tokens for 90% of the steps and 512 for the remaining 10%. The optimizer
- used is Adam with a learning rate of 1e-4, \\(\beta_{1} = 0.9\\) and \\(\beta_{2} = 0.999\\), a weight decay of 0.01,
- learning rate warmup for 10,000 steps and linear decay of the learning rate after.
 
- ## Evaluation results
 
- When fine-tuned on downstream tasks, this model achieves the following results:
 
- Glue test results:
 
- | Task | MNLI-(m/mm) | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE | Average |
- |:----:|:-----------:|:----:|:----:|:-----:|:----:|:-----:|:----:|:----:|:-------:|
- | | 84.6/83.4 | 71.2 | 90.5 | 93.5 | 52.1 | 85.8 | 88.9 | 66.4 | 79.6 |
 
  ### BibTeX entry and citation info
 
  ```bibtex
- @article{DBLP:journals/corr/abs-1810-04805,
- author = {Jacob Devlin and
- Ming{-}Wei Chang and
- Kenton Lee and
- Kristina Toutanova},
- title = {{BERT:} Pre-training of Deep Bidirectional Transformers for Language
- Understanding},
  journal = {CoRR},
- volume = {abs/1810.04805},
- year = {2018},
- url = {http://arxiv.org/abs/1810.04805},
  archivePrefix = {arXiv},
- eprint = {1810.04805},
- timestamp = {Tue, 30 Oct 2018 20:39:56 +0100},
- biburl = {https://dblp.org/rec/journals/corr/abs-1810-04805.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
  }
- ```
 
- <a href="https://huggingface.co/exbert/?model=bert-base-uncased">
- <img width="300px" src="https://cdn-media.huggingface.co/exbert/button.png">
- </a>
 
  - trivia_qa
  ---
 
+ # Longformer
 
  longformer-base-4096 is a BERT-like model started from the RoBERTa checkpoint and pretrained for MLM on long documents. It supports sequences of length up to 4,096.
  It was introduced in
 
  Longformer-Encoder-Decoder (LED), a Longformer variant for supporting long document generative sequence-to-sequence tasks,
  and demonstrate its effectiveness on the arXiv summarization dataset.
 
+ - Transformer-based models are unable to pro-
  cess long sequences due to their self-attention
  operation, which scales quadratically with the
  sequence length. To address this limitation,
 
  former variant for supporting long document
  generative sequence-to-sequence tasks, and
  demonstrate its effectiveness on the arXiv sum-
+ marization dataset.
+ - The original Transformer model has a self-attention component with O(n^2) time and memory complexity where n is the input sequence length. To address this challenge, we sparsify the full self-attention matrix according to an “attention pattern” specifying pairs of input locations attending to one another. Unlike the full self-attention, our proposed attention pattern scales linearly with the input sequence, making it efficient for longer sequences. This section discusses the design and implementation of this attention pattern.
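
The "attention pattern" quoted above can be made concrete with a short sketch. This is an illustrative example only, not the model's actual implementation (which lives in the `transformers` Longformer code): it builds a sliding-window mask and counts attended position pairs to show linear rather than quadratic growth with sequence length.

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask where position i may attend to positions within +/- window // 2."""
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window // 2

for n in (512, 1024, 2048, 4096):
    mask = sliding_window_mask(n, window=512)
    # Full self-attention scores n * n pairs; the windowed pattern scores
    # roughly n * (window + 1) pairs, i.e. it grows linearly in n.
    print(n, int(mask.sum()), n * n)
```

Note that this sketch omits the dilated windows and the task-specific global attention that the paper adds on top of the basic sliding window.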
 
 
+ ## Dataset and Task
 
+ For language modelling, to compare to prior work we focus on character-level LM (text8 and enwik8; Mahoney, 2009).
+ For finetuned tasks we use: WikiHop, TriviaQA, HotpotQA, OntoNotes, IMDB, and Hyperpartisan.
 
+ We evaluate on text8 and enwik8; both contain 100M characters from Wikipedia, split into 90M, 5M, and 5M for train, dev, and test.
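
A minimal sketch of that 90M/5M/5M character split (not from the original card; it assumes text8 has been downloaded and unzipped locally, e.g. from http://mattmahoney.net/dc/text8.zip):

```python
# Split a 100M-character corpus into train/dev/test exactly as described above.
with open("text8", "r", encoding="utf-8") as f:  # hypothetical local path
    raw = f.read()

train = raw[:90_000_000]
dev = raw[90_000_000:95_000_000]
test = raw[95_000_000:]
print(len(train), len(dev), len(test))
```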
 
+ ## Tokenizer with Vocabulary size
 
+ To prepare the data for input to Longformer and RoBERTa, we first tokenize the question, answer candidates, and support contexts using RoBERTa's wordpiece tokenizer. The special tokens [q], [/q], [ent], [/ent] were added to the RoBERTa vocabulary and randomly initialized before task finetuning.
 
+ NOTE: A similar strategy was used for all tasks, and the vocabulary size is similar to RoBERTa's vocabulary.
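
A hedged sketch of that step with the `transformers` API (the checkpoint name `allenai/longformer-base-4096` and the example sentence are assumptions, not part of the card):

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
model = AutoModel.from_pretrained("allenai/longformer-base-4096")

# Register the WikiHop-style markers; their embeddings start randomly
# initialized and are learned during task finetuning.
tokenizer.add_special_tokens({"additional_special_tokens": ["[q]", "[/q]", "[ent]", "[/ent]"]})
model.resize_token_embeddings(len(tokenizer))

ids = tokenizer("[q] who wrote hamlet? [/q] [ent] shakespeare [/ent]")["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))
```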
 
+ ### Computational Resources
 
+ Character-level language modelling: we ran the small-model experiments on 4 RTX8000 GPUs for 16 days. For the large model, we ran experiments on 8 RTX8000 GPUs for 13 days.
 
+ For WikiHop: all models were trained on a single RTX8000 GPU, with Longformer-base taking about a day for 5 epochs.
 
+ For TriviaQA: we ran our experiments on 32GB V100 GPUs. The small model takes 1 day to train on 4 GPUs, while the large model takes 1 day on 8 GPUs.
 
+ For HotpotQA: our experiments are done on RTX8000 GPUs and training each epoch takes approximately half a day on 4 GPUs.
 
+ Text classification: experiments were done on a single RTX8000 GPU.
 
+ ### Pretraining Objective
 
+ We pretrain Longformer with masked language modeling (MLM), where the goal is to recover randomly masked tokens in a sequence.
 
+ Biases present in the pretraining data will also affect all fine-tuned versions of this model.
 
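A quick illustration of the MLM objective (not part of the original card; it assumes the `allenai/longformer-base-4096` checkpoint, whose mask token is `<mask>`):

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="allenai/longformer-base-4096")
# The MLM head predicts the token hidden behind <mask>.
print(unmasker("Longformer uses an attention pattern that scales <mask> with the sequence length."))
```
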
+ ## Training Setup
 
+ 1. [For MLM pretraining] We train two model sizes, a base model and a large model. Both models are trained for 65K gradient updates with sequence length 4,096, batch size 64 (2^18 tokens), maximum learning rate of 3e-5, linear warmup of 500 steps, followed by a power 3 polynomial decay. The rest of the hyperparameters are the same as RoBERTa. (A hedged optimizer/scheduler sketch for this setup follows the list.)
 
+ 2. [For character-level language modeling] Hyperparameters for the best performing model are listed in the paper.
 
+ 3. [For the QA models] All models use a similar scheduler with linear warmup and decay.
 
+ 4. [For coreference resolution] The maximum sequence length was 384 for RoBERTa-base, chosen after three trials from [256, 384, 512] using the default hyperparameters in the original implementation. For Longformer-base the sequence length was 4,096. ...
 
+ 5. [For coreference resolution] ... Hyperparameter searches were minimal and consisted of grid searches of RoBERTa LR in [1e-5, 2e-5, 3e-5] and task LR in [1e-4, 2e-4, 3e-4] for both RoBERTa and Longformer for a fair comparison. The best configuration for Longformer-base was RoBERTa lr=1e-5, task lr=1e-4. All other hyperparameters were the same as in the original implementation.
 
+ 6. [For text classification] We used the Adam optimizer with batch sizes of 32 and linear warmup and decay with warmup steps equal to 0.1 of the total training steps. For both IMDB and Hyperpartisan news we did a grid search of LRs [3e-5, 5e-5] and epochs [10, 15, 20] and found the model with LR 3e-5 and 15 epochs to work best.
 
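A hedged sketch of the MLM pretraining schedule in item 1 above (the AdamW optimizer choice and the model class are assumptions; the step count, peak LR, warmup, and power-3 decay come from the list):

```python
import torch
from transformers import AutoModelForMaskedLM, get_polynomial_decay_schedule_with_warmup

model = AutoModelForMaskedLM.from_pretrained("allenai/longformer-base-4096")

# 65K gradient updates, peak LR 3e-5, 500 warmup steps, power-3 polynomial decay.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
scheduler = get_polynomial_decay_schedule_with_warmup(
    optimizer, num_warmup_steps=500, num_training_steps=65_000, power=3.0
)
```
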
+ ## Training procedure
 
+ ### Preprocessing
 
+ For WikiHop: to prepare the data for input to Longformer and RoBERTa, we first tokenize the question, answer candidates, and support contexts using RoBERTa's wordpiece tokenizer. Then we concatenate the question and answer candidates with special tokens as [q] question [/q] [ent] candidate1 [/ent] ... [ent] candidateN [/ent]. The contexts are also concatenated using RoBERTa's document delimiter tokens as separators: </s> context1 </s> ... </s> contextM </s>. The special tokens [q], [/q], [ent], [/ent] were added to the RoBERTa vocabulary and randomly initialized before task finetuning.
 
+ For TriviaQA: similar to WikiHop, we tokenize the question and the document using RoBERTa's tokenizer, then form the input as [s] question [/s] document [/s]. We truncate the document at 4,096 wordpieces to avoid it being very slow.
 
+ For HotpotQA: similar to WikiHop and TriviaQA, to prepare the data for input to Longformer, we concatenate the question and then all the 10 paragraphs in one long context. We particularly use the following input format with special tokens: "[CLS] [q] question [/q] <t> title1 </t> sent1,1 [s] sent1,2 [s] ... <t> title2 </t> sent2,1 [s] sent2,2 [s] ..." where [q], [/q], <t>, </t>, [s], [p] are special tokens representing question start and end, paragraph title start and end, and sentence, respectively. The special tokens were added to the Longformer vocabulary and randomly initialized before task finetuning.
 
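A hedged sketch of the WikiHop input construction described above (the checkpoint name, example question, candidates, and contexts are hypothetical, and the exact whitespace is an approximation of the card's format string):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
tokenizer.add_special_tokens({"additional_special_tokens": ["[q]", "[/q]", "[ent]", "[/ent]"]})

question = "which mountain range contains mount everest?"   # hypothetical example
candidates = ["himalayas", "andes", "alps"]                  # hypothetical example
contexts = ["Mount Everest is Earth's highest mountain.", "It lies in the Mahalangur Himal sub-range."]

# [q] question [/q] [ent] c1 [/ent] ... [ent] cN [/ent] followed by
# </s> context1 </s> ... </s> contextM </s>
query = f"[q] {question} [/q] " + " ".join(f"[ent] {c} [/ent]" for c in candidates)
docs = " </s> " + " </s> ".join(contexts) + " </s>"
encoded = tokenizer(query + docs, truncation=True, max_length=4096, return_tensors="pt")
print(encoded["input_ids"].shape)
```
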
+ ### Experiment
 
+ 1. Character-level language modeling: a) To compare to prior work we focus on character-level LM (text8 and enwik8; Mahoney, 2009). b) Tables 2 and 3 summarize evaluation results on the text8 and enwik8 datasets. We achieve a new state-of-the-art on both text8 and enwik8 using the small models, with BPC of 1.10 and 1.00 on text8 and enwik8 respectively, demonstrating the effectiveness of our model.
 
+ 2. Pretraining: a) We pretrain Longformer with masked language modeling (MLM), where the goal is to recover randomly masked tokens in a sequence. b) Table 5 reports MLM BPC for RoBERTa and various pretrained Longformer configurations.
 
+ 3. WikiHop: Instances in WikiHop consist of: a question, answer candidates (ranging from two candidates to 79 candidates), supporting contexts (ranging from three paragraphs to 63 paragraphs), and the correct answer. The dataset does not provide any intermediate annotation for the multihop reasoning chains, requiring models to instead infer them from the indirect answer supervision.
 
+ 4. TriviaQA: TriviaQA has more than 100K question, answer, document triplets for training. Documents are Wikipedia articles, and answers are named entities mentioned in the article. The span that answers the question is not annotated, but it is found using simple text matching.
 
+ 5. HotpotQA: The HotpotQA dataset involves answering questions from a set of 10 paragraphs from 10 different Wikipedia articles where 2 paragraphs are relevant to the question and the rest are distractors. It includes 2 tasks of answer span extraction and evidence sentence identification. Our model for HotpotQA combines both answer span extraction and evidence extraction in one joint model.
 
+ 6. Coreference model: The coreference model is a straightforward adaptation of the coarse-to-fine BERT based model from Joshi et al. (2019).
 
+ 7. Text classification: For classification, following BERT, we used a simple binary cross entropy loss on top of a first [CLS] token with addition of global attention to [CLS] (see the sketch after this list).
 
+ 8. Evaluation metrics for finetuned tasks: Summary of finetuning results on QA, coreference resolution, and document classification. Results are on the development sets comparing our Longformer-base with RoBERTa-base. TriviaQA and Hyperpartisan metrics are F1, WikiHop and IMDB use accuracy, HotpotQA is joint F1, and OntoNotes is average F1.
 
+ 9. Summarization: a) We evaluate LED on the summarization task using the arXiv summarization dataset (Cohan et al.), which focuses on long document summarization in the scientific domain. b) Table 11 reports summarization results of Longformer-Encoder-Decoder (LED) on the arXiv dataset; metrics from left to right are ROUGE-1, ROUGE-2 and ROUGE-L.
 
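As a hedged illustration of item 7 (classification with global attention on the leading token), the snippet below is not from the original card; the checkpoint name, `num_labels`, and the input text are assumptions:

```python
import torch
from transformers import LongformerForSequenceClassification, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerForSequenceClassification.from_pretrained("allenai/longformer-base-4096", num_labels=2)

inputs = tokenizer("A very long movie review ...", return_tensors="pt")
# Sliding-window (local) attention everywhere, plus global attention on the
# first token, mirroring "global attention to [CLS]" above.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1
logits = model(**inputs, global_attention_mask=global_attention_mask).logits
print(logits)
```
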
+ ## Ablation
 
+ Ablation study for WikiHop on the development set: all results use Longformer-base, fine-tuned for five epochs with identical hyperparameters except where noted. Longformer benefits from longer sequences, global attention, separate projection matrices for global attention, MLM pretraining, and longer training. In addition, when configured as in RoBERTa-base (seqlen: 512, and n^2 attention) Longformer performs slightly worse than RoBERTa-base, confirming that performance gains are not due to additional pretraining. Performance drops slightly when using the RoBERTa model pretrained while only unfreezing the additional position embeddings, showing that Longformer can learn to use long range context in task specific fine-tuning with large training datasets such as WikiHop.
 
  ### BibTeX entry and citation info
 
  ```bibtex
+ @article{DBLP:journals/corr/abs-2004-05150,
+ author = {Iz Beltagy and
+ Matthew E. Peters and
+ Arman Cohan},
+ title = {Longformer: The Long-Document Transformer},
  journal = {CoRR},
+ volume = {abs/2004.05150},
+ year = {2020},
+ url = {http://arxiv.org/abs/2004.05150},
  archivePrefix = {arXiv},
+ eprint = {2004.05150},
+ timestamp = {Wed, 22 Apr 2020 14:29:36 +0200},
+ biburl = {https://dblp.org/rec/journals/corr/abs-2004-05150.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
  }
 
+ ```