|
--- |
|
language: en |
|
tags: |
|
- bert |
|
- long context |
|
pipeline_tag: fill-mask |
|
--- |
|
|
|
# LSG model |
|
**Transformers >= 4.36.1**\ |
|
**This model relies on a custom modeling file; you need to add trust_remote_code=True**\
|
**See [\#13467](https://github.com/huggingface/transformers/pull/13467)** |
|
|
|
LSG ArXiv [paper](https://arxiv.org/abs/2210.15497). \ |
|
Github/conversion script is available at this [link](https://github.com/ccdv-ai/convert_checkpoint_to_lsg). |
|
|
|
* [Usage](#usage) |
|
* [Parameters](#parameters) |
|
* [Sparse selection type](#sparse-selection-type) |
|
* [Tasks](#tasks) |
|
* [Training global tokens](#training-global-tokens) |
|
|
|
This model is adapted from [BERT-base-uncased](https://huggingface.co/bert-base-uncased) without additional pretraining yet. It uses the same number of parameters/layers and the same tokenizer. |
|
|
|
This model can handle long sequences, and it is faster and more efficient than Longformer or BigBird (from Transformers). It relies on Local + Sparse + Global attention (LSG).
|
|
|
The model requires sequences whose length is a multiple of the block size. The model is "adaptive" and automatically pads the sequences if needed (adaptive=True in config). It is however recommended to let the tokenizer truncate the inputs (truncation=True) and optionally pad them to a multiple of the block size (pad_to_multiple_of=...).
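For illustration, a minimal tokenization sketch (assuming the default block_size of 128; the input string is a placeholder):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ccdv/lsg-bert-base-uncased-4096")

# Truncate long inputs and optionally pad them to a multiple of the block size (128 by default).
# Note that pad_to_multiple_of only takes effect when padding is enabled.
inputs = tokenizer(
    "A very long document ...",
    truncation=True,
    padding=True,
    pad_to_multiple_of=128,
    return_tensors="pt",
)
```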
|
|
|
Encoder-decoder is supported, but I did not test it extensively.\
|
Implemented in PyTorch. |
|
|
|
![attn](attn.png) |
|
|
|
## Usage |
|
The model relies on a custom modeling file, so you need to add trust_remote_code=True to use it.
|
|
|
```python
|
from transformers import AutoModel, AutoTokenizer |
|
|
|
model = AutoModel.from_pretrained("ccdv/lsg-bert-base-uncased-4096", trust_remote_code=True) |
|
tokenizer = AutoTokenizer.from_pretrained("ccdv/lsg-bert-base-uncased-4096") |
|
``` |
|
|
|
## Parameters |
|
You can change various parameters such as:
|
* the number of global tokens (num_global_tokens=1) |
|
* local block size (block_size=128) |
|
* sparse block size (sparse_block_size=128) |
|
* sparsity factor (sparsity_factor=2) |
|
* mask_first_token (mask first token since it is redundant with the first global token) |
|
* see config.json file |
|
|
|
Default parameters work well in practice. If you are short on memory, reduce the block sizes, increase the sparsity factor, and remove dropout from the attention score matrix.
|
|
|
```python
|
from transformers import AutoModel |
|
|
|
model = AutoModel.from_pretrained("ccdv/lsg-bert-base-uncased-4096", |
|
trust_remote_code=True, |
|
num_global_tokens=16, |
|
block_size=64, |
|
sparse_block_size=64, |
|
    attention_probs_dropout_prob=0.0,
|
sparsity_factor=4, |
|
sparsity_type="none", |
|
mask_first_token=True |
|
) |
|
``` |
|
|
|
## Sparse selection type |
|
|
|
There are 6 different sparse selection patterns. The best type is task dependent. \ |
|
If `sparse_block_size=0` or `sparsity_type="none"`, only local attention is considered. \ |
|
Note that for sequences with length < 2*block_size, the type has no effect. A configuration sketch is given after the list below.
|
* `sparsity_type="bos_pooling"` (new) |
|
* weighted average pooling using the BOS token |
|
* Works best in general, especially with a rather large sparsity_factor (8, 16, 32) |
|
* Additional parameters: |
|
* None |
|
* `sparsity_type="norm"`, select highest norm tokens |
|
* Works best for a small sparsity_factor (2 to 4) |
|
* Additional parameters: |
|
* None |
|
* `sparsity_type="pooling"`, use average pooling to merge tokens |
|
* Works best for a small sparsity_factor (2 to 4) |
|
* Additional parameters: |
|
* None |
|
* `sparsity_type="lsh"`, use the LSH algorithm to cluster similar tokens |
|
* Works best for a large sparsity_factor (4+) |
|
* LSH relies on random projections, thus inference may differ slightly with different seeds |
|
* Additional parameters: |
|
* lsg_num_pre_rounds=1, pre merge tokens n times before computing centroids |
|
* `sparsity_type="stride"`, use a striding mechanism per head

  * Each head will use different tokens strided by sparsity_factor

  * Not recommended if sparsity_factor > num_heads

* `sparsity_type="block_stride"`, use a striding mechanism per head

  * Each head will use blocks of tokens strided by sparsity_factor

  * Not recommended if sparsity_factor > num_heads
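For illustration, a minimal sketch of selecting a pattern when loading the model (the values below are arbitrary choices, not recommendations):

```python
from transformers import AutoModel

# Load the model with the LSH sparse selection pattern.
# A rather large sparsity_factor is used since LSH tends to work best with one.
model = AutoModel.from_pretrained("ccdv/lsg-bert-base-uncased-4096",
    trust_remote_code=True,
    sparsity_type="lsh",
    sparsity_factor=8,
    lsg_num_pre_rounds=1,  # pre-merge tokens before computing centroids
)
```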
|
|
|
## Tasks |
|
Fill mask example: |
|
```python
|
from transformers import FillMaskPipeline, AutoModelForMaskedLM, AutoTokenizer |
|
|
|
model = AutoModelForMaskedLM.from_pretrained("ccdv/lsg-bert-base-uncased-4096", trust_remote_code=True) |
|
tokenizer = AutoTokenizer.from_pretrained("ccdv/lsg-bert-base-uncased-4096") |
|
|
|
SENTENCES = "Paris is the [MASK] of France." |
|
pipeline = FillMaskPipeline(model, tokenizer) |
|
output = pipeline(SENTENCES) |
|
|
|
> 'Paris is the capital of France.' |
|
``` |
|
|
|
|
|
Classification example: |
|
```python
|
from transformers import AutoModelForSequenceClassification, AutoTokenizer |
|
|
|
model = AutoModelForSequenceClassification.from_pretrained("ccdv/lsg-bert-base-uncased-4096", |
|
trust_remote_code=True, |
|
pool_with_global=True, # pool with a global token instead of first token |
|
) |
|
tokenizer = AutoTokenizer.from_pretrained("ccdv/lsg-bert-base-uncased-4096") |
|
|
|
SENTENCE = "This is a test for sequence classification. " * 300 |
|
token_ids = tokenizer( |
|
SENTENCE, |
|
return_tensors="pt", |
|
#pad_to_multiple_of=... # Optional |
|
truncation=True |
|
) |
|
output = model(**token_ids) |
|
|
|
> SequenceClassifierOutput(loss=None, logits=tensor([[-0.3051, -0.1762]], grad_fn=<AddmmBackward>), hidden_states=None, attentions=None) |
|
``` |
|
|
|
## Training global tokens |
|
To train global tokens and the classification head only: |
|
```python
|
from transformers import AutoModelForSequenceClassification, AutoTokenizer |
|
|
|
model = AutoModelForSequenceClassification.from_pretrained("ccdv/lsg-bert-base-uncased-4096", |
|
trust_remote_code=True, |
|
pool_with_global=True, # pool with a global token instead of first token |
|
num_global_tokens=16 |
|
) |
|
tokenizer = AutoTokenizer.from_pretrained("ccdv/lsg-bert-base-uncased-4096") |
|
|
|
for name, param in model.named_parameters():
    # Freeze everything except the global token embeddings and the classification head
    # (the head parameters are assumed to be named "classifier", as in BertForSequenceClassification)
    if "global_embeddings" not in name and "classifier" not in name:
        param.requires_grad = False
    else:
        param.requires_grad = True
|
``` |
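As a quick sanity check (continuing from the block above), you can count how many parameters remain trainable:

```python
# Count trainable vs. total parameters after freezing
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable:,} / {total:,}")
```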
|
|
|
**BERT** |
|
``` |
|
@article{DBLP:journals/corr/abs-1810-04805, |
|
author = {Jacob Devlin and |
|
Ming{-}Wei Chang and |
|
Kenton Lee and |
|
Kristina Toutanova}, |
|
title = {{BERT:} Pre-training of Deep Bidirectional Transformers for Language |
|
Understanding}, |
|
journal = {CoRR}, |
|
volume = {abs/1810.04805}, |
|
year = {2018}, |
|
url = {http://arxiv.org/abs/1810.04805}, |
|
archivePrefix = {arXiv}, |
|
eprint = {1810.04805}, |
|
timestamp = {Tue, 30 Oct 2018 20:39:56 +0100}, |
|
biburl = {https://dblp.org/rec/journals/corr/abs-1810-04805.bib}, |
|
bibsource = {dblp computer science bibliography, https://dblp.org} |
|
} |
|
``` |