Longformer PhoBERT base model with a maximum input length of 4096
Experiment performed with Transformers version 4.25.1
A Longformer-style RoBERTa model for long contexts, based on vinai/phobert-base and Longformer.
The PhoBERT model was converted to a Longformer version using the Longformer authors' repo, then MLM pretraining was continued for 5000 steps with batch size 64 on the Binhvq News Corpus so that the model could learn to work with the new sliding window attention.
This corpus generally does not contain very long documents, so you should fine-tune this model on your own long-document dataset for the downstream task to get better results.
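The conversion step is only described here as "using the Longformer authors' repo". A minimal sketch of one part of such a conversion, assuming it reuses the short model's learned position embeddings by copying them repeatedly into a longer table. Plain Python lists stand in for tensors, and the function name and the two reserved rows are illustrative assumptions, not taken from this model's actual conversion script.

```python
def extend_position_rows(old_rows, new_len, num_reserved=2):
    """Build a longer position table by repeating the learned rows.

    old_rows     -- list of embedding rows from the short model
    new_len      -- total number of rows wanted in the long model
    num_reserved -- leading rows copied once and never repeated
                    (assumed: RoBERTa-style models reserve two position ids)
    """
    reserved = old_rows[:num_reserved]
    learned = old_rows[num_reserved:]
    if not learned:
        raise ValueError("no learned position rows to copy")
    new_rows = list(reserved)
    # Cycle through the learned rows until the new table is full.
    for k in range(new_len - num_reserved):
        new_rows.append(learned[k % len(learned)])
    return new_rows


# Toy table: 2 reserved rows + 3 learned rows, extended to 2 + 6 rows.
toy = [["r0"], ["r1"], ["p0"], ["p1"], ["p2"]]
extended = extend_position_rows(toy, new_len=8)
print(extended)
```

Copying gives the long model a sensible starting point for every position, but the copied rows still have to be adapted, which is what the continued MLM pretraining is for.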
The final BPC is 1.926 (in my experiment, the original BPC of the PhoBERT-base model with a max input length of 256 was 2.067).
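How the BPC figures were computed is not stated here. A minimal sketch, assuming the common convention of dividing the mean MLM evaluation loss (cross-entropy in nats) by ln 2 to express it in bits. The helper names are hypothetical, and the loop only converts the two reported values back to nats for comparison with a training log.

```python
import math


def nats_to_bits(loss_nats):
    """Convert a mean cross-entropy loss from nats to bits."""
    return loss_nats / math.log(2)


def bits_to_nats(bits):
    """Inverse conversion: bits back to a loss in nats."""
    return bits * math.log(2)


# The two BPC values reported above, shown as the equivalent loss in nats.
reported = [
    ("longformer-phobert-base-4096", 1.926),
    ("phobert-base, max length 256", 2.067),
]
for name, bpc in reported:
    print(f"{name}: {bpc} bits ~ {bits_to_nats(bpc):.3f} nats")
```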
Usage
Fill mask example:
from transformers import RobertaForMaskedLM, AutoTokenizer
from transformers.models.longformer.modeling_longformer import LongformerSelfAttention
class RobertaLongSelfAttention(LongformerSelfAttention):
    # Adapts Longformer's self-attention to the call signature used by RoBERTa layers.
    def forward(
        self,
        hidden_states,
        attention_mask=None,
        head_mask=None,
        encoder_hidden_states=None,
        encoder_attention_mask=None,
        past_key_value=None,
        output_attentions=False,
    ):
        # RoBERTa passes an additive mask of shape (batch, 1, 1, seq_len); drop the singleton dims.
        attention_mask = attention_mask.squeeze(dim=2).squeeze(dim=1)
        # Longformer convention: negative = masked, positive = global attention, zero = local attention.
        is_index_masked = attention_mask < 0
        is_index_global_attn = attention_mask > 0
        is_global_attn = is_index_global_attn.any().item()
        return super().forward(
            hidden_states,
            is_index_masked=is_index_masked,
            is_index_global_attn=is_index_global_attn,
            is_global_attn=is_global_attn,
            attention_mask=attention_mask,
            output_attentions=output_attentions,
        )


class RobertaLongForMaskedLM(RobertaForMaskedLM):
    def __init__(self, config):
        super().__init__(config)
        # Swap every layer's self-attention for the sliding-window version.
        for i, layer in enumerate(self.roberta.encoder.layer):
            layer.attention.self = RobertaLongSelfAttention(config, layer_id=i)

tokenizer = AutoTokenizer.from_pretrained("bluenguyen/longformer-phobert-base-4096")
model = RobertaLongForMaskedLM.from_pretrained("bluenguyen/longformer-phobert-base-4096")
TXT = (
    "Hoàng_Sa và Trường_Sa là <mask> Việt_Nam ."
    + "Đó là điều không_thể chối_cãi ." * 300
    + "Bằng_chứng lịch_sử , pháp_lý về chủ_quyền của Việt_Nam với 2 quần_đảo này đã và đang được nhiều quốc_gia và cộng_đồng quốc_tế <mask> ."
)
input_ids = tokenizer([TXT], padding=True, pad_to_multiple_of=256, return_tensors="pt")["input_ids"]
logits = model(input_ids).logits
masked_index = [i.item() for i in (input_ids[0] == tokenizer.mask_token_id).nonzero()]
for index in masked_index:
    probs = logits[0, index].softmax(dim=0)
    values, predictions = probs.topk(3)
    print(tokenizer.batch_decode([[p] for p in predictions]))
> ['của', 'lãnh_thổ', 'chủ_quyền']
> ['công_nhận', 'thừa_nhận', 'ghi_nhận']
Because this model is based on vinai/phobert-base, users should use VnCoreNLP or the Python Vietnamese Toolkit (pyvi) to word-segment raw input texts.
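A minimal sketch of the segmentation step using pyvi, assuming the package is installed (pip install pyvi). ViTokenizer.tokenize is pyvi's word segmenter; the segment_words wrapper and its optional segmenter argument are hypothetical conveniences so that another segmenter, such as a VnCoreNLP wrapper, can be plugged in instead.

```python
def segment_words(text, segmenter=None):
    """Word-segment raw Vietnamese text before passing it to the tokenizer.

    segmenter -- any callable str -> str; defaults to pyvi's ViTokenizer.tokenize,
                 which joins the syllables of a multi-syllable word with "_".
    """
    if segmenter is None:
        # Imported lazily so pyvi is only required when the default is used.
        from pyvi import ViTokenizer

        segmenter = ViTokenizer.tokenize
    return segmenter(text)
```

The segmented string returned by segment_words(raw_text) is what should be given to the tokenizer in the fill mask example, in place of hand-segmented text.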
More details about Longformer can be found in the Longformer authors' repo.
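The RobertaLongSelfAttention wrapper in the usage example sorts positions by the sign of the additive attention mask: negative means masked, positive means global attention, zero means local sliding window attention. A minimal sketch of that convention in plain Python, assuming the mask is built the way a standard RoBERTa forward pass builds it (1 becomes 0, 0 becomes a large negative number). The value -1e4 is a stand-in for that large negative number, and both function names are hypothetical.

```python
def to_additive_mask(attention_mask, large_negative=-1e4):
    """Turn a 0/1 tokenizer mask into an additive mask (assumed RoBERTa-style)."""
    return [(1 - m) * large_negative for m in attention_mask]


def classify_position(value):
    """Apply the sign convention used by the wrapper's forward method."""
    if value < 0:
        return "masked"
    if value > 0:
        return "global"
    return "local"


# Three real tokens followed by two padding tokens.
tokenizer_mask = [1, 1, 1, 0, 0]
kinds = [classify_position(v) for v in to_additive_mask(tokenizer_mask)]
print(kinds)
```

Under this assumption no position ever receives a positive value, so the model as loaded in the usage example runs with local attention only.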
Contact information
For personal questions related to this implementation, please contact reddotbluename@gmail.com