michiyasunaga committed
Commit d7bc175
0 Parent(s)
.gitattributes ADDED
@@ -0,0 +1,27 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bin.* filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zstandard filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,78 @@
+ ---
+ license: apache-2.0
+ language: en
+ datasets:
+ - wikipedia
+ - bookcorpus
+ tags:
+ - bert
+ - exbert
+ - linkbert
+ - feature-extraction
+ - fill-mask
+ - question-answering
+ - text-classification
+ - token-classification
+ ---
+
+ ## LinkBERT-large
+
+ This is the LinkBERT-large model, pretrained on English Wikipedia articles along with hyperlink information. It was introduced in the paper [LinkBERT: Pretraining Language Models with Document Links (ACL 2022)](https://arxiv.org/abs/2203.15827). The code and data are available in [this repository](https://github.com/michiyasunaga/LinkBERT).
+
+
+ ## Model description
+
+ LinkBERT is a transformer encoder (BERT-like) model pretrained on a large corpus of documents. It improves on BERT by additionally capturing **document links**, such as hyperlinks and citation links, so that the model can incorporate knowledge that spans multiple documents. Specifically, it was pretrained by feeding linked documents into the same language model context, in addition to single documents.
+
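+ As an illustrative sketch of this input format (not the actual pretraining pipeline, which is in the repository linked above), two linked passages can be packed into one context with the tokenizer's standard text-pair interface, giving the model a single `[CLS] doc_A [SEP] doc_B [SEP]` sequence; the passages below are made-up examples:
+
+ ```python
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained('michiyasunaga/LinkBERT-large')
+
+ # Hypothetical linked passages: doc_b is a document hyperlinked from doc_a.
+ doc_a = "Tidal pools are rocky pools on the seashore which are filled with seawater."
+ doc_b = "Seawater is water from a sea or ocean, with a salinity of roughly 3.5%."
+
+ # Standard BERT two-segment encoding: [CLS] doc_a [SEP] doc_b [SEP]
+ encoded = tokenizer(doc_a, doc_b, return_tensors="pt")
+ print(tokenizer.decode(encoded["input_ids"][0]))
+ ```
+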
+ LinkBERT can be used as a drop-in replacement for BERT. It achieves better performance on general language understanding tasks (e.g. text classification), and is also particularly effective for **knowledge-intensive** tasks (e.g. question answering) and **cross-document** tasks (e.g. reading comprehension, document retrieval).
+
+
+ ## Intended uses & limitations
+
+ The model can be used by fine-tuning it on a downstream task, such as question answering, sequence classification, or token classification.
+ You can also use the raw model for feature extraction (i.e. obtaining embeddings for input text).
+
+
+ ### How to use
+
+ To use the model to obtain the features of a given text in PyTorch:
+
+ ```python
+ from transformers import AutoTokenizer, AutoModel
+
+ # Load the LinkBERT-large tokenizer and encoder.
+ tokenizer = AutoTokenizer.from_pretrained('michiyasunaga/LinkBERT-large')
+ model = AutoModel.from_pretrained('michiyasunaga/LinkBERT-large')
+
+ # Encode a piece of text and extract its contextual embeddings.
+ inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
+ outputs = model(**inputs)
+ last_hidden_states = outputs.last_hidden_state
+ ```
+
+ For fine-tuning, you can use [this repository](https://github.com/michiyasunaga/LinkBERT) or follow any other BERT fine-tuning codebase; a minimal sketch is shown below.
+
+
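+ For example, a sequence-classification fine-tuning run with the Hugging Face `Trainer` API might look roughly like the following. This is a sketch, not the recipe from the paper: SST-2 (a GLUE task) is used only as a stand-in dataset, and the hyperparameters are placeholders.
+
+ ```python
+ from datasets import load_dataset
+ from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
+                           Trainer, TrainingArguments)
+
+ # SST-2 from GLUE, used here purely as an example task.
+ dataset = load_dataset("glue", "sst2")
+
+ tokenizer = AutoTokenizer.from_pretrained("michiyasunaga/LinkBERT-large")
+ # Pretrained LinkBERT encoder with a newly initialized 2-way classification head.
+ model = AutoModelForSequenceClassification.from_pretrained(
+     "michiyasunaga/LinkBERT-large", num_labels=2
+ )
+
+ def tokenize(batch):
+     return tokenizer(batch["sentence"], truncation=True, max_length=512)
+
+ encoded = dataset.map(tokenize, batched=True)
+
+ # Placeholder hyperparameters; tune them for your task.
+ args = TrainingArguments(
+     output_dir="linkbert-large-sst2",
+     learning_rate=2e-5,
+     per_device_train_batch_size=16,
+     num_train_epochs=3,
+ )
+ trainer = Trainer(
+     model=model,
+     args=args,
+     train_dataset=encoded["train"],
+     eval_dataset=encoded["validation"],
+     tokenizer=tokenizer,
+ )
+ trainer.train()
+ ```
+
+ The same pattern applies to the other supported task types, e.g. `AutoModelForQuestionAnswering` or `AutoModelForTokenClassification` in place of the sequence-classification head.
+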
+ ## Evaluation results
+
+ When fine-tuned on downstream tasks, LinkBERT achieves the following results.
+
+ **General benchmarks ([MRQA](https://github.com/mrqa/MRQA-Shared-Task-2019) and [GLUE](https://gluebenchmark.com/)):**
+
+ |                    | HotpotQA (F1) | TriviaQA (F1) | SearchQA (F1) | NaturalQ (F1) | NewsQA (F1) | SQuAD (F1) | GLUE (avg. score) |
+ | ------------------ | ------------- | ------------- | ------------- | ------------- | ----------- | ---------- | ----------------- |
+ | BERT-base          | 76.0          | 70.3          | 74.2          | 76.5          | 65.7        | 88.7       | 79.2              |
+ | **LinkBERT-base**  | **78.2**      | **73.9**      | **76.8**      | **78.3**      | **69.3**    | **90.1**   | **79.6**          |
+ | BERT-large         | 78.1          | 73.7          | 78.3          | 79.0          | 70.9        | 91.1       | 80.7              |
+ | **LinkBERT-large** | **80.8**      | **78.2**      | **80.5**      | **81.0**      | **72.6**    | **92.7**   | **81.1**          |
+
+
+ ## Citation
+
+ If you find LinkBERT useful in your project, please cite the following:
+
+ ```bibtex
+ @InProceedings{yasunaga2022linkbert,
+   author =  {Michihiro Yasunaga and Jure Leskovec and Percy Liang},
+   title =   {LinkBERT: Pretraining Language Models with Document Links},
+   year =    {2022},
+   booktitle = {Association for Computational Linguistics (ACL)},
+ }
+ ```
config.json ADDED
@@ -0,0 +1,29 @@
+ {
+   "architectures": [
+     "BertModel"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "directionality": "bidi",
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 1024,
+   "initializer_range": 0.02,
+   "intermediate_size": 4096,
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 16,
+   "num_hidden_layers": 24,
+   "pad_token_id": 0,
+   "pooler_fc_size": 768,
+   "pooler_num_attention_heads": 12,
+   "pooler_num_fc_layers": 3,
+   "pooler_size_per_head": 128,
+   "pooler_type": "first_token_transform",
+   "position_embedding_type": "absolute",
+   "transformers_version": "4.9.1",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 28996
+ }
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b3280d127791586faa15007e8d47f1047961aed9ece99580cb00364ca085b96b
+ size 1334496567
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+ {"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+ {"do_lower_case": false, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "model_max_length": 512, "special_tokens_map_file": null, "name_or_path": "bert-large-cased", "tokenizer_class": "BertTokenizer"}
vocab.txt ADDED
The diff for this file is too large to render. See raw diff