Commit b71f5d7 by michiyasunaga (0 parents)
.gitattributes ADDED
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bin.* filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zstandard filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
README.md ADDED
---
license: apache-2.0
language: en
datasets:
- pubmed
tags:
- bert
- exbert
- linkbert
- biolinkbert
- feature-extraction
- fill-mask
- question-answering
- text-classification
- token-classification
widget:
- text: "Sunitinib is a tyrosine kinase inhibitor"
---

## BioLinkBERT-base

BioLinkBERT-base is a model pretrained on [PubMed](https://pubmed.ncbi.nlm.nih.gov/) abstracts together with citation link information. It was introduced in the paper [LinkBERT: Pretraining Language Models with Document Links (ACL 2022)](https://arxiv.org/abs/2203.15827). The code and data are available in [this repository](https://github.com/michiyasunaga/LinkBERT).

This model achieves state-of-the-art performance on several biomedical NLP benchmarks such as [BLURB](https://microsoft.github.io/BLURB/) and [MedQA-USMLE](https://github.com/jind11/MedQA).


## Model description

LinkBERT is a transformer encoder (BERT-like) model pretrained on a large corpus of documents. It improves on BERT by additionally capturing **document links**, such as hyperlinks and citation links, to incorporate knowledge that spans multiple documents. Specifically, it was pretrained by feeding linked documents into the same language model context, in addition to single documents (a small illustration follows at the end of this section).

LinkBERT can be used as a drop-in replacement for BERT. It achieves better performance on general language understanding tasks (e.g. text classification), and is also particularly effective for **knowledge-intensive** tasks (e.g. question answering) and **cross-document** tasks (e.g. reading comprehension, document retrieval).


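To make the pretraining input format concrete, here is a small illustrative sketch (not from the original card): an anchor passage and a passage from a document it links to are packed into one context as a BERT-style segment pair. The two example passages are made up for illustration.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('michiyasunaga/BioLinkBERT-base')

# Hypothetical anchor passage and a passage from a document it cites
anchor = "Sunitinib is a tyrosine kinase inhibitor used to treat renal cell carcinoma."
linked = "Tyrosine kinase inhibitors block signaling pathways that drive tumor growth."

# Place both passages in the same context as a segment pair:
# [CLS] anchor [SEP] linked [SEP], with token_type_ids marking the two segments
encoding = tokenizer(anchor, linked, return_tensors="pt")
print(tokenizer.decode(encoding["input_ids"][0]))
print(encoding["token_type_ids"][0].tolist())
```
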
## Intended uses & limitations

The model can be used by fine-tuning it on a downstream task, such as question answering, sequence classification, or token classification.
You can also use the raw model for feature extraction (i.e. obtaining embeddings for input text).


### How to use

To use the model to get the features of a given text in PyTorch:

```python
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and the pretrained encoder from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained('michiyasunaga/BioLinkBERT-base')
model = AutoModel.from_pretrained('michiyasunaga/BioLinkBERT-base')

# Tokenize a sentence and run it through the encoder
inputs = tokenizer("Sunitinib is a tyrosine kinase inhibitor", return_tensors="pt")
outputs = model(**inputs)

# Token-level embeddings of shape (batch_size, sequence_length, hidden_size)
last_hidden_states = outputs.last_hidden_state
```

For fine-tuning, you can use [this repository](https://github.com/michiyasunaga/LinkBERT) or follow any other BERT fine-tuning codebase; a minimal fine-tuning sketch is shown below.

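As a rough sketch (not taken from the LinkBERT repository), fine-tuning for sequence classification with the Hugging Face `Trainer` might look like the following; the dataset, label count, output path, and hyperparameters are placeholders.

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

# Placeholder dataset: any text-classification dataset with "text"/"label" columns works
dataset = load_dataset("imdb")

tokenizer = AutoTokenizer.from_pretrained("michiyasunaga/BioLinkBERT-base")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True)

# A fresh classification head is added on top of the pretrained encoder
model = AutoModelForSequenceClassification.from_pretrained(
    "michiyasunaga/BioLinkBERT-base", num_labels=2)

args = TrainingArguments(
    output_dir="biolinkbert-finetuned",   # placeholder path
    per_device_train_batch_size=16,       # placeholder hyperparameters
    learning_rate=3e-5,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
```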

## Evaluation results

When fine-tuned on downstream tasks, LinkBERT achieves the following results.

**Biomedical benchmarks ([BLURB](https://microsoft.github.io/BLURB/), [MedQA](https://github.com/jind11/MedQA), [MMLU](https://github.com/hendrycks/test), etc.):** BioLinkBERT attains new state-of-the-art performance.

| Model                 | BLURB score | PubMedQA | BioASQ   | MedQA-USMLE |
| --------------------- | ----------- | -------- | -------- | ----------- |
| PubmedBERT-base       | 81.10       | 55.8     | 87.5     | 38.1        |
| **BioLinkBERT-base**  | **83.39**   | **70.2** | **91.4** | **40.0**    |
| **BioLinkBERT-large** | **84.30**   | **72.2** | **94.8** | **44.6**    |

| Model                               | MMLU-professional medicine |
| ----------------------------------- | -------------------------- |
| GPT-3 (175B params)                 | 38.7                       |
| UnifiedQA (11B params)              | 43.2                       |
| **BioLinkBERT-large (340M params)** | **50.7**                   |


## Citation

If you find LinkBERT useful in your project, please cite the following:

```bibtex
@InProceedings{yasunaga2022linkbert,
  author    = {Michihiro Yasunaga and Jure Leskovec and Percy Liang},
  title     = {LinkBERT: Pretraining Language Models with Document Links},
  year      = {2022},
  booktitle = {Association for Computational Linguistics (ACL)},
}
```
config.json ADDED
{
  "architectures": [
    "BertModel"
  ],
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.9.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28895
}
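
The configuration above describes a BERT-base-sized encoder (12 layers, 12 heads, hidden size 768, 512 positions). As an illustrative snippet (not part of the original files), it can be inspected programmatically:

```python
from transformers import AutoConfig

# Load the configuration shown above directly from the Hub
config = AutoConfig.from_pretrained("michiyasunaga/BioLinkBERT-base")

print(config.num_hidden_layers, config.num_attention_heads,
      config.hidden_size, config.max_position_embeddings)
print(config.vocab_size)  # 28895, matching the PubMedBERT-derived tokenizer below
```
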
pytorch_model.bin ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:acc5ae5f16206b893adfcf10772ee4472e24d7847145ac961097bd06129e2ece
size 433019313
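
The file above is a Git LFS pointer: the actual ~433 MB weight file is stored in LFS and identified by its SHA-256. As an illustration (the local path is an assumption), a downloaded copy can be checked against the pointer like this:

```python
import hashlib
import os

path = "pytorch_model.bin"  # hypothetical local path to the downloaded weights

h = hashlib.sha256()
with open(path, "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):  # read in 1 MiB chunks
        h.update(chunk)

# Compare against the oid and size recorded in the LFS pointer
print(h.hexdigest() == "acc5ae5f16206b893adfcf10772ee4472e24d7847145ac961097bd06129e2ece")
print(os.path.getsize(path) == 433019313)
```
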
special_tokens_map.json ADDED
{"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tokenizer.json ADDED
The contents of this file are too large to render.
 
tokenizer_config.json ADDED
{"do_lower_case": true, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "special_tokens_map_file": null, "name_or_path": "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract", "do_basic_tokenize": true, "never_split": null, "tokenizer_class": "BertTokenizer"}
vocab.txt ADDED
The contents of this file are too large to render.