lingbionlp committed on
Commit
e49befb
1 Parent(s): a9c008f

Upload 7 files
models/bioformer-cased-v1.0/README.md ADDED
@@ -0,0 +1,32 @@
+ ---
+ language:
+ - en
+ license: apache-2.0
+ ---
+
+
+ Bioformer is a lightweight BERT model for biomedical text mining. Bioformer uses a biomedical vocabulary and is pre-trained from scratch only on biomedical-domain corpora. Our experiments show that Bioformer is 3x as fast as BERT-base, and achieves comparable or even better performance than BioBERT/PubMedBERT on downstream NLP tasks.
+
+ Bioformer has 8 layers (transformer blocks) with a hidden embedding size of 512, and the number of self-attention heads is 8. Its total number of parameters is 42,820,610.
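As a cross-check, the 42,820,610 figure can be reproduced from the architecture above (vocabulary 32,768, hidden size 512, intermediate size 2,048, 8 layers). The counting convention below (BERT pre-training heads: pooler, tied MLM decoder, and NSP classifier) is an assumption inferred from the total, not stated in the card:

```python
# Reproduce Bioformer's parameter count from its architecture.
# Assumes the BERT pre-training parametrization: pooler + NSP head included,
# MLM decoder weights tied to the word embeddings (only the bias is extra).
V, H, I, L, T, P = 32768, 512, 2048, 8, 2, 512  # vocab, hidden, FFN, layers, type vocab, max positions

embeddings = V * H + P * H + T * H + 2 * H  # word + position + token-type + LayerNorm
per_layer = (
    4 * (H * H + H)   # Q, K, V, and output projections
    + 2 * H           # attention-output LayerNorm
    + (H * I + I)     # FFN up-projection
    + (I * H + H)     # FFN down-projection
    + 2 * H           # FFN-output LayerNorm
)
pooler = H * H + H
mlm_head = (H * H + H) + 2 * H + V  # transform dense + LayerNorm + decoder bias (weights tied)
nsp_head = H * 2 + 2                # next-sentence binary classifier

total = embeddings + L * per_layer + pooler + mlm_head + nsp_head
print(total)  # 42820610
```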
+
+ ## Vocabulary of Bioformer
+ Bioformer uses a cased WordPiece vocabulary trained on a biomedical corpus, which included all PubMed abstracts (33 million, as of Feb 1, 2021) and 1 million PMC full-text articles. PMC has 3.6 million articles, but we down-sampled them to 1 million so that the total sizes of the PubMed abstracts and the PMC full-text articles are approximately equal. To mitigate the out-of-vocabulary issue and include special symbols (e.g., male and female symbols) found in the biomedical literature, we trained Bioformer's vocabulary on the Unicode text of the two resources. The vocabulary size of Bioformer is 32768 (2^15), which is similar to that of the original BERT.
+
+ ## Pre-training of Bioformer
+ Bioformer was pre-trained from scratch on the same corpus as the vocabulary (33 million PubMed abstracts + 1 million PMC full-text articles). For the masked language modeling (MLM) objective, we used whole-word masking with a masking rate of 15%. There is debate on whether the next sentence prediction (NSP) objective improves performance on downstream tasks. We included it in our pre-training experiment in case next-sentence prediction is needed by end users. Sentence segmentation of all training text was performed using [SciSpacy](https://allenai.github.io/scispacy/).
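Whole-word masking differs from token-level masking in that all WordPiece pieces of a chosen word are masked together. A minimal sketch of the idea (the function name and interface are illustrative, not Bioformer's actual pre-training code):

```python
import random

def whole_word_mask(tokens, mask_rate=0.15, rng=None):
    """Mask whole words in a WordPiece token sequence.

    A piece starting with "##" continues the previous word, so it is
    always masked together with the pieces it belongs to.
    """
    rng = rng or random.Random(0)
    # Group token indices into words.
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])
    # Choose ~mask_rate of the words (at least one) and mask every piece.
    n_mask = max(1, round(len(words) * mask_rate))
    out = list(tokens)
    for word in rng.sample(words, n_mask):
        for i in word:
            out[i] = "[MASK]"
    return out

masked = whole_word_mask(["bio", "##former", "is", "fast"])
print(masked)
```

Note that "bio" and "##former" are always masked together or not at all, which is the defining property of whole-word masking.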
+
+ Pre-training of Bioformer was performed on a single Cloud TPU device (TPUv2, 8 cores, 8 GB memory per core). The maximum input sequence length was fixed at 512, and the batch size was set to 256. We pre-trained Bioformer for 2 million steps, which took about 8.3 days.
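At these settings the total training volume works out to 512 million sequences, or roughly 2.6 x 10^11 token positions (an upper bound, since sequences shorter than 512 tokens are padded):

```python
# Training volume implied by the stated hyperparameters.
steps = 2_000_000
batch_size = 256
max_seq_len = 512

sequences = steps * batch_size       # total sequences seen during pre-training
token_positions = sequences * max_seq_len  # upper bound: shorter sequences are padded
print(sequences, token_positions)    # 512000000 262144000000
```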
+
+
+ ## Awards
+
+ Bioformer achieved top performance (highest micro-F1 score) in the BioCreative VII COVID-19 multi-label topic classification challenge (https://biocreative.bioinformatics.udel.edu/media/store/files/2021/TRACK5_pos_1_BC7_submission_221.pdf).
+
+ ## Acknowledgment
+
+ Bioformer is partly supported by the Google TPU Research Cloud (TRC) program.
+
+ ## Questions
+ If you have any questions, please submit an issue here: https://github.com/WGLab/bioformer/issues
+
+ You can also send an email to Li Fang (fangli2718@gmail.com).
models/bioformer-cased-v1.0/config.json ADDED
@@ -0,0 +1,17 @@
+ {
+ "architectures": [
+ "BertForMaskedLM"
+ ],
+ "model_type": "bert",
+ "attention_probs_dropout_prob": 0.1,
+ "hidden_act": "gelu",
+ "hidden_dropout_prob": 0.1,
+ "hidden_size": 512,
+ "initializer_range": 0.02,
+ "intermediate_size": 2048,
+ "max_position_embeddings": 512,
+ "num_attention_heads": 8,
+ "num_hidden_layers": 8,
+ "type_vocab_size": 2,
+ "vocab_size": 32768
+ }
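The config above implies a per-head dimension of 512 / 8 = 64. A quick sanity check that the fields are mutually consistent, parsing the relevant JSON inline rather than from the repo file:

```python
import json

# The architecture-relevant fields of Bioformer's config.json, inlined
# so the check is self-contained.
config = json.loads("""
{
  "hidden_size": 512,
  "num_attention_heads": 8,
  "num_hidden_layers": 8,
  "intermediate_size": 2048,
  "max_position_embeddings": 512,
  "vocab_size": 32768
}
""")

# hidden_size must divide evenly among the attention heads.
head_dim = config["hidden_size"] // config["num_attention_heads"]
print(head_dim)  # 64
```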
models/bioformer-cased-v1.0/pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9c7a0bc1d0cccc89d92b1295c4b5131f54f6f1929dc6024fe70def6e58670a69
+ size 171344886
models/bioformer-cased-v1.0/tf_model.h5 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:208c90dd85e6e327eeb098f52ae7bea323e902c50a5a22776c974e41b078451a
+ size 239660032
models/bioformer-cased-v1.0/tokenizer_config.json ADDED
@@ -0,0 +1 @@
+ {"do_lower_case": false}
models/bioformer-cased-v1.0/vocab.txt ADDED
The diff for this file is too large to render.
models/bioformer-cl-allgram.h5 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7de1e1c9766d8bdb799748f3edf2d6664379365af05831c0a7af05e7d804aec7
+ size 203005632