🤗 + 📚🩺🇮🇹 = **BioBIT**

From this repository you can download the **BioBIT** (Biomedical Bert for ITalian) checkpoint.

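As a quick start, here is a minimal loading sketch with the 🤗 `transformers` library; the model id below is a placeholder for this repository's actual id, and the example sentence is purely illustrative:

```python
# Minimal loading sketch using the Hugging Face `transformers` library.
# NOTE: "IVN-RIN/bioBIT" is a placeholder id; substitute this repository's id.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("IVN-RIN/bioBIT")
model = AutoModel.from_pretrained("IVN-RIN/bioBIT")

# Encode an illustrative Italian clinical sentence and run a forward pass.
inputs = tokenizer("Il paziente presenta febbre e cefalea.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, seq_len, hidden_size) embeddings
```
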
BioBIT stems from [Italian XXL BERT](https://huggingface.co/dbmdz/bert-base-italian-xxl-cased), which was trained on a recent Wikipedia dump and various Italian texts from the OPUS and OSCAR corpora collections, adding up to a final corpus size of 81 GB and 13B tokens.

To pretrain BioBIT, we followed the general approach outlined in the [BioBERT paper](https://arxiv.org/abs/1901.08746), built on the foundation of the BERT architecture. The pretraining objective is a combination of **MLM** (Masked Language Modelling) and **NSP** (Next Sentence Prediction). The MLM objective randomly masks 15% of the input tokens and trains the model to predict them; for the NSP objective, the model is given a pair of sentences and must predict whether the second follows the first in the original document.

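To see the MLM objective in action at inference time, here is a hedged sketch with the `fill-mask` pipeline (again, the model id and the Italian sentence are illustrative placeholders, not taken from the paper):

```python
# Sketch: query the pretrained MLM head through the fill-mask pipeline.
# The model id and the example sentence are illustrative placeholders.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="IVN-RIN/bioBIT")
for pred in fill_mask("Il paziente è stato trattato con [MASK] per l'infezione."):
    print(f"{pred['token_str']!r} (score {pred['score']:.3f})")
```
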
Here are the results, summarized:

- RE (Relation Extraction; see the sketch after this list):
  - [CHEMPROT](http://refhub.elsevier.com/S1532-0464(23)00152-1/sb36) = 38.16%
  - [BioRED](http://refhub.elsevier.com/S1532-0464(23)00152-1/sb37) = 67.15%

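Both benchmarks are commonly cast as sequence classification over sentences with marked entity mentions; below is a minimal fine-tuning sketch under that assumption (the model id, label count, and entity markers are illustrative placeholders, not the paper's exact setup):

```python
# Sketch: RE framed as sequence classification with entity marker tokens.
# Model id, num_labels, and the marker convention are placeholders.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("IVN-RIN/bioBIT")
model = AutoModelForSequenceClassification.from_pretrained(
    "IVN-RIN/bioBIT",
    num_labels=5,  # e.g. one label per relation type, plus "no relation"
)

# Entity mentions are often wrapped in marker strings before encoding.
text = "Il @CHEMICAL$ inibisce la @GENE$ nelle cellule epatiche."
inputs = tokenizer(text, return_tensors="pt", truncation=True)
logits = model(**inputs).logits  # shape: (1, num_labels)
```
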
[Check the full paper](https://www.sciencedirect.com/science/article/pii/S1532046423001521) for further details, and feel free to contact us if you have any questions!