Detsutut committed on
Commit f388a85
1 Parent(s): eeb0823

Update README.md

Files changed (1):
  1. README.md +3 -4
README.md CHANGED
@@ -17,9 +17,9 @@ widget:
 
 🤗 + 📚🩺🇮🇹 = **BioBIT**
 
-In this repository you can download the **BioBIT** (Biomedical Bert for ITalian) checkpoint. You can find the full paper, with all details you need at [this link](https://www.sciencedirect.com/science/article/pii/S1532046423001521).
+From this repository you can download the **BioBIT** (Biomedical Bert for ITalian) checkpoint.
 
-BioBIT is created started from [Italian XXL BERT](https://huggingface.co/dbmdz/bert-base-italian-xxl-cased), obtained from a recent Wikipedia dump and various texts in Italian from the OPUS and OSCAR corpora collection, which sums up to the final corpus size of 81 GB and 13B tokens.
+BioBIT stems from [Italian XXL BERT](https://huggingface.co/dbmdz/bert-base-italian-xxl-cased), obtained from a recent Wikipedia dump and various texts in Italian from the OPUS and OSCAR corpora collection, which sums up to the final corpus size of 81 GB and 13B tokens.
 
 To pretrain BioBIT, we followed the general approach outlined in [BioBERT paper](https://arxiv.org/abs/1901.08746), built on the foundation of the BERT architecture. The pretraining objective is a combination of **MLM** (Masked Language Modelling) and **NSP** (Next Sentence Prediction). The MLM objective is based on randomly masking 15% of the input sequence, trying then to predict the missing tokens; for the NSP objective, instead, the model is given a couple of sentences and has to guess if the second comes after the first in the original document.
 
@@ -41,6 +41,5 @@ Here are the results, summarized:
 - RE:
   - [CHEMPROT](http://refhub.elsevier.com/S1532-0464(23)00152-1/sb36) = 38.16%
   - [BioRED](http://refhub.elsevier.com/S1532-0464(23)00152-1/sb37) = 67.15%
-More details in the paper.
 
-Feel free to contact us if you have some inquiry!
+[Check the full paper](https://www.sciencedirect.com/science/article/pii/S1532046423001521) for further details, and feel free to contact us if you have some inquiry!
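For readers landing on this commit page, here is a minimal sketch of how the checkpoint described in the updated README could be loaded and probed with the 🤗 `transformers` fill-mask pipeline. The repository id is a placeholder (the exact id is not stated in the diff), and the Italian sentence is invented for illustration only.

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

# Placeholder: replace with the identifier of this repository (or a local path
# to the downloaded checkpoint); the exact id is not stated in the diff above.
MODEL_ID = "<this-repository-id>"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForMaskedLM.from_pretrained(MODEL_ID)

# Exercise the MLM head on an (invented) Italian clinical sentence.
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
for prediction in fill_mask("Il paziente è stato trattato con [MASK] per l'ipertensione."):
    print(prediction["token_str"], round(prediction["score"], 3))
```

The fill-mask call simply exercises the MLM head that the pretraining paragraph in the README describes; it is not an evaluation of the model.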
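The pretraining paragraph in the diff (MLM over 15% of the tokens plus NSP on sentence pairs) can be made concrete with a short sketch built from standard `transformers` components. This is an assumption-laden illustration, not the authors' pretraining script: only the starting checkpoint name is taken from the README, and the sentence pair is invented.

```python
import torch
from transformers import AutoTokenizer, BertForPreTraining, DataCollatorForLanguageModeling

# Start from the general-domain Italian checkpoint named in the README;
# BertForPreTraining carries both the MLM and the NSP heads.
tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-xxl-cased")
model = BertForPreTraining.from_pretrained("dbmdz/bert-base-italian-xxl-cased")

# MLM: randomly mask 15% of the input tokens, as stated in the README.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

# NSP: a sentence pair, labelled 0 when sentence B really follows sentence A.
pair = tokenizer("Il paziente presenta febbre alta.",
                 "È stata prescritta una terapia antibiotica.",
                 return_tensors="pt")
batch = collator([{k: v.squeeze(0) for k, v in pair.items()}])
batch["next_sentence_label"] = torch.tensor([0])

outputs = model(**batch)
print(outputs.loss)  # single loss summing the MLM and NSP terms
```

`BertForPreTraining` returns one loss that adds the MLM and NSP terms, matching the "combination of MLM and NSP" wording in the README.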