Neuroinformatica committed • Commit 4ddfb04 • Parent(s): f388a85
Update README.md

README.md CHANGED
@@ -19,13 +19,13 @@ widget:
From this repository you can download the **BioBIT** (Biomedical Bert for ITalian) checkpoint.

- BioBIT stems from [Italian XXL BERT](https://huggingface.co/dbmdz/bert-base-italian-xxl-cased), obtained from a recent Wikipedia dump and various texts in Italian from the OPUS and OSCAR corpora collection, which sums up to the final corpus size of 81 GB and 13B tokens.
+ **BioBIT** stems from [Italian XXL BERT](https://huggingface.co/dbmdz/bert-base-italian-xxl-cased), which was trained on a recent Wikipedia dump and various Italian texts from the OPUS and OSCAR corpora, for a final corpus size of 81 GB and 13B tokens.

- To pretrain BioBIT
+ To pretrain **BioBIT**, we followed the general approach outlined in the [BioBERT paper](https://arxiv.org/abs/1901.08746), which builds on the BERT architecture. The pretraining objective is a combination of **MLM** (Masked Language Modelling) and **NSP** (Next Sentence Prediction). The MLM objective randomly masks 15% of the input tokens, which the model then has to predict; for the NSP objective, the model is given a pair of sentences and has to predict whether the second follows the first in the original document.

- Due to the unavailability of an Italian equivalent for the millions of abstracts and full-text scientific papers used by English, BERT-based biomedical models, in this work we leveraged machine translation to obtain an Italian biomedical corpus based on PubMed abstracts and train BioBIT
+ Due to the unavailability of an Italian equivalent for the millions of abstracts and full-text scientific papers used by English BERT-based biomedical models, we leveraged machine translation to obtain an Italian biomedical corpus from PubMed abstracts and used it to train **BioBIT**. More details are available in the paper.

- BioBIT has been evaluated on 3 downstream tasks: **NER** (Named Entity Recognition), extractive **QA** (Question Answering), **RE** (Relation Extraction).
+ **BioBIT** has been evaluated on 3 downstream tasks: **NER** (Named Entity Recognition), extractive **QA** (Question Answering), and **RE** (Relation Extraction).
Here are the results, summarized:
- NER:
  - [BC2GM](http://refhub.elsevier.com/S1532-0464(23)00152-1/sb32) = 82.14%
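For reference, here is a minimal usage sketch for the checkpoint described in the updated README, assuming the standard 🤗 Transformers API. The repo id `IVN-RIN/bioBIT` is an assumption; substitute this repository's actual id.

```python
# Minimal sketch: load the BioBIT checkpoint and fill a masked token.
# The repo id below is an assumption; use this repository's actual id.
from transformers import pipeline

fill = pipeline("fill-mask", model="IVN-RIN/bioBIT")

# BioBIT is a BERT-style masked language model, so it can fill [MASK] slots:
for pred in fill("Il paziente presenta una grave [MASK] renale."):
    print(pred["token_str"], round(pred["score"], 3))
```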
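The MLM objective described in the pretraining paragraph can be illustrated with the stock Transformers data collator. This is a sketch under the assumption of standard 15% random masking, not the authors' exact pretraining code:

```python
# Sketch of the MLM objective: mask 15% of input tokens and keep the originals
# as labels, so the model is trained to recover them. The NSP objective would
# additionally feed sentence pairs with a binary "is next sentence" label
# (see BertForPreTraining); it is omitted here for brevity.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-xxl-cased")
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

encoding = tokenizer("Il farmaco riduce la pressione arteriosa.")
masked = collator([encoding])  # randomly masks ~15% of the tokens

print(tokenizer.decode(masked["input_ids"][0]))
# masked["labels"] holds the original ids at masked positions and -100 elsewhere.
```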
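And a hedged sketch of how the NER evaluation could be set up: BioBIT with a token-classification head, fine-tuned on an entity-annotated corpus such as BC2GM. The repo id and BIO label set are illustrative assumptions, not the paper's exact scheme.

```python
# Sketch: attach a token-classification head for a BC2GM-style gene NER task.
# Repo id and BIO label names are assumptions for illustration.
from transformers import AutoModelForTokenClassification, AutoTokenizer

labels = ["O", "B-GENE", "I-GENE"]
model = AutoModelForTokenClassification.from_pretrained(
    "IVN-RIN/bioBIT",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
tokenizer = AutoTokenizer.from_pretrained("IVN-RIN/bioBIT")
# Fine-tune with the standard Trainer on token-labelled data, then report F1.
```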