Commit cce45a9 · Parent(s): df04637 · committed by mmarimon

Update README.md

Files changed (1): README.md (+111, -70)

README.md CHANGED
---

# Biomedical-clinical language model for Spanish

## Table of contents
<details>
<summary>Click to expand</summary>

- [Model Description](#model-description)
- [Intended Uses and Limitations](#intended-use)
- [How to Use](#how-to-use)
- [Limitations and bias](#limitations-and-bias)
- [Training](#training)
  - [Training Data](#training-data)
  - [Training Procedure](#training-procedure)
- [Evaluation](#evaluation)
- [Additional Information](#additional-information)
  - [Contact Information](#contact-information)
  - [Copyright](#copyright)
  - [Licensing Information](#licensing-information)
  - [Funding](#funding)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
  - [Disclaimer](#disclaimer)

</details>
## Model description

Biomedical pretrained language model for Spanish. For more details about the corpus, the pretraining and the evaluation, check the official [repository](https://github.com/PlanTL-SANIDAD/lm-biomedical-clinical-es) and read our [preprint](https://arxiv.org/abs/2109.03570): "_Carrino, C. P., Armengol-Estapé, J., Gutiérrez-Fandiño, A., Llop-Palao, J., Pàmies, M., Gonzalez-Agirre, A., & Villegas, M. (2021). Biomedical and Clinical Language Models for Spanish: On the Benefits of Domain-Specific Pretraining in a Mid-Resource Scenario._"

## Tokenization and model pretraining

This model is a [RoBERTa-based](https://github.com/pytorch/fairseq/tree/master/examples/roberta) model trained on a **biomedical-clinical** corpus in Spanish collected from several sources. The training corpus has been tokenized using a byte version of [Byte-Pair Encoding (BPE)](https://github.com/openai/gpt-2), as used in the original [RoBERTa](https://github.com/pytorch/fairseq/tree/master/examples/roberta) model, with a vocabulary size of 52,000 tokens. The pretraining consists of masked language model training at the subword level, following the approach employed for the RoBERTa base model with the same hyperparameters as in the original work. Training lasted a total of 48 hours on 16 NVIDIA V100 GPUs with 16 GB of memory each, using the Adam optimizer with a peak learning rate of 0.0005 and an effective batch size of 2,048 sentences.
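For illustration, the following minimal sketch (using the model identifier from the How to use section below) shows how the byte-level BPE tokenizer splits a clinical-style sentence into subwords; the example sentence is made up for demonstration purposes.

```python
from transformers import AutoTokenizer

# Load the model's byte-level BPE tokenizer (52,000-token vocabulary)
tokenizer = AutoTokenizer.from_pretrained("BSC-TeMU/roberta-base-biomedical-es")

# Inspect how an (illustrative) clinical sentence is split into subword tokens
tokens = tokenizer.tokenize("El paciente presenta hipertensión arterial.")
print(tokens)
print(tokenizer.convert_tokens_to_ids(tokens))
```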
## Intended uses & limitations

The model is ready to use only for masked language modelling, i.e. to perform the Fill Mask task (try the inference API or read the next section).

However, it is intended to be fine-tuned on downstream tasks such as Named Entity Recognition or Text Classification.
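As a minimal, hypothetical sketch of that fine-tuning path (not the official setup), the model can be loaded with a token-classification head for an NER task; the label set below is a placeholder and no training loop is shown.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Hypothetical label set for illustration; a real NER dataset defines its own labels
labels = ["O", "B-ENFERMEDAD", "I-ENFERMEDAD"]

tokenizer = AutoTokenizer.from_pretrained("BSC-TeMU/roberta-base-biomedical-es")
model = AutoModelForTokenClassification.from_pretrained(
    "BSC-TeMU/roberta-base-biomedical-es",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

# Untrained head: per-token logits before any fine-tuning
inputs = tokenizer("El paciente presenta hipertensión arterial.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.shape)  # (1, sequence_length, len(labels))
```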
## How to use

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

tokenizer = AutoTokenizer.from_pretrained("BSC-TeMU/roberta-base-biomedical-es")
model = AutoModelForMaskedLM.from_pretrained("BSC-TeMU/roberta-base-biomedical-es")

unmasker = pipeline('fill-mask', model="BSC-TeMU/roberta-base-biomedical-es")
unmasker("El único antecedente personal a reseñar era la <mask> arterial.")
```
```
# Output
[
  {
    "sequence": " El único antecedente personal a reseñar era la hipertensión arterial.",
    "score": 0.9855039715766907,
    "token": 3529,
    "token_str": " hipertensión"
  },
  {
    "sequence": " El único antecedente personal a reseñar era la diabetes arterial.",
    "score": 0.0039140828885138035,
    "token": 1945,
    "token_str": " diabetes"
  },
  {
    "sequence": " El único antecedente personal a reseñar era la hipotensión arterial.",
    "score": 0.002484665485098958,
    "token": 11483,
    "token_str": " hipotensión"
  },
  {
    "sequence": " El único antecedente personal a reseñar era la Hipertensión arterial.",
    "score": 0.0023484621196985245,
    "token": 12238,
    "token_str": " Hipertensión"
  },
  {
    "sequence": " El único antecedente personal a reseñar era la presión arterial.",
    "score": 0.0008009297889657319,
    "token": 2267,
    "token_str": " presión"
  }
]
```
## Limitations and bias

## Training

The training corpus is composed of several biomedical corpora in Spanish, collected from publicly available corpora and crawlers, and a real-world clinical corpus collected from more than 278K clinical documents and notes. To obtain a high-quality training corpus while retaining the idiosyncrasies of the clinical language, a cleaning pipeline has been applied only to the biomedical corpora, keeping the clinical corpus uncleaned. Essentially, the cleaning operations used are:

| PubMed | 1,858,966 | Open-access articles from the PubMed repository crawled in 2017. |

## Evaluation and results

The model has been evaluated on Named Entity Recognition (NER) using the following datasets:

| ICTUSnet | **88.08** - **84.92** - **91.50** | 86.75 - 83.53 - 90.23 | 85.95 - 83.10 - 89.02 |
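The three scores per cell correspond to F1, precision and recall. As an illustrative sketch only (the official evaluation code lives in the repository linked above and may differ), entity-level NER metrics of this kind can be computed with the seqeval library; the tag names below are made-up examples, not the labels of the actual datasets.

```python
# Illustrative only: entity-level P/R/F1 from predicted vs. gold BIO tag sequences
from seqeval.metrics import precision_score, recall_score, f1_score

y_true = [["B-ENFERMEDAD", "I-ENFERMEDAD", "O", "O"]]
y_pred = [["B-ENFERMEDAD", "I-ENFERMEDAD", "O", "B-ENFERMEDAD"]]

print("P: ", precision_score(y_true, y_pred))
print("R: ", recall_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))
```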
## Additional information

### Contact Information

For further information, send an email to <plantl-gob-es@bsc.es>

### Copyright

Copyright by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) (2022)

### Licensing information

[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)

### Funding

This work was funded by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) within the framework of the Plan-TL.

### Citation Information

If you use our models, please cite our latest preprint:

```bibtex

```

### Contributions

[N/A]

### Disclaimer