ccasimiro committed
Commit 780a5c1
1 Parent(s): 4e99154

Update README.md

Files changed (1):
  1. README.md +10 -58
README.md CHANGED
@@ -58,21 +58,24 @@ The result is a medium-size biomedical corpus for Spanish composed of about 963M
 
 ## Evaluation and results
 
-The model has been evaluated on Named Entity Recognition (NER) using the following datasets:
 
 - [PharmaCoNER](https://zenodo.org/record/4270158): a track on chemical and drug mention recognition from Spanish medical texts (for more information, see https://temu.bsc.es/pharmaconer/).
 
 - [CANTEMIST](https://zenodo.org/record/3978041#.YTt5qH2xXbQ): a shared task focusing on named entity recognition of tumor morphology in Spanish (for more information, see https://zenodo.org/record/3978041#.YTt5qH2xXbQ).
 
 - ICTUSnet: consists of 1,006 hospital discharge reports of patients admitted for stroke from 18 different Spanish hospitals. It contains more than 79,000 annotations for 51 different kinds of variables.
 
-The evaluation results are compared against the [mBERT](https://huggingface.co/bert-base-multilingual-cased) and [BETO](https://huggingface.co/dccuchile/bert-base-spanish-wwm-cased) models:
-
-| F1 - Precision - Recall | roberta-base-biomedical-es        | mBERT                 | BETO                  |
-|-------------------------|-----------------------------------|-----------------------|-----------------------|
-| PharmaCoNER             | **89.48** - **87.85** - **91.18** | 87.46 - 86.50 - 88.46 | 88.18 - 87.12 - 89.28 |
-| CANTEMIST               | **83.87** - **81.70** - **86.17** | 82.61 - 81.12 - 84.15 | 82.42 - 80.91 - 84.00 |
-| ICTUSnet                | **88.12** - **85.56** - **90.83** | 86.75 - 83.53 - 90.23 | 85.95 - 83.10 - 89.02 |
 
 ## Intended uses & limitations
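The tables in this card report F1 alongside precision and recall. As a quick sanity check (a minimal, model-free sketch; the `f1` helper below is written for illustration, not taken from the repository), F1 is the harmonic mean of the two:

```python
# Quick sanity check of the reported scores: F1 is the harmonic mean of
# precision (P) and recall (R).
def f1(p, r):
    return 2 * p * r / (p + r)

# roberta-base-biomedical-es on PharmaCoNER: P = 87.85, R = 91.18
print(round(f1(87.85, 91.18), 2))  # 89.48, matching the table
```

The same check reproduces the other rows, e.g. ICTUSnet's 88.12 from P = 85.56 and R = 90.83.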
@@ -86,57 +89,6 @@ To be announced soon.
 
 ---
 
-## How to use
-
-```python
-from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline
-
-tokenizer = AutoTokenizer.from_pretrained("PlanTL-GOB-ES/roberta-base-biomedical-es")
-model = AutoModelForMaskedLM.from_pretrained("PlanTL-GOB-ES/roberta-base-biomedical-es")
-
-unmasker = pipeline('fill-mask', model="PlanTL-GOB-ES/roberta-base-biomedical-es")
-
-unmasker("El único antecedente personal a reseñar era la <mask> arterial.")
-```
-```
-# Output
-[
-  {
-    "sequence": " El único antecedente personal a reseñar era la hipertensión arterial.",
-    "score": 0.9855039715766907,
-    "token": 3529,
-    "token_str": " hipertensión"
-  },
-  {
-    "sequence": " El único antecedente personal a reseñar era la diabetes arterial.",
-    "score": 0.0039140828885138035,
-    "token": 1945,
-    "token_str": " diabetes"
-  },
-  {
-    "sequence": " El único antecedente personal a reseñar era la hipotensión arterial.",
-    "score": 0.002484665485098958,
-    "token": 11483,
-    "token_str": " hipotensión"
-  },
-  {
-    "sequence": " El único antecedente personal a reseñar era la Hipertensión arterial.",
-    "score": 0.0023484621196985245,
-    "token": 12238,
-    "token_str": " Hipertensión"
-  },
-  {
-    "sequence": " El único antecedente personal a reseñar era la presión arterial.",
-    "score": 0.0008009297889657319,
-    "token": 2267,
-    "token_str": " presión"
-  }
-]
-```
-
 ## Funding
 This work was funded by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) within the framework of the Plan-TL.
 
 
 
 ## Evaluation and results
 
+The models have been fine-tuned on three Named Entity Recognition (NER) tasks using three clinical NER datasets:
 
 - [PharmaCoNER](https://zenodo.org/record/4270158): a track on chemical and drug mention recognition from Spanish medical texts (for more information, see https://temu.bsc.es/pharmaconer/).
 
 - [CANTEMIST](https://zenodo.org/record/3978041#.YTt5qH2xXbQ): a shared task focusing on named entity recognition of tumor morphology in Spanish (for more information, see https://zenodo.org/record/3978041#.YTt5qH2xXbQ).
 
 - ICTUSnet: consists of 1,006 hospital discharge reports of patients admitted for stroke from 18 different Spanish hospitals. It contains more than 79,000 annotations for 51 different kinds of variables.
+
+We addressed the NER task as a token classification problem using a standard linear layer along with the BIO tagging schema. We compared our models with the general-domain Spanish model [roberta-base-bne](https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne), the general-domain multilingual model that supports Spanish, [mBERT](https://huggingface.co/bert-base-multilingual-cased), the domain-specific English model [BioBERT](https://huggingface.co/dmis-lab/biobert-base-cased-v1.2), and three domain-specific models based on continual pre-training: [mBERT-Galén](https://ieeexplore.ieee.org/document/9430499), [XLM-R-Galén](https://ieeexplore.ieee.org/document/9430499) and [BETO-Galén](https://ieeexplore.ieee.org/document/9430499).
+The table below shows the F1 scores obtained:
 
+| Tasks/Models | bsc-bio-es | bsc-bio-ehr-es | XLM-R-Galén | BETO-Galén | mBERT-Galén | mBERT  | BioBERT | roberta-base-bne |
+|--------------|------------|----------------|-------------|------------|-------------|--------|---------|------------------|
+| PharmaCoNER  | 0.8907     | **0.8913**     | 0.8754      | 0.8537     | 0.8594      | 0.8671 | 0.8545  | 0.8474           |
+| CANTEMIST    | 0.8220     | **0.8340**     | 0.8078      | 0.8153     | 0.8168      | 0.8116 | 0.8070  | 0.7875           |
+| ICTUSnet     | 0.8727     | **0.8756**     | 0.8716      | 0.8498     | 0.8509      | 0.8631 | 0.8521  | 0.8677           |
 
+The fine-tuning scripts can be found in the official GitHub [repository](https://github.com/PlanTL-GOB-ES/lm-biomedical-clinical-es).
 
 ## Intended uses & limitations
 
 ---
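The BIO tagging schema mentioned above can be illustrated with a small self-contained sketch. The `spans_to_bio` helper, the example sentence, and the entity labels are hypothetical and written for illustration only; they are not part of the fine-tuning code:

```python
# Toy illustration of the BIO (Begin/Inside/Outside) tagging schema used to
# cast NER as token classification: each token gets B-<label> if it starts a
# mention, I-<label> if it continues one, and O otherwise.
# NOTE: spans_to_bio is a hypothetical helper, not repository code.

def spans_to_bio(tokens, entity_spans):
    """Map token-level entity spans (start, end_exclusive, label) to BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, label in entity_spans:
        tags[start] = f"B-{label}"           # first token of the mention
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"           # remaining tokens of the mention
    return tags

tokens = ["Paciente", "con", "hipertensión", "arterial", "tratado", "con", "enalapril"]
# "hipertensión arterial" (tokens 2-3) as a disease mention, "enalapril" (token 6) as a drug
spans = [(2, 4, "DISEASE"), (6, 7, "DRUG")]
print(spans_to_bio(tokens, spans))
# ['O', 'O', 'B-DISEASE', 'I-DISEASE', 'O', 'O', 'B-DRUG']
```

A linear classification head then predicts one of these tags per token, which is the standard token-classification setup the card describes.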
 ## Funding
 This work was funded by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) within the framework of the Plan-TL.