ccasimiro committed
Commit 780a5c1
1 Parent(s): 4e99154

Update README.md

Files changed (1):
  1. README.md +10 -58
README.md CHANGED
@@ -58,21 +58,24 @@ The result is a medium-size biomedical corpus for Spanish composed of about 963M
 
 ## Evaluation and results
 
-The model has been evaluated on Named Entity Recognition (NER) using the following datasets:
 
 - [PharmaCoNER](https://zenodo.org/record/4270158): a track on chemical and drug mention recognition from Spanish medical texts (for more information, see https://temu.bsc.es/pharmaconer/).
 
 - [CANTEMIST](https://zenodo.org/record/3978041#.YTt5qH2xXbQ): a shared task focusing on named entity recognition of tumor morphology in Spanish (for more information, see https://zenodo.org/record/3978041#.YTt5qH2xXbQ).
 
 - ICTUSnet: consists of 1,006 hospital discharge reports of patients admitted for stroke from 18 different Spanish hospitals. It contains more than 79,000 annotations for 51 different kinds of variables.
 
-The evaluation results are compared against the [mBERT](https://huggingface.co/bert-base-multilingual-cased) and [BETO](https://huggingface.co/dccuchile/bert-base-spanish-wwm-cased) models:
-
-| F1 - Precision - Recall | roberta-base-biomedical-es        | mBERT                 | BETO                  |
-|-------------------------|-----------------------------------|-----------------------|-----------------------|
-| PharmaCoNER             | **89.48** - **87.85** - **91.18** | 87.46 - 86.50 - 88.46 | 88.18 - 87.12 - 89.28 |
-| CANTEMIST               | **83.87** - **81.70** - **86.17** | 82.61 - 81.12 - 84.15 | 82.42 - 80.91 - 84.00 |
-| ICTUSnet                | **88.12** - **85.56** - **90.83** | 86.75 - 83.53 - 90.23 | 85.95 - 83.10 - 89.02 |
 
 ## Intended uses & limitations
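The tables in this card report F1 alongside precision and recall. As a quick sanity check (a minimal, model-free sketch; the `f1` helper below is written for illustration, not taken from the repository), F1 is the harmonic mean of the two:

```python
# Quick sanity check of the reported scores: F1 is the harmonic mean of
# precision (P) and recall (R).
def f1(p, r):
    return 2 * p * r / (p + r)

# roberta-base-biomedical-es on PharmaCoNER: P = 87.85, R = 91.18
print(round(f1(87.85, 91.18), 2))  # 89.48, matching the table
```

The same check reproduces the other rows, e.g. ICTUSnet's 88.12 from P = 85.56 and R = 90.83.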
@@ -86,57 +89,6 @@ To be announced soon.
 
 ---
 
-## How to use
-
-```python
-from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline
-
-tokenizer = AutoTokenizer.from_pretrained("PlanTL-GOB-ES/roberta-base-biomedical-es")
-model = AutoModelForMaskedLM.from_pretrained("PlanTL-GOB-ES/roberta-base-biomedical-es")
-
-unmasker = pipeline('fill-mask', model="PlanTL-GOB-ES/roberta-base-biomedical-es")
-
-unmasker("El único antecedente personal a reseñar era la <mask> arterial.")
-```
-```
-# Output
-[
-  {
-    "sequence": " El único antecedente personal a reseñar era la hipertensión arterial.",
-    "score": 0.9855039715766907,
-    "token": 3529,
-    "token_str": " hipertensión"
-  },
-  {
-    "sequence": " El único antecedente personal a reseñar era la diabetes arterial.",
-    "score": 0.0039140828885138035,
-    "token": 1945,
-    "token_str": " diabetes"
-  },
-  {
-    "sequence": " El único antecedente personal a reseñar era la hipotensión arterial.",
-    "score": 0.002484665485098958,
-    "token": 11483,
-    "token_str": " hipotensión"
-  },
-  {
-    "sequence": " El único antecedente personal a reseñar era la Hipertensión arterial.",
-    "score": 0.0023484621196985245,
-    "token": 12238,
-    "token_str": " Hipertensión"
-  },
-  {
-    "sequence": " El único antecedente personal a reseñar era la presión arterial.",
-    "score": 0.0008009297889657319,
-    "token": 2267,
-    "token_str": " presión"
-  }
-]
-```
-
 ## Funding
 This work was funded by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) within the framework of the Plan-TL.
 
 
 
 ## Evaluation and results
 
+The models have been fine-tuned on three Named Entity Recognition (NER) tasks using three clinical NER datasets:
 
 - [PharmaCoNER](https://zenodo.org/record/4270158): a track on chemical and drug mention recognition from Spanish medical texts (for more information, see https://temu.bsc.es/pharmaconer/).
 
 - [CANTEMIST](https://zenodo.org/record/3978041#.YTt5qH2xXbQ): a shared task focusing on named entity recognition of tumor morphology in Spanish (for more information, see https://zenodo.org/record/3978041#.YTt5qH2xXbQ).
 
 - ICTUSnet: consists of 1,006 hospital discharge reports of patients admitted for stroke from 18 different Spanish hospitals. It contains more than 79,000 annotations for 51 different kinds of variables.
+
+We addressed the NER task as a token classification problem using a standard linear layer along with the BIO tagging schema. We compared our models with the general-domain Spanish model [roberta-base-bne](https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne), the general-domain multilingual model that supports Spanish, [mBERT](https://huggingface.co/bert-base-multilingual-cased), the domain-specific English model [BioBERT](https://huggingface.co/dmis-lab/biobert-base-cased-v1.2), and three domain-specific models based on continual pre-training: [mBERT-Galén](https://ieeexplore.ieee.org/document/9430499), [XLM-R-Galén](https://ieeexplore.ieee.org/document/9430499) and [BETO-Galén](https://ieeexplore.ieee.org/document/9430499).
+The table below shows the F1 scores obtained:
 
+| Tasks/Models | bsc-bio-es | bsc-bio-ehr-es | XLM-R-Galén | BETO-Galén | mBERT-Galén | mBERT  | BioBERT | roberta-base-bne |
+|--------------|------------|----------------|-------------|------------|-------------|--------|---------|------------------|
+| PharmaCoNER  | 0.8907     | **0.8913**     | 0.8754      | 0.8537     | 0.8594      | 0.8671 | 0.8545  | 0.8474           |
+| CANTEMIST    | 0.8220     | **0.8340**     | 0.8078      | 0.8153     | 0.8168      | 0.8116 | 0.8070  | 0.7875           |
+| ICTUSnet     | 0.8727     | **0.8756**     | 0.8716      | 0.8498     | 0.8509      | 0.8631 | 0.8521  | 0.8677           |
 
+The fine-tuning scripts can be found in the official GitHub [repository](https://github.com/PlanTL-GOB-ES/lm-biomedical-clinical-es).
 
 ## Intended uses & limitations
 
 ---
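The BIO tagging schema mentioned above can be illustrated with a small self-contained sketch. The `spans_to_bio` helper, the example sentence, and the entity labels are hypothetical and written for illustration only; they are not part of the fine-tuning code:

```python
# Toy illustration of the BIO (Begin/Inside/Outside) tagging schema used to
# cast NER as token classification: each token gets B-<label> if it starts a
# mention, I-<label> if it continues one, and O otherwise.
# NOTE: spans_to_bio is a hypothetical helper, not repository code.

def spans_to_bio(tokens, entity_spans):
    """Map token-level entity spans (start, end_exclusive, label) to BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, label in entity_spans:
        tags[start] = f"B-{label}"           # first token of the mention
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"           # remaining tokens of the mention
    return tags

tokens = ["Paciente", "con", "hipertensión", "arterial", "tratado", "con", "enalapril"]
# "hipertensión arterial" (tokens 2-3) as a disease mention, "enalapril" (token 6) as a drug
spans = [(2, 4, "DISEASE"), (6, 7, "DRUG")]
print(spans_to_bio(tokens, spans))
# ['O', 'O', 'B-DISEASE', 'I-DISEASE', 'O', 'O', 'B-DRUG']
```

A linear classification head then predicts one of these tags per token, which is the standard token-classification setup the card describes.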
 ## Funding
 This work was funded by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) within the framework of the Plan-TL.