Commit cce45a9 · Parent(s): df04637 · committed by mmarimon

Update README.md

Files changed (1): README.md (+111, -70)

README.md CHANGED
---

# Biomedical-clinical language model for Spanish

## Table of contents
<details>
<summary>Click to expand</summary>

- [Model Description](#model-description)
- [Intended Uses and Limitations](#intended-use)
- [How to Use](#how-to-use)
- [Limitations and bias](#limitations-and-bias)
- [Training](#training)
  - [Training Data](#training-data)
  - [Training Procedure](#training-procedure)
- [Evaluation](#evaluation)
- [Additional Information](#additional-information)
  - [Contact Information](#contact-information)
  - [Copyright](#copyright)
  - [Licensing Information](#licensing-information)
  - [Funding](#funding)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)
  - [Disclaimer](#disclaimer)

</details>
## Model description

Biomedical pretrained language model for Spanish. For more details about the corpus, the pretraining and the evaluation, check the official [repository](https://github.com/PlanTL-SANIDAD/lm-biomedical-clinical-es) and read our [preprint](https://arxiv.org/abs/2109.03570): "_Carrino, C. P., Armengol-Estapé, J., Gutiérrez-Fandiño, A., Llop-Palao, J., Pàmies, M., Gonzalez-Agirre, A., & Villegas, M. (2021). Biomedical and Clinical Language Models for Spanish: On the Benefits of Domain-Specific Pretraining in a Mid-Resource Scenario._"

## Tokenization and model pretraining

This model is a [RoBERTa-based](https://github.com/pytorch/fairseq/tree/master/examples/roberta) model trained on a **biomedical-clinical** corpus in Spanish collected from several sources. The training corpus has been tokenized using a byte version of [Byte-Pair Encoding (BPE)](https://github.com/openai/gpt-2), as used in the original [RoBERTa](https://github.com/pytorch/fairseq/tree/master/examples/roberta) model, with a vocabulary size of 52,000 tokens. The pretraining consists of masked language model training at the subword level, following the approach employed for the RoBERTa base model with the same hyperparameters as in the original work. Training lasted a total of 48 hours on 16 NVIDIA V100 GPUs with 16 GB of memory each, using the Adam optimizer with a peak learning rate of 0.0005 and an effective batch size of 2,048 sentences.
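For illustration, the following minimal sketch (using the model identifier from the How to use section below) shows how the byte-level BPE tokenizer splits a clinical-style sentence into subwords; the example sentence is made up for demonstration purposes.

```python
from transformers import AutoTokenizer

# Load the model's byte-level BPE tokenizer (52,000-token vocabulary)
tokenizer = AutoTokenizer.from_pretrained("BSC-TeMU/roberta-base-biomedical-es")

# Inspect how an (illustrative) clinical sentence is split into subword tokens
tokens = tokenizer.tokenize("El paciente presenta hipertensión arterial.")
print(tokens)
print(tokenizer.convert_tokens_to_ids(tokens))
```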
## Intended uses & limitations

The model is ready to use only for masked language modelling, i.e. to perform the Fill Mask task (try the inference API or read the next section).

However, it is intended to be fine-tuned on downstream tasks such as Named Entity Recognition or Text Classification.
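As a minimal, hypothetical sketch of that fine-tuning path (not the official setup), the model can be loaded with a token-classification head for an NER task; the label set below is a placeholder and no training loop is shown.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Hypothetical label set for illustration; a real NER dataset defines its own labels
labels = ["O", "B-ENFERMEDAD", "I-ENFERMEDAD"]

tokenizer = AutoTokenizer.from_pretrained("BSC-TeMU/roberta-base-biomedical-es")
model = AutoModelForTokenClassification.from_pretrained(
    "BSC-TeMU/roberta-base-biomedical-es",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

# Untrained head: per-token logits before any fine-tuning
inputs = tokenizer("El paciente presenta hipertensión arterial.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.shape)  # (1, sequence_length, len(labels))
```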
## How to use

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

tokenizer = AutoTokenizer.from_pretrained("BSC-TeMU/roberta-base-biomedical-es")
model = AutoModelForMaskedLM.from_pretrained("BSC-TeMU/roberta-base-biomedical-es")

unmasker = pipeline('fill-mask', model="BSC-TeMU/roberta-base-biomedical-es")
unmasker("El único antecedente personal a reseñar era la <mask> arterial.")
```
```
# Output
[
  {
    "sequence": " El único antecedente personal a reseñar era la hipertensión arterial.",
    "score": 0.9855039715766907,
    "token": 3529,
    "token_str": " hipertensión"
  },
  {
    "sequence": " El único antecedente personal a reseñar era la diabetes arterial.",
    "score": 0.0039140828885138035,
    "token": 1945,
    "token_str": " diabetes"
  },
  {
    "sequence": " El único antecedente personal a reseñar era la hipotensión arterial.",
    "score": 0.002484665485098958,
    "token": 11483,
    "token_str": " hipotensión"
  },
  {
    "sequence": " El único antecedente personal a reseñar era la Hipertensión arterial.",
    "score": 0.0023484621196985245,
    "token": 12238,
    "token_str": " Hipertensión"
  },
  {
    "sequence": " El único antecedente personal a reseñar era la presión arterial.",
    "score": 0.0008009297889657319,
    "token": 2267,
    "token_str": " presión"
  }
]
```
## Limitations and bias

## Training

The training corpus is composed of several biomedical corpora in Spanish, collected from publicly available corpora and crawlers, and a real-world clinical corpus collected from more than 278K clinical documents and notes. To obtain a high-quality training corpus while retaining the idiosyncrasies of the clinical language, a cleaning pipeline has been applied only to the biomedical corpora, keeping the clinical corpus uncleaned. Essentially, the cleaning operations used are:

| PubMed | 1,858,966 | Open-access articles from the PubMed repository crawled in 2017. |

## Evaluation and results

The model has been evaluated on Named Entity Recognition (NER) using the following datasets:

| ICTUSnet | **88.08** - **84.92** - **91.50** | 86.75 - 83.53 - 90.23 | 85.95 - 83.10 - 89.02 |
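The three scores per cell correspond to F1, precision and recall. As an illustrative sketch only (the official evaluation code lives in the repository linked above and may differ), entity-level NER metrics of this kind can be computed with the seqeval library; the tag names below are made-up examples, not the labels of the actual datasets.

```python
# Illustrative only: entity-level P/R/F1 from predicted vs. gold BIO tag sequences
from seqeval.metrics import precision_score, recall_score, f1_score

y_true = [["B-ENFERMEDAD", "I-ENFERMEDAD", "O", "O"]]
y_pred = [["B-ENFERMEDAD", "I-ENFERMEDAD", "O", "B-ENFERMEDAD"]]

print("P: ", precision_score(y_true, y_pred))
print("R: ", recall_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))
```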
## Additional information

### Contact Information

For further information, send an email to <plantl-gob-es@bsc.es>

### Copyright

Copyright by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) (2022)

### Licensing information

[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)

### Funding

This work was funded by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) within the framework of the Plan-TL.

### Citation Information

If you use our models, please cite our latest preprint:

```bibtex

```

### Contributions

[N/A]

### Disclaimer