mmarimon committed
Commit 1de8d3d
1 Parent(s): 971e2b2

Update README.md

Files changed (1): README.md (+60 -14)

README.md CHANGED
@@ -15,10 +15,49 @@ widget:
  ---
 
  # Biomedical language model for Spanish
+
+ ## Table of contents
+ <details>
+ <summary>Click to expand</summary>
+ - [Model Description](#model-description)
+ - [Intended Uses and Limitations](#intended-use)
+ - [How to Use](#how-to-use)
+ - [Limitations and bias](#limitations-and-bias)
+ - [Training](#training)
+ - [Training Data](#training-data)
+ - [Training Procedure](#training-procedure)
+ - [Evaluation](#evaluation)
+ - [Additional Information](#additional-information)
+ - [Contact Information](#contact-information)
+ - [Copyright](#copyright)
+ - [Licensing Information](#licensing-information)
+ - [Funding](#funding)
+ - [Citation Information](#citation-information)
+ - [Contributions](#contributions)
+ - [Disclaimer](#disclaimer)
+ </details>
+
+
+ ## Model description
+
  Biomedical pretrained language model for Spanish. For more details about the corpus, the pretraining and the evaluation, check the official [repository](https://github.com/PlanTL-GOB-ES/lm-biomedical-clinical-es).
 
+ ## Intended uses & limitations
+
+ The model is ready to use only for masked language modelling, i.e. the Fill Mask task (try the inference API or read the next section).
+
+ However, it is intended to be fine-tuned on downstream tasks such as Named Entity Recognition or Text Classification.
+
+ ## How to Use
+
+
+ ## Limitations and bias
+
+
+ ## Training
+
 
- ## Tokenization and model pretraining
+ ### Tokenization and model pretraining
 
  This model is a [RoBERTa-based](https://github.com/pytorch/fairseq/tree/master/examples/roberta) model trained on a
  **biomedical** corpus in Spanish collected from several sources (see next section).
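The newly added "How to Use" section above is left empty in this commit. As a purely illustrative sketch of the fill-mask usage that the "Intended uses & limitations" paragraph points to, something like the snippet below could fill it. The checkpoint name `PlanTL-GOB-ES/bsc-bio-es` is an assumption taken from the `bsc-bio-es` column of the evaluation table, and the Spanish example sentence is made up for illustration.

```python
# Hypothetical fill-mask example for the empty "How to Use" section.
# The checkpoint name is assumed; adjust it to the model's actual Hub identifier.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="PlanTL-GOB-ES/bsc-bio-es")

# RoBERTa-style models use "<mask>" as the mask token.
for prediction in unmasker("El único antecedente a reseñar era la <mask> arterial."):
    print(f"{prediction['token_str']!r}  score={prediction['score']:.3f}")
```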
@@ -26,7 +65,7 @@ The training corpus has been tokenized using a byte version of [Byte-Pair Encodi
  used in the original [RoBERTA](https://github.com/pytorch/fairseq/tree/master/examples/roberta) model with a vocabulary size of 52,000 tokens. The pretraining consists of a masked language model training at the subword level following the approach employed for the RoBERTa base model with the same hyperparameters as in the original work. The training lasted a total of 48 hours with 16 NVIDIA V100 GPUs of 16GB DDRAM, using Adam optimizer with a peak learning rate of 0.0005 and an effective batch size of 2,048 sentences.
 
 
- ## Training corpora and preprocessing
+ ### Training corpora and preprocessing
 
  The training corpus is composed of several biomedical corpora in Spanish, collected from publicly available corpora and crawlers.
  To obtain a high-quality training corpus, a cleaning pipeline with the following operations has been applied:
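As a side note to the tokenization paragraph in the hunk above, the byte-level BPE vocabulary of roughly 52,000 tokens can be inspected directly once the checkpoint is loaded. A minimal sketch, again assuming the unconfirmed `PlanTL-GOB-ES/bsc-bio-es` identifier:

```python
# Sketch: inspect the byte-level BPE tokenizer described above.
# "PlanTL-GOB-ES/bsc-bio-es" is an assumed identifier, not stated in this diff.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("PlanTL-GOB-ES/bsc-bio-es")
print(tokenizer.vocab_size)  # expected to be around 52,000 per the model card

# Byte-level BPE splits out-of-vocabulary biomedical terms into subword units.
print(tokenizer.tokenize("paciente con insuficiencia cardíaca congestiva"))
```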
@@ -67,6 +106,7 @@ The model has been fine-tuned on three Named Entity Recognition (NER) tasks usin
  - ICTUSnet: consists of 1,006 hospital discharge reports of patients admitted for stroke from 18 different Spanish hospitals. It contains more than 79,000 annotations for 51 different kinds of variables.
 
  We addressed the NER task as a token classification problem using a standard linear layer along with the BIO tagging schema. We compared our models with the general-domain Spanish [roberta-base-bne](https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne), the general-domain multilingual model that supports Spanish [mBERT](https://huggingface.co/bert-base-multilingual-cased), the domain-specific English model [BioBERT](https://huggingface.co/dmis-lab/biobert-base-cased-v1.2), and three domain-specific models based on continual pre-training, [mBERT-Galén](https://ieeexplore.ieee.org/document/9430499), [XLM-R-Galén](https://ieeexplore.ieee.org/document/9430499) and [BETO-Galén](https://ieeexplore.ieee.org/document/9430499).
+
  The table below shows the F1 scores obtained:
 
  | Tasks/Models | bsc-bio-es | XLM-R-Galén | BETO-Galén | mBERT-Galén | mBERT | BioBERT | roberta-base-bne |
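For the token-classification setup described in the hunk above, an encoder topped with a standard linear layer over BIO tags, a minimal sketch with `transformers` could look as follows. The label set and the `PlanTL-GOB-ES/bsc-bio-es` identifier are hypothetical placeholders; the actual fine-tuning follows the scripts in the official repository.

```python
# Sketch of the NER setup described above: a pretrained encoder with a linear
# token-classification head over BIO tags. Labels below are hypothetical.
from transformers import AutoModelForTokenClassification, AutoTokenizer

labels = ["O", "B-ENFERMEDAD", "I-ENFERMEDAD"]  # placeholder BIO tag set
model = AutoModelForTokenClassification.from_pretrained(
    "PlanTL-GOB-ES/bsc-bio-es",  # assumed checkpoint name
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
tokenizer = AutoTokenizer.from_pretrained("PlanTL-GOB-ES/bsc-bio-es")
# Fine-tuning itself (Trainer, datasets, metrics) follows the official scripts.
```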
@@ -78,13 +118,25 @@ The table below shows the F1 scores obtained:
  The fine-tuning scripts can be found in the official GitHub [repository](https://github.com/PlanTL-GOB-ES/lm-biomedical-clinical-es).
 
 
- ## Intended uses & limitations
+ ## Additional information
 
- The model is ready-to-use only for masked language modelling to perform the Fill Mask task (try the inference API or read the next section)
+ ### Contact Information
 
- However, the is intended to be fine-tuned on downstream tasks such as Named Entity Recognition or Text Classification.
+ For further information, send an email to <plantl-gob-es@bsc.es>
+
+ ### Copyright
+
+ Copyright by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) (2022)
+
+ ### Licensing information
+
+ [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
+
+ ### Funding
 
- ## Cite
+ This work was funded by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) within the framework of the Plan-TL.
+
+ ### Cite
  If you use these models, please cite our work:
 
  ```bibtext
@@ -112,17 +164,11 @@ If you use these models, please cite our work:
  ```
  ---
 
- ## Copyright
 
- Copyright by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) (2022)
+ ### Contributions
 
- ## Licensing information
+ [N/A]
 
- [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
-
- ## Funding
-
- This work was funded by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) within the framework of the Plan-TL.
 
  ### Disclaimer
 