---
license: apache-2.0
language:
- en
- es
datasets:
- HiTZ/Multilingual-Medical-Corpus
tags:
- biomedical
- medical
- clinical
- spanish
- multilingual
widget:
- text: "Las radiologías óseas de cuerpo entero no detectan alteraciones <mask>, ni alteraciones vertebrales."
- text: "En el <mask> toraco-abdómino-pélvico no se encontraron hallazgos patológicos de interés."
- text: "Percutaneous transluminal coronary <mask> of both the LAD and Cx was done."
- text: "Clinical examination showed a non-pulsatile, painless, axial, irreducible exophthalmia with no sign of <mask> or keratitis, and right monocular blindness, right ptosis."
---
<br>
<div style="text-align: center;">
<img src="https://huggingface.co/HiTZ/EriBERTa-base/resolve/main/eriberta_icon.png" style="height: 175px;display: block;margin-left: auto;margin-right: auto;">
</div>
<h1 style="text-align: center;">
<b>EriBERTa</b>
<br>
A Bilingual Pre-Trained Language Model
<br>
for Clinical Natural Language Processing
</h1>
<br>
<p style="text-align: justify;">
We introduce EriBERTa, a bilingual domain-specific language model pre-trained on extensive medical and clinical corpora. We demonstrate that EriBERTa outperforms previous Spanish language models in the clinical domain, showcasing its superior capabilities in understanding medical texts and extracting meaningful information.
Moreover, EriBERTa exhibits promising transfer learning abilities, allowing for knowledge transfer from one language to another. This aspect is particularly beneficial given the scarcity of Spanish clinical data.
</p>
- 📖 Paper: [EriBERTa: A Bilingual Pre-Trained Language Model for Clinical Natural Language Processing](https://arxiv.org/abs/2306.07373)
## How to Get Started with the Model
You can load the model using:
```python
# Load the tokenizer and the masked-language-model weights from the Hugging Face Hub.
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("HiTZ/EriBERTa-base")
model = AutoModelForMaskedLM.from_pretrained("HiTZ/EriBERTa-base")
```
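Since the model is trained for mask filling, a quick sanity check is to run one of the widget sentences through the `fill-mask` pipeline from `transformers`. The following is a minimal sketch rather than an official usage recipe: the helper names `load_fill_mask`, `top_tokens`, and `demo` are hypothetical, and the mask token is read from the tokenizer instead of being hard-coded.

```python
def load_fill_mask(model_id="HiTZ/EriBERTa-base"):
    """Build a fill-mask pipeline (downloads the checkpoint on first use)."""
    # Imported lazily so that top_tokens() can be used without transformers installed.
    from transformers import pipeline
    return pipeline("fill-mask", model=model_id)


def top_tokens(predictions, k=3):
    """Return (token, score) pairs for the k highest-scoring predictions.

    `predictions` is the list of dicts returned by a fill-mask pipeline for a
    single-mask input; each dict carries "token_str" and "score" keys.
    """
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [(p["token_str"].strip(), round(p["score"], 4)) for p in ranked[:k]]


def demo():
    """Hypothetical demo: fill the mask in one of the widget sentences."""
    fill_mask = load_fill_mask()
    mask = fill_mask.tokenizer.mask_token  # avoids hard-coding the mask string
    text = f"Percutaneous transluminal coronary {mask} of both the LAD and Cx was done."
    for token, score in top_tokens(fill_mask(text)):
        print(f"{token}\t{score}")
```

Calling `demo()` downloads the checkpoint and prints the three most likely fillers with their scores. The actual tokens and scores depend on the released weights, so none are shown here.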
# Model Description
- **Developed by**: Iker De la Iglesia, Aitziber Atutxa, Koldo Gojenola, and Ander Barrena
- **Contact**: [Iker De la Iglesia](mailto:iker.delaiglesia@ehu.eus) and [Ander Barrena](mailto:ander.barrena@ehu.eus)
- **Language(s) (NLP)**: English, Spanish
- **License**: apache-2.0
- **Funding**:
  - The Spanish Ministry of Science and Innovation, MCIN/AEI/ 10.13039/501100011033/FEDER projects:
    - Proyectos de Generación de Conocimiento 2022 (EDHIA PID2022-136522OB-C22)
    - DOTT-HEALTH/PAT-MED PID2019-543106942RB-C3.
    - EU NextGeneration EU/PRTR (DeepR3 TED2021-130295B-C31, ANTIDOTE PCI2020-120717-2 EU ERA-Net CHIST-ERA).
  - Basque Government:
    - IXA IT1570-22.
## Model Details
<table style="border: 1px; border-collapse: collapse;">
<caption>Pre-Training settings for EriBERTa-base.</caption>
<tbody>
<tr>
<td>Param. no.</td>
<td>~125M</td>
</tr>
<tr>
<td>Vocabulary size</td>
<td>64k</td>
</tr>
<tr>
<td>Sequence Length</td>
<td>512</td>
</tr>
<tr>
<td>Token/step</td>
<td>2M</td>
</tr>
<tr>
<td>Steps</td>
<td>125k</td>
</tr>
<tr>
<td>Total Tokens</td>
<td>4.5B</td>
</tr>
<tr>
<td>Scheduler</td>
<td>Linear with Warm-up</td>
</tr>
<tr>
<td>Peak LR</td>
<td>2.683e-4</td>
</tr>
<tr>
<td>Warm-up Steps</td>
<td>7.5k</td>
</tr>
</tbody>
</table>
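
To make the scheduler settings in the table concrete, the sketch below computes the learning rate at a given step from the listed peak LR, warm-up steps, and total steps. It assumes the rate rises linearly from zero during warm-up and then decays linearly to zero at the final step. The table only says "Linear with Warm-up", so the decay-to-zero endpoint is an assumption, and `linear_warmup_lr` is a hypothetical helper, not the training code.

```python
PEAK_LR = 2.683e-4      # "Peak LR" from the table
WARMUP_STEPS = 7_500    # "Warm-up Steps" from the table
TOTAL_STEPS = 125_000   # "Steps" from the table


def linear_warmup_lr(step, peak_lr=PEAK_LR, warmup_steps=WARMUP_STEPS,
                     total_steps=TOTAL_STEPS):
    """Learning rate at `step` for linear warm-up followed by linear decay.

    Assumption: the rate starts at 0, reaches `peak_lr` at `warmup_steps`,
    and decays linearly back to 0 at `total_steps`.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


if __name__ == "__main__":
    for step in (0, 3_750, 7_500, 66_250, 125_000):
        print(step, linear_warmup_lr(step))
```

Under these assumptions the rate is half the peak midway through warm-up (step 3,750) and again midway through the decay phase (step 66,250).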
## Training Data
<table style="border: 1px; border-collapse: collapse;">
<caption>Data sources and word counts by language.</caption>
<thead>
<tr>
<th>Language</th>
<th>Source</th>
<th>Words</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="4">English</td>
<td>ClinicalTrials</td>
<td>127.4M</td>
</tr>
<tr>
<td>EMEA</td>
<td>12M</td>
</tr>
<tr>
<td>PubMed</td>
<td>968.4M</td>
</tr>
<tr>
<td>MIMIC-III</td>
<td>206M</td>
</tr>
<tr>
<td rowspan="6">Spanish</td>
<td>EMEA</td>
<td>13.6M</td>
</tr>
<tr>
<td>PubMed</td>
<td>8.4M</td>
</tr>
<tr>
<td>Medical Crawler</td>
<td>918M</td>
</tr>
<tr>
<td>SPACC</td>
<td>350K</td>
</tr>
<tr>
<td>UFAL</td>
<td>10.5M</td>
</tr>
<tr>
<td>WikiMed</td>
<td>5.2M</td>
</tr>
</tbody>
</table>
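
The balance between the two languages can be read off the table by summing the word counts per language. The snippet below is a small worked example of that arithmetic, using only the figures listed above (in millions of words, with SPACC's 350K written as 0.35); the names `WORDS_MILLIONS`, `language_totals`, and `language_shares` are illustrative.

```python
# Word counts from the table above, in millions of words.
WORDS_MILLIONS = {
    "English": {
        "ClinicalTrials": 127.4,
        "EMEA": 12.0,
        "PubMed": 968.4,
        "MIMIC-III": 206.0,
    },
    "Spanish": {
        "EMEA": 13.6,
        "PubMed": 8.4,
        "Medical Crawler": 918.0,
        "SPACC": 0.35,  # 350K words
        "UFAL": 10.5,
        "WikiMed": 5.2,
    },
}


def language_totals(table):
    """Sum the per-source word counts for each language."""
    return {lang: sum(sources.values()) for lang, sources in table.items()}


def language_shares(table):
    """Fraction of the whole corpus contributed by each language."""
    totals = language_totals(table)
    grand_total = sum(totals.values())
    return {lang: total / grand_total for lang, total in totals.items()}


if __name__ == "__main__":
    for lang, total in language_totals(WORDS_MILLIONS).items():
        share = language_shares(WORDS_MILLIONS)[lang]
        print(f"{lang}: {total:.2f}M words ({share:.1%})")
```

The sums come to about 1,313.8M English words and 956.05M Spanish words, roughly a 58/42 split. Most of the Spanish portion comes from the Medical Crawler source.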
# Limitations and Bias
<p>EriBERTa is pre-trained with a masked language modeling objective, so out of the box it performs only the Fill Mask task. Although fine-tuning on downstream tasks such as Named Entity Recognition (NER) and Text Classification has been evaluated, the model should be validated and tested on each specific application before it is deployed in production.</p>
<p>Because medical and clinical corpora are scarce, EriBERTa was trained on a corpus gathered from multiple sources, including web crawling. The corpus may therefore not cover all the linguistic and contextual variation found in clinical language, and the model may perform worse on clinical subdomains or rare medical conditions that are poorly represented in the training data.</p>
## Biases
<ul>
<li><strong>Data Collection Bias:</strong> The training data for EriBERTa was collected from various sources, some of them using web crawling techniques. This method may introduce biases related to the prevalence of certain types of content, perspectives, and language usage patterns. Consequently, the model might reflect and propagate these biases in its predictions.</li>
<li><strong>Demographic and Linguistic Bias:</strong> Given that the web-sourced corpus may not equally represent all demographic groups or linguistic nuances, the model may perform disproportionately well for certain populations while underperforming for others. This could lead to disparities in the quality of clinical data processing and information retrieval across different patient groups.</li>
<li><strong>Unexamined Ethical Considerations:</strong> As of now, no comprehensive measures have been taken to systematically evaluate the ethical implications and biases embedded in EriBERTa. While we are committed to addressing these issues, the current version of the model may inadvertently perpetuate existing biases and ethical concerns inherent in the data.</li>
</ul>
## Disclaimer
<p>EriBERTa has not been designed or developed to be used as a medical device. Any output should be verified by a healthcare professional, and no direct diagnosis should be claimed. As with any language model, its predictions may be incorrect or biased.</p>
<p>We accept no liability for the use of this model. It should be fine-tuned and tested before application, and it must not be used as a medical tool or for any critical decision-making process without thorough validation and supervision by qualified professionals.</p>
# Citing information
```bibtex
@misc{delaiglesia2023eriberta,
title={{EriBERTa: A Bilingual Pre-Trained Language Model for Clinical Natural Language Processing}},
author={Iker De la Iglesia and Aitziber Atutxa and Koldo Gojenola and Ander Barrena},
year={2023},
eprint={2306.07373},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```