---
license: apache-2.0
language:
  - en
  - es
datasets:
  - HiTZ/Multilingual-Medical-Corpus
tags:
  - biomedical
  - medical
  - clinical
  - spanish
  - multilingual
widget:
- text: "Las radiologías óseas de cuerpo entero no detectan alteraciones <mask>, ni alteraciones vertebrales."
- text: "En el <mask> toraco-abdómino-pélvico no se encontraron hallazgos patológicos de interés."
- text: "Percutaneous transluminal coronary <mask> of both the LAD and Cx was done."
- text: "Clinical examination showed a non-pulsatile, painless, axial, irreducible exophthalmia with no sign of <mask> or keratitis, and right monocular blindness, right ptosis."
---

<br>
<div style="text-align: center;">
    <img src="https://huggingface.co/HiTZ/EriBERTa-base/resolve/main/eriberta_icon.png" style="height: 175px;display: block;margin-left: auto;margin-right: auto;">
</div>

<h1 style="text-align: center;">
  <b>EriBERTa</b>
<br>
  A Bilingual Pre-Trained Language Model 
<br>
  for Clinical Natural Language Processing
</h1>

<br>

<p style="text-align: justify;">
We introduce EriBERTa, a bilingual (English and Spanish) domain-specific language model pre-trained on extensive medical and clinical corpora. EriBERTa outperforms previous Spanish language models in the clinical domain, both in understanding medical texts and in extracting information from them.
It also shows promising transfer learning abilities: knowledge learned in one language carries over to the other, which is particularly valuable given the scarcity of Spanish clinical data.
</p>

- 📖 Paper: [EriBERTa: A Bilingual Pre-Trained Language Model for Clinical Natural Language Processing](https://arxiv.org/abs/2306.07373)


## How to Get Started with the Model

You can load the model using:
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("HiTZ/EriBERTa-base")
model = AutoModelForMaskedLM.from_pretrained("HiTZ/EriBERTa-base")
```
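
Since the model is trained for masked language modeling, a quick way to try it is the `fill-mask` pipeline from `transformers`. The following is a minimal sketch rather than an official usage recipe: `format_predictions` and `predict_mask` are hypothetical helper names introduced here for illustration, and the example sentence is one of the widget examples of this card.

```python
from typing import Dict, List


def format_predictions(predictions: List[Dict], top_k: int = 3) -> List[str]:
    """Render fill-mask pipeline output as 'token (score)' strings, best first."""
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [f"{p['token_str'].strip()} ({p['score']:.3f})" for p in ranked[:top_k]]


def predict_mask(text: str, top_k: int = 3) -> List[str]:
    """Run the fill-mask pipeline on `text`, which must contain exactly one <mask>."""
    # Imported lazily so format_predictions can be used without transformers installed.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model="HiTZ/EriBERTa-base")
    # For a single mask, the pipeline returns a list of dicts that include
    # the keys "score", "token_str" and "sequence".
    return format_predictions(fill_mask(text, top_k=top_k), top_k=top_k)


# Example call (downloads the model on first use):
# print(predict_mask("Percutaneous transluminal coronary <mask> of both the LAD and Cx was done."))
```

The predicted tokens are only suggestions from the pre-trained model; they should not be read as clinical statements.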


# Model Description

- **Developed by**: Iker De la Iglesia, Aitziber Atutxa, Koldo Gojenola, and Ander Barrena
- **Contact**: [Iker De la Iglesia](mailto:iker.delaiglesia@ehu.eus) and [Ander Barrena](mailto:ander.barrena@ehu.eus)
- **Language(s) (NLP)**: English, Spanish
- **License**: apache-2.0
- **Funding**: 
  - The Spanish Ministry of Science and Innovation, MCIN/AEI/ 10.13039/501100011033/FEDER projects:
    - Proyectos de Generación de Conocimiento 2022 (EDHIA PID2022-136522OB-C22) 
    - DOTT-HEALTH/PAT-MED PID2019-543106942RB-C31.
    - EU NextGeneration EU/PRTR (DeepR3 TED2021-130295B-C31, ANTIDOTE PCI2020-120717-2 EU ERA-Net CHIST-ERA).
  - Basque Government: 
    - IXA IT1570-22.


## Model Details

<table style="border: 1px; border-collapse: collapse;">
    <caption>Pre-Training settings for EriBERTa-base.</caption>
    <tbody>
        <tr>
            <td>Parameters</td>
            <td>~125M</td>
        </tr>
        <tr>
            <td>Vocabulary size</td>
            <td>64k</td>
        </tr>
        <tr>
            <td>Sequence Length</td>
            <td>512</td>
        </tr>
        <tr>
            <td>Tokens/step</td>
            <td>2M</td>
        </tr>
        <tr>
            <td>Steps</td>
            <td>125k</td>
        </tr>
        <tr>
            <td>Total Tokens</td>
            <td>4.5B</td>
        </tr>
        <tr>
            <td>Scheduler</td>
            <td>Linear with Warm-up</td>
        </tr>
        <tr>
            <td>Peak LR</td>
            <td>2.683e-4</td>
        </tr>
        <tr>
            <td>Warm-up Steps</td>
            <td>7.5k</td>
        </tr>
    </tbody>
</table>
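
The table lists the scheduler type, the peak learning rate, the warm-up steps and the total steps, but not the final learning rate. The sketch below shows how a linear schedule with warm-up could be computed from those values. It assumes the rate starts at zero and decays linearly to zero at the last step, which this card does not state; `linear_warmup_lr` is a hypothetical helper, not the training code used by the authors.

```python
PEAK_LR = 2.683e-4      # "Peak LR" from the table
WARMUP_STEPS = 7_500    # "Warm-up Steps" from the table
TOTAL_STEPS = 125_000   # "Steps" from the table


def linear_warmup_lr(
    step: int,
    peak_lr: float = PEAK_LR,
    warmup_steps: int = WARMUP_STEPS,
    total_steps: int = TOTAL_STEPS,
) -> float:
    """Learning rate at `step` for linear warm-up followed by linear decay.

    Assumption: the rate starts at 0 and ends at 0 at `total_steps`.
    """
    if step < warmup_steps:
        # Warm-up phase: ramp linearly from 0 up to the peak.
        return peak_lr * step / warmup_steps
    # Decay phase: fall linearly from the peak down to 0.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


for s in (0, 3_750, 7_500, 66_250, 125_000):
    print(s, linear_warmup_lr(s))
```

Under these assumptions the rate reaches half of its peak midway through warm-up (step 3,750) and again midway through the decay phase (step 66,250).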





## Training Data

<table style="border: 1px; border-collapse: collapse;">
    <caption>Data sources and word counts by language.</caption>
    <thead>
        <tr>
            <th>Language</th>
            <th>Source</th>
            <th>Words</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td rowspan="4">English</td>
            <td>ClinicalTrials</td>
            <td>127.4M</td>
        </tr>
        <tr>
            <td>EMEA</td>
            <td>12M</td>
        </tr>
        <tr>
            <td>PubMed</td>
            <td>968.4M</td>
        </tr>
        <tr>
            <td>MIMIC-III</td>
            <td>206M</td>
        </tr>
        <tr>
            <td rowspan="6">Spanish</td>
            <td>EMEA</td>
            <td>13.6M</td>
        </tr>
        <tr>
            <td>PubMed</td>
            <td>8.4M</td>
        </tr>
        <tr>
            <td>Medical Crawler</td>
            <td>918M</td>
        </tr>
        <tr>
            <td>SPACC</td>
            <td>350K</td>
        </tr>
        <tr>
            <td>UFAL</td>
            <td>10.5M</td>
        </tr>
        <tr>
            <td>WikiMed</td>
            <td>5.2M</td>
        </tr>
    </tbody>
</table>
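
To see how the corpus is split between the two languages, the word counts in the table can be summed per language. The snippet below is a small illustrative calculation that copies the figures from the table (in millions of words, with SPACC's 350K written as 0.35); the names `WORDS_MILLIONS`, `words_per_language` and `language_shares` are introduced here only for this example.

```python
from typing import Dict

# Word counts in millions, copied from the table above.
WORDS_MILLIONS: Dict[str, Dict[str, float]] = {
    "English": {
        "ClinicalTrials": 127.4,
        "EMEA": 12.0,
        "PubMed": 968.4,
        "MIMIC-III": 206.0,
    },
    "Spanish": {
        "EMEA": 13.6,
        "PubMed": 8.4,
        "Medical Crawler": 918.0,
        "SPACC": 0.35,
        "UFAL": 10.5,
        "WikiMed": 5.2,
    },
}


def words_per_language(corpus: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Total words (in millions) per language."""
    return {lang: sum(sources.values()) for lang, sources in corpus.items()}


def language_shares(corpus: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Fraction of the whole corpus contributed by each language."""
    totals = words_per_language(corpus)
    grand_total = sum(totals.values())
    return {lang: total / grand_total for lang, total in totals.items()}


totals = words_per_language(WORDS_MILLIONS)
shares = language_shares(WORDS_MILLIONS)
for lang in totals:
    print(f"{lang}: {totals[lang]:.2f}M words ({shares[lang]:.1%})")
```

With these figures, English accounts for roughly 1.31 billion words and Spanish for roughly 0.96 billion, most of the Spanish data coming from the Medical Crawler source.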

# Limitations and Bias

<p>EriBERTa is pre-trained with a masked language modeling objective, so out of the box it performs the Fill Mask task. Although it has been evaluated after fine-tuning on downstream tasks such as Named Entity Recognition (NER) and Text Classification, we recommend validating and testing the model on your specific application before deploying it in production.</p>

<p>Due to the scarcity of medical-clinical corpora, EriBERTa was trained on a corpus gathered from multiple sources, including web crawling. The corpus may therefore not cover all the linguistic and contextual variation found in clinical language, and the model may perform worse on clinical subdomains or rare medical conditions that are poorly represented in the training data.</p>

## Biases

<ul>
    <li><strong>Data Collection Bias:</strong> The training data for EriBERTa was collected from various sources, some of them using web crawling techniques. This method may introduce biases related to the prevalence of certain types of content, perspectives, and language usage patterns. Consequently, the model might reflect and propagate these biases in its predictions.</li>
    <li><strong>Demographic and Linguistic Bias:</strong> Given that the web-sourced corpus may not equally represent all demographic groups or linguistic nuances, the model may perform disproportionately well for certain populations while underperforming for others. This could lead to disparities in the quality of clinical data processing and information retrieval across different patient groups.</li>
    <li><strong>Unexamined Ethical Considerations:</strong> As of now, no comprehensive measures have been taken to systematically evaluate the ethical implications and biases embedded in EriBERTa. While we are committed to addressing these issues, the current version of the model may inadvertently perpetuate existing biases and ethical concerns inherent in the data.</li>
</ul>

## Disclaimer
<p>EriBERTa has not been designed or developed to be used as a medical device. Any output should be verified by a healthcare professional, and the output must not be presented as a diagnosis. Like any language model, it may produce incorrect or biased predictions.</p>
<p>We accept no liability for the use of this model. It should be fine-tuned and tested before application, and it must not be used as a medical tool or for any critical decision-making process without thorough validation and supervision by qualified professionals.</p>


# Citing information

```bibtex
@misc{delaiglesia2023eriberta,
      title={{EriBERTa: A Bilingual Pre-Trained Language Model for Clinical Natural Language Processing}}, 
      author={Iker De la Iglesia and Aitziber Atutxa and Koldo Gojenola and Ander Barrena},
      year={2023},
      eprint={2306.07373},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```