---
license: mit
language:
- fr
library_name: transformers
tags:
- Biomedical
- Medical
- French-Biomedical
Mask token:
- [MASK]
widget:
- text: "A l’admission, l’examen clinique mettait en évidence : - une hypotension artérielle avec une pression [MASK] à 6 mmHg."
  example_title: "Example 1"
- text: "Le patient a été diagnostiqué avec une [MASK] lobaire aiguë et a été traité avec des antibiotiques appropriés"
  example_title: "Example 2"
- text: "En mars 2001, le malade fut opéré, mais vu le caractère hémorragique de la tumeur, une simple biopsie surrénalienne a été réalisée ayant montré l’aspect de [MASK] malin non Hodgkinien de haut grade de malignité."
  example_title: "Example 3"
- text: "La cytologie urinaire n’a mis en évidence que des cellules [MASK] normales et l’examen cyto-bactériologique des urines était stérile."
  example_title: "Example 4"
- text: "La prise de greffe a été systématiquement réalisée au niveau de la face interne de la [MASK] afin de limiter la plaie cicatricielle."
  example_title: "Example 5"
---
# quinten-datalab/AliBERT-7GB: a pre-trained language model for French biomedical text
# Introduction
AliBERT is a pre-trained language model for French biomedical text. It is trained with a masked language modeling objective, like RoBERTa.
The main contributions of our work are:
<ul>
<li>
A French biomedical language model, a language-specific and domain-specific PLM, which can be used to represent French biomedical text for different downstream tasks.
</li>
<li>
A normalization of Unigram sub-word tokenization for French biomedical text, which improves the vocabulary and the overall performance of the trained models.
</li>
<li>
A foundation model that achieves state-of-the-art results on French biomedical text.
</li>
</ul>
The paper is available at: https://aclanthology.org/2023.bionlp-1.19/
# Data
The pre-training corpus comprises 7 GB of French biomedical documents gathered from several sources. Scientific articles were collected from ScienceDirect through an API available on subscription, keeping only French articles in the biomedical domain. Summaries of thesis manuscripts were collected from the "Système universitaire de documentation (SuDoc)", the catalog of the university documentation system. Short texts and some complete sentences were collected from the public drug database, which lists the characteristics of tens of thousands of drugs. Finally, a similar drug database, the "Résumé des Caractéristiques du Produit (RCP)", provides descriptions of medications intended for use by biomedical professionals.
# How to use quinten-datalab/AliBERT-7GB with HuggingFace
Load the quinten-datalab/AliBERT-7GB fill-mask model and the tokenizer used to train AliBERT:
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline
# Load the tokenizer and the model with its masked-language-modeling head
tokenizer = AutoTokenizer.from_pretrained("quinten-datalab/AliBERT-7GB")
model = AutoModelForMaskedLM.from_pretrained("quinten-datalab/AliBERT-7GB")
# Build a fill-mask pipeline and predict candidates for the [MASK] token
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
nlp_AliBERT = fill_mask("La prise de greffe a été systématiquement réalisée au niveau de la face interne de la [MASK] afin de limiter la plaie cicatricielle.")
# nlp_AliBERT then contains:
[{'score': 0.7724128365516663,
'token': 6749,
'token_str': 'cuisse',
'sequence': 'La prise de greffe a été systématiquement réalisée au niveau de la face interne de la cuisse afin de limiter la plaie cicatricielle.'},
{'score': 0.09472355246543884,
'token': 4915,
'token_str': 'jambe',
'sequence': 'La prise de greffe a été systématiquement réalisée au niveau de la face interne de la jambe afin de limiter la plaie cicatricielle.'},
{'score': 0.03340734913945198,
'token': 2050,
'token_str': 'main',
'sequence': 'La prise de greffe a été systématiquement réalisée au niveau de la face interne de la main afin de limiter la plaie cicatricielle.'},
{'score': 0.030924487859010696,
'token': 844,
'token_str': 'face',
'sequence': 'La prise de greffe a été systématiquement réalisée au niveau de la face interne de la face afin de limiter la plaie cicatricielle.'},
{'score': 0.012518334202468395,
'token': 3448,
'token_str': 'joue',
'sequence': 'La prise de greffe a été systématiquement réalisée au niveau de la face interne de la joue afin de limiter la plaie cicatricielle.'}]
```
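The pipeline returns a list of dictionaries, each with `score`, `token`, `token_str` and `sequence` keys, as in the output above. As a minimal sketch of how such a result might be consumed, the hypothetical helper `confident_tokens` below (it is not part of `transformers`) keeps only the candidates whose score reaches a threshold. The sample list is an abridged copy of the output above with rounded scores, and the 0.05 threshold is an arbitrary assumption.

```python
def confident_tokens(predictions, min_score=0.05):
    """Return (token_str, score) pairs whose score is at least min_score.

    `predictions` is expected to look like the fill-mask output above:
    a list of dicts with at least 'token_str' and 'score' keys.
    """
    return [
        (p["token_str"], p["score"])
        for p in predictions
        if p["score"] >= min_score
    ]


# Abridged copy of the output shown above (scores rounded, 'sequence' omitted).
sample = [
    {"score": 0.7724, "token": 6749, "token_str": "cuisse"},
    {"score": 0.0947, "token": 4915, "token_str": "jambe"},
    {"score": 0.0334, "token": 2050, "token_str": "main"},
    {"score": 0.0309, "token": 844, "token_str": "face"},
    {"score": 0.0125, "token": 3448, "token_str": "joue"},
]

kept = confident_tokens(sample)
print(kept)
```

With the default threshold, only `cuisse` and `jambe` are kept; passing the real `nlp_AliBERT` list instead of `sample` should work the same way.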
# Metrics and results
The model has been evaluated on the following downstream tasks.
## Biomedical Named Entity Recognition (NER)
The model is evaluated on two publicly available French biomedical datasets, CAS and QUAERO.
#### CAS dataset
<style type="text/css">
.tg {border-collapse:collapse;border-spacing:0;}
.tg .tg-baqh{text-align:center;vertical-align:top}
.tg .tg-0lax{text-align:center;vertical-align:top}
</style>
<table class="tg">
<thead>
<tr>
<th>Models</th>
<th class="tg-0lax" colspan="3">CamemBERT</th>
<th class="tg-0lax" colspan="3">AliBERT</th>
<th class="tg-0lax" colspan="3">DrBERT</th>
</tr>
</thead>
<tbody>
<tr>
<td>Entities</td><td>P</td><td>R</td><td>F1</td><td>P</td><td>R</td><td>F1</td><td>P</td><td>R</td><td>F1</td>
</tr>
<tr>
<td>Substance</td><td>0.96</td><td>0.87</td><td>0.91</td><td>0.96</td><td>0.91</td><td>0.93</td><td>0.83</td><td>0.83</td><td>0.82</td>
</tr>
<tr>
<td>Symptom</td> <td>0.89</td> <td>0.91</td> <td>0.90</td> <td>0.96</td> <td>0.98</td> <td>0.97</td> <td>0.93</td> <td>0.90</td> <td>0.91</td>
</tr>
<tr>
<td>Anatomy</td> <td>0.94</td> <td>0.91</td> <td>0.88</td> <td>0.97</td> <td>0.97</td> <td>0.98</td> <td>0.92</td> <td>0.93</td> <td>0.93</td>
</tr>
<tr>
<td>Value</td> <td>0.88</td> <td>0.46</td> <td>0.60</td> <td>0.98</td> <td>0.99</td> <td>0.98</td> <td>0.91</td> <td>0.91</td> <td>0.91</td>
</tr>
<tr>
<td>Pathology</td> <td>0.79</td> <td>0.70</td> <td>0.74</td> <td>0.81</td> <td>0.39</td> <td>0.52</td> <td>0.85</td> <td>0.57</td> <td>0.68</td>
</tr>
<tr>
<td>Macro Avg</td> <td>0.89 </td> <td>0.79</td> <td>0.81</td> <td> 0.94</td> <td>0.85</td> <td>0.88</td> <td> 0.92</td> <td> 0.87</td> <td>0.89</td>
</tr>
</tbody>
</table>
Table 1: NER performance on the CAS dataset
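
In Table 1, P, R and F1 stand for precision, recall and F1 score, where F1 is the harmonic mean of precision and recall. The Macro Avg row for F1 appears to be the unweighted mean over the five entity types. The sketch below recomputes two CamemBERT cells from the values in Table 1. The function names are illustrative, and because the table reports rounded figures, other cells may differ from a recomputation in the last digit.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_average(values):
    """Unweighted mean over per-entity scores."""
    return sum(values) / len(values)


# CamemBERT, 'Value' entity in Table 1: P = 0.88, R = 0.46
value_f1 = f1_score(0.88, 0.46)

# CamemBERT F1 per entity in Table 1:
# Substance, Symptom, Anatomy, Value, Pathology
camembert_f1 = [0.91, 0.90, 0.88, 0.60, 0.74]
macro_f1 = macro_average(camembert_f1)

print(f"Value F1: {value_f1:.2f}")   # 0.60, as reported in the table
print(f"Macro F1: {macro_f1:.2f}")   # 0.81, as reported in the table
```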
#### QUAERO dataset
<table class="tg">
<thead>
<tr>
<th>Models</th>
<th class="tg-0lax" colspan="3">CamemBERT</th>
<th class="tg-0lax" colspan="3">AliBERT</th>
<th class="tg-0lax" colspan="3">DrBERT</th>
</tr>
</thead>
<tbody>
<tr>
<td>Entity </td> <td> P </td> <td> R </td> <td> F1 </td> <td> P </td> <td> R </td> <td> F1 </td> <td> P </td> <td> R </td> <td> F1 </td>
</tr>
<tr>
<td>Anatomy </td> <td> 0.649 </td> <td> 0.641 </td> <td> 0.645 </td> <td> 0.795 </td> <td> 0.811 </td> <td> 0.803 </td> <td> 0.736 </td> <td> 0.844 </td> <td> 0.824 </td>
</tr>
<tr>
<td>Chemical </td> <td> 0.844 </td> <td> 0.847 </td> <td> 0.846 </td> <td> 0.878 </td> <td> 0.893 </td> <td> 0.885 </td> <td> 0.505 </td> <td> 0.823 </td> <td> 0.777 </td>
</tr>
<tr>
<td>Device </td> <td> 0.000 </td> <td> 0.000 </td> <td> 0.000 </td> <td> 0.506 </td> <td> 0.356 </td> <td> 0.418 </td> <td> 0.939 </td> <td> 0.237 </td> <td> 0.419 </td>
</tr>
<tr>
<td>Disorder </td> <td> 0.772 </td> <td> 0.818 </td> <td> 0.794 </td> <td> 0.857 </td> <td> 0.843 </td> <td> 0.850 </td> <td> 0.883 </td> <td> 0.809 </td> <td> 0.845 </td>
</tr>
<tr>
<td>Procedure </td> <td> 0.880 </td> <td> 0.894 </td> <td> 0.887 </td> <td> 0.969 </td> <td> 0.967 </td> <td> 0.968 </td> <td> 0.944 </td> <td> 0.976 </td> <td> 0.960 </td>
</tr>
<tr>
<td>Macro Avg </td> <td> 0.655 </td> <td> 0.656 </td> <td> 0.655 </td> <td> 0.807 </td> <td> 0.783 </td> <td> 0.793 </td> <td> 0.818 </td> <td> 0.755 </td> <td> 0.782 </td>
</tr>
</tbody>
</table>
Table 2: NER performance on the QUAERO dataset
## AliBERT: A Pre-trained Language Model for French Biomedical Text