Quinten Datalab committed
Commit 8c04e1c • Parent(s): db24c7e

Update README.md

NER results for CAS dataset updated.
README.md
CHANGED
@@ -32,19 +32,21 @@ AliBERT: is a pre-trained language model for French biomedical text. It is train
 Here are the main contributions of our work:
 A French biomedical language model, a language-specific and domain-specific PLM, which can be used to represent French biomedical text for different downstream tasks.
 A normalization of a Unigram sub-word tokenization of French biomedical textual input, which improves our vocabulary and the overall performance of the trained models.
-
+It is a foundation model that achieved state-of-the-art results on French biomedical text.
+
+The paper can be found here: https://aclanthology.org/2023.bionlp-1.19/
 
 # Data
 The pre-training corpus was gathered from different sub-corpora. It is composed of 7 GB of French biomedical textual documents. Here are the sources used.
 
 |Dataset name| Quantity| Size |
 |----|---|---|
-|Drug
-|RCP| 35K| 2200Mb|
-|Articles| 500K| 4300Mb |
-|Thesis| 300K| 300Mb |
-|Cochrane| 7.6K| 27Mb|
-
+|Drug leaflets (Base de données publique des médicaments)| 23K| 550Mb |
+|RCP (a French equivalent of the Physician's Desk Reference)| 35K| 2200Mb|
+|Articles (biomedical articles from ScienceDirect)| 500K| 4300Mb |
+|Thesis (thesis manuscripts in French)| 300K| 300Mb |
+|Cochrane (articles from the Cochrane database)| 7.6K| 27Mb|
+*Table 1: Pretraining dataset*
 
 # How to use alibert-quinten/Oncology-NER with HuggingFace
 
@@ -86,6 +88,17 @@ nlp_AliBERT=fill_mask("La prise de greffe a été systématiquement réalisée a
 The model has been evaluated on the following downstream tasks.
 
 ## Biomedical Named Entity Recognition (NER)
-
-
-
+The model is evaluated on two publicly available French biomedical text corpora, CAS and QUAERO.
+#### CAS dataset
+|Models | CamemBERT| | | AliBERT | | | AliBERT-ELECTRA | | |
+|:-----:|:--------:|:-:|:-:|:-------:|:-:|:-:|:---------------:|:-:|:-:|
+|Entities| P | R | F1 | P | R | F1 | P | R | F1 |
+|Substance| **0.96** | 0.87 | 0.91 | **0.96** | **0.91** | **0.93** | 0.95 | 0.91 | 0.93 |
+|Symptom | 0.89 | 0.91 | 0.90 | **0.96** | **0.98** | **0.97** | 0.94 | **0.98** | 0.96 |
+|Anatomy | 0.94 | 0.91 | 0.88 | **0.97** | **0.97** | **0.98** | 0.96 | **0.97** | 0.96 |
+|Value | 0.88 | 0.46 | 0.60 | **0.98** | **0.99** | **0.98** | 0.93 | 0.93 | 0.93 |
+|Pathology | 0.79 | **0.70** | **0.74** | **0.81** | 0.39 | 0.52 | 0.85 | 0.57 | 0.68 |
+|Macro Avg | 0.89 | 0.79 | 0.81 | **0.94** | 0.85 | 0.88 | 0.92 | **0.87** | **0.89** |
+*Table 2: NER performances on CAS*
+
+## AliBERT: A Pre-trained Language Model for French Biomedical Text
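The diff's usage section shows a `fill_mask` call in its hunk context; as a hedged sketch of the companion NER usage, the checkpoint named in the "How to use" heading can presumably be loaded through the standard `transformers` token-classification pipeline. The example sentence is illustrative and not taken from the model card, and the availability of the hosted checkpoint is an assumption.

```python
from transformers import pipeline

# Token-classification pipeline for the checkpoint named in the README heading.
# aggregation_strategy="simple" merges word-piece predictions into entity spans.
ner = pipeline(
    "token-classification",
    model="alibert-quinten/Oncology-NER",
    aggregation_strategy="simple",
)

# Illustrative French clinical sentence (not from the model card).
for ent in ner("Le patient présente une douleur abdominale après la prise de paracétamol."):
    print(ent["entity_group"], ent["word"], round(float(ent["score"]), 2))
```

Each returned dict carries the aggregated entity label, the surface text of the span, and the model's confidence.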
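The Macro Avg row in the CAS table is the unweighted mean of the per-entity scores, so a rare class such as Pathology weighs as much as a frequent one. A minimal sketch, using the CamemBERT precision column copied from the table:

```python
# Per-entity precision for CamemBERT on CAS, copied from the table above.
camembert_precision = {
    "Substance": 0.96,
    "Symptom": 0.89,
    "Anatomy": 0.94,
    "Value": 0.88,
    "Pathology": 0.79,
}

# Macro average: unweighted mean over entity classes.
macro_p = sum(camembert_precision.values()) / len(camembert_precision)
print(round(macro_p, 2))  # → 0.89, matching the Macro Avg row
```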
|