Quinten Datalab committed
Commit 8c04e1c • Parent(s): db24c7e

Update README.md

NER results for CAS dataset updated.
README.md
CHANGED
@@ -32,19 +32,21 @@ AliBERT: is a pre-trained language model for French biomedical text. It is train
 Here are the main contributions of our work:
 A French biomedical language model, a language-specific and domain-specific PLM, which can be used to represent French biomedical text for different downstream tasks.
 A normalization of a Unigram sub-word tokenization of French biomedical textual input, which improves our vocabulary and the overall performance of the trained models.
-
+It is a foundation model that achieved state-of-the-art results on French biomedical text.
+
+The paper can be found here: https://aclanthology.org/2023.bionlp-1.19/
 
 # Data
 The pre-training corpus was gathered from different sub-corpora. It is composed of 7 GB of French biomedical textual documents. Here are the sources used.
 
 |Dataset name| Quantity| Size |
 |----|---|---|
-|Drug
-|RCP| 35K| 2200Mb|
-|Articles| 500K| 4300Mb |
-|Thesis| 300K| 300Mb |
-|Cochrane| 7.6K| 27Mb|
-
+|Drug leaflets (Base de données publique des médicaments)| 23K| 550Mb |
+|RCP (a French equivalent of the Physician's Desk Reference)| 35K| 2200Mb|
+|Articles (biomedical articles from ScienceDirect)| 500K| 4300Mb |
+|Thesis (thesis manuscripts in French)| 300K| 300Mb |
+|Cochrane (articles from the Cochrane database)| 7.6K| 27Mb|
+*Table 1: Pretraining dataset*
 
 # How to use alibert-quinten/Oncology-NER with HuggingFace
 
@@ -86,6 +88,17 @@ nlp_AliBERT=fill_mask("La prise de greffe a été systématiquement réalisée a
 The model has been evaluated on the following downstream tasks.
 
 ## Biomedical Named Entity Recognition (NER)
-
-
-
+The model is evaluated on two publicly available French biomedical text corpora, CAS and QUAERO.
+#### CAS dataset
+|Models | CamemBERT| | | AliBERT | | | AliBERT-ELECTRA | | |
+|:-----:|:--------:|:-:|:-:|:-------:|:-:|:-:|:---------------:|:-:|:-:|
+|Entities| P | R | F1 | P | R | F1 | P | R | F1 |
+|Substance| **0.96** | 0.87 | 0.91 | **0.96** | **0.91** | **0.93** | 0.95 | 0.91 | 0.93 |
+|Symptom | 0.89 | 0.91 | 0.90 | **0.96** | **0.98** | **0.97** | 0.94 | **0.98** | 0.96 |
+|Anatomy | 0.94 | 0.91 | 0.88 | **0.97** | **0.97** | **0.98** | 0.96 | **0.97** | 0.96 |
+|Value | 0.88 | 0.46 | 0.60 | **0.98** | **0.99** | **0.98** | 0.93 | 0.93 | 0.93 |
+|Pathology | 0.79 | **0.70** | **0.74** | **0.81** | 0.39 | 0.52 | 0.85 | 0.57 | 0.68 |
+|Macro Avg | 0.89 | 0.79 | 0.81 | **0.94** | 0.85 | 0.88 | 0.92 | **0.87** | **0.89** |
+*Table 2: NER performances on CAS*
+
+## AliBERT: A Pre-trained Language Model for French Biomedical Text
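The diff's usage section shows a `fill_mask` call in its hunk context; as a hedged sketch of the companion NER usage, the checkpoint named in the "How to use" heading can presumably be loaded through the standard `transformers` token-classification pipeline. The example sentence is illustrative and not taken from the model card, and the availability of the hosted checkpoint is an assumption.

```python
from transformers import pipeline

# Token-classification pipeline for the checkpoint named in the README heading.
# aggregation_strategy="simple" merges word-piece predictions into entity spans.
ner = pipeline(
    "token-classification",
    model="alibert-quinten/Oncology-NER",
    aggregation_strategy="simple",
)

# Illustrative French clinical sentence (not from the model card).
for ent in ner("Le patient présente une douleur abdominale après la prise de paracétamol."):
    print(ent["entity_group"], ent["word"], round(float(ent["score"]), 2))
```

Each returned dict carries the aggregated entity label, the surface text of the span, and the model's confidence.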
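The Macro Avg row in the CAS table is the unweighted mean of the per-entity scores, so a rare class such as Pathology weighs as much as a frequent one. A minimal sketch, using the CamemBERT precision column copied from the table:

```python
# Per-entity precision for CamemBERT on CAS, copied from the table above.
camembert_precision = {
    "Substance": 0.96,
    "Symptom": 0.89,
    "Anatomy": 0.94,
    "Value": 0.88,
    "Pathology": 0.79,
}

# Macro average: unweighted mean over entity classes.
macro_p = sum(camembert_precision.values()) / len(camembert_precision)
print(round(macro_p, 2))  # → 0.89, matching the Macro Avg row
```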
|