Quinten Datalab committed
Commit 8c04e1c
1 parent: db24c7e

Update README.md


NER results for CAS dataset updated.

Files changed (1)
  1. README.md +23 -10
README.md CHANGED
@@ -32,19 +32,21 @@ AliBERT: is a pre-trained language model for French biomedical text. It is train
  Here are the main contributions of our work:
  A French biomedical language model, a language-specific and domain-specific PLM, which can be used to represent French biomedical text for different downstream tasks.
  A normalization of a Unigram sub-word tokenization of French biomedical textual input which improves our vocabulary and the overall performance of the trained models.
- AliBERT outperforms other French PLMs in different downstream tasks. It is a foundation model that achieved state-of-the-art results on French biomedical text.
+ It is a foundation model that achieved state-of-the-art results on French biomedical text.
+
+ The paper can be found here: https://aclanthology.org/2023.bionlp-1.19/

  # Data
  The pre-training corpus was gathered from different sub-corpora. It is composed of 7 GB of French biomedical textual documents. Here are the sources used.

  |Dataset name| Quantity| Size |
  |----|---|---|
- |Drug database| 23K| 550 MB |
- |RCP| 35K| 2,200 MB|
- |Articles| 500K| 4,300 MB |
- |Thesis| 300K| 300 MB |
- |Cochrane| 7.6K| 27 MB|
-
+ |Drug leaflets (Base de données publique des médicaments)| 23K| 550 MB |
+ |RCP (a French equivalent of the Physician's Desk Reference)| 35K| 2,200 MB|
+ |Articles (biomedical articles from ScienceDirect)| 500K| 4,300 MB |
+ |Thesis (thesis manuscripts in French)| 300K| 300 MB |
+ |Cochrane (articles from the Cochrane database)| 7.6K| 27 MB|
+ *Table 1: Pretraining dataset*

  # How to use alibert-quinten/Oncology-NER with HuggingFace

@@ -86,6 +88,17 @@ nlp_AliBERT=fill_mask("La prise de greffe a été systématiquement réalisée a
  The model has been evaluated in the following downstream tasks.

  ## Biomedical Named Entity Recognition (NER)
-
- ##
- AliBERT: A Pre-trained Language Model for French Biomedical Text
+ The model is evaluated on two publicly available French biomedical corpora, CAS and QUAERO.
+ #### CAS dataset
+ |Models | CamemBERT| | | AliBERT | | | AliBERT-ELECTRA | | |
+ |:-----:|:--------:|:-:|:-:|:-------:|:-:|:-:|:---------------:|:-:|:-:|
+ |Entities| P | R | F1 | P | R | F1 | P | R | F1 |
+ |Substance| **0.96** | 0.87 | 0.91 | **0.96** | **0.91**| **0.93** | 0.95 | 0.91 |0.93|
+ |Symptom | 0.89 | 0.91 | 0.90 | **0.96** | **0.98** | **0.97**| 0.94 | **0.98** | 0.96|
+ |Anatomy | 0.94 | 0.91 | 0.88 | **0.97**| **0.97**| **0.98**| 0.96 | **0.97**| 0.96 |
+ |Value | 0.88 | 0.46 | 0.60 | **0.98**| **0.99**| **0.98**| 0.93 | 0.93 | 0.93|
+ |Pathology | 0.79 | **0.70**| **0.74**| **0.81**| 0.39 | 0.52 | 0.85 | 0.57 | 0.68|
+ |Macro Avg | 0.89 | 0.79 | 0.81 | **0.94**| 0.85 | 0.88 | 0.92 | **0.87**| **0.89**|
+ *Table 2: NER performances on CAS*
+
+ ## AliBERT: A Pre-trained Language Model for French Biomedical Text
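
As a quick sanity check on the figures this commit adds: the sub-corpus sizes in Table 1 should sum to roughly the 7 GB quoted in the Data section, and the Macro Avg row of Table 2 should be the unweighted mean of the five per-entity scores (checked below for the AliBERT columns only; values transcribed from the tables above, not an official script):

```python
# Sanity checks on the figures in this commit (values transcribed from
# Tables 1 and 2 above).

# Table 1: pretraining sub-corpora sizes, in MB.
sizes_mb = {"Drug leaflets": 550, "RCP": 2200, "Articles": 4300,
            "Thesis": 300, "Cochrane": 27}
total_gb = sum(sizes_mb.values()) / 1024
print(round(total_gb, 1))  # ~7.2, consistent with the "7 GB" in # Data

# Table 2: AliBERT per-entity (P, R, F1) on the CAS dataset.
alibert = {
    "Substance": (0.96, 0.91, 0.93),
    "Symptom":   (0.96, 0.98, 0.97),
    "Anatomy":   (0.97, 0.97, 0.98),
    "Value":     (0.98, 0.99, 0.98),
    "Pathology": (0.81, 0.39, 0.52),
}
# Unweighted mean over the five entity types, column by column.
macro = [round(sum(col) / len(alibert), 2) for col in zip(*alibert.values())]
print(macro)  # [0.94, 0.85, 0.88] -- matches the AliBERT Macro Avg row

# F1 should be the harmonic mean of P and R, e.g. for Substance:
f1_substance = round(2 * 0.96 * 0.91 / (0.96 + 0.91), 2)
print(f1_substance)  # 0.93, matches the table
```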