quinten-datalab/AliBERT-7GB

@@ -5,7 +5,7 @@ language:
 library_name: transformers
 tags:
 - Biomedical
-- medical
+- Medical
 - French-Biomedical
 Mask token:
 - [MASK]
@@ -21,4 +21,70 @@ widget:
 - text: "La prise de greffe a été systématiquement réalisée au niveau de la face interne de la [MASK] afin de limiter la plaie cicatricielle."
   example_title: "Example 5"
 ---
+# quinten-datalab/AliBERT-7GB
+# Introduction
+AliBERT is a pre-trained language model for French biomedical text. It is trained with a masked language modeling objective, like RoBERTa.
+Here are the main contributions of our work:
+- A French biomedical language model: a language-specific and domain-specific pre-trained language model (PLM) that can be used to represent French biomedical text for different downstream tasks.
+- A normalization step applied to the Unigram sub-word tokenization of French biomedical text, which improves the vocabulary and the overall performance of the trained models.
+- AliBERT outperforms other French PLMs on different downstream tasks. It is a foundation model that achieved state-of-the-art results on French biomedical text.
+# Data
+The pre-training corpus was gathered from different sub-corpora. It is composed of 7 GB of French biomedical documents. The sources are listed below.
+| Dataset name | Quantity | Size |
+|---|---|---|
+| Drug database | 23K | 550 MB |
+| RCP | 35K | 2200 MB |
+| Articles | 500K | 4300 MB |
+| Theses | 300K | 300 MB |
+| Cochrane | 7.6K | 27 MB |
+# How to use quinten-datalab/AliBERT-7GB with HuggingFace
+Load the quinten-datalab/AliBERT-7GB fill-mask model and the tokenizer used to train AliBERT:
+```python
+from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline
+# Load the tokenizer and the model with its masked-language-modeling head
+tokenizer = AutoTokenizer.from_pretrained("quinten-datalab/AliBERT-7GB")
+model = AutoModelForMaskedLM.from_pretrained("quinten-datalab/AliBERT-7GB")
+# Build a fill-mask pipeline and predict the masked token
+fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
+nlp_AliBERT = fill_mask("La prise de greffe a été systématiquement réalisée au niveau de la face interne de la [MASK] afin de limiter la plaie cicatricielle.")
+print(nlp_AliBERT)
+# Output:
+[{'score': 0.7724128365516663,
+  'token': 6749,
+  'token_str': 'cuisse',
+  'sequence': 'La prise de greffe a été systématiquement réalisée au niveau de la face interne de la cuisse afin de limiter la plaie cicatricielle.'},
+ {'score': 0.09472355246543884,
+  'token': 4915,
+  'token_str': 'jambe',
+  'sequence': 'La prise de greffe a été systématiquement réalisée au niveau de la face interne de la jambe afin de limiter la plaie cicatricielle.'},
+ {'score': 0.03340734913945198,
+  'token': 2050,
+  'token_str': 'main',
+  'sequence': 'La prise de greffe a été systématiquement réalisée au niveau de la face interne de la main afin de limiter la plaie cicatricielle.'},
+ {'score': 0.030924487859010696,
+  'token': 844,
+  'token_str': 'face',
+  'sequence': 'La prise de greffe a été systématiquement réalisée au niveau de la face interne de la face afin de limiter la plaie cicatricielle.'},
+ {'score': 0.012518334202468395,
+  'token': 3448,
+  'token_str': 'joue',
+  'sequence': 'La prise de greffe a été systématiquement réalisée au niveau de la face interne de la joue afin de limiter la plaie cicatricielle.'}]
+```
+## Metrics and results
+The model has been evaluated on the following downstream tasks:
+## Biomedical Named Entity Recognition (NER)
+##
 AliBERT: A Pre-trained Language Model for French Biomedical Text
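
The fill-mask call in the README returns a list of dictionaries with `score`, `token`, `token_str`, and `sequence` keys. As a minimal sketch of how that output might be post-processed, the helper below keeps only the candidate tokens above a score threshold. The function name `top_tokens` and the `min_score` parameter are hypothetical and not part of the model card or of `transformers`. The sample data is copied from the README output with scores rounded and the `sequence` field omitted, so the snippet runs without downloading the model.

```python
from typing import Dict, List, Union

Prediction = Dict[str, Union[float, int, str]]


def top_tokens(predictions: List[Prediction], min_score: float = 0.05) -> List[str]:
    """Return candidate tokens scoring at least `min_score`, best first.

    Hypothetical helper for illustration; it only assumes each prediction
    has the `score` and `token_str` keys shown in the README output.
    """
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [p["token_str"] for p in ranked if p["score"] >= min_score]


# Sample copied from the README output (scores rounded, `sequence` omitted)
sample = [
    {"score": 0.7724, "token": 6749, "token_str": "cuisse"},
    {"score": 0.0947, "token": 4915, "token_str": "jambe"},
    {"score": 0.0334, "token": 2050, "token_str": "main"},
    {"score": 0.0309, "token": 844, "token_str": "face"},
    {"score": 0.0125, "token": 3448, "token_str": "joue"},
]

print(top_tokens(sample))  # ['cuisse', 'jambe']
```

With the real pipeline, the same helper could be applied directly to the list returned by `fill_mask(...)`, since it only reads the `score` and `token_str` fields.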