---
license: mit
language:
- fr
metrics:
- seqeval
library_name: transformers
pipeline_tag: token-classification
tags:
- medical
- biomedical
- medkit-lib
widget:
- text: >-
    La radiographie et la tomodensitométrie ont montré des micronodules diffus
  example_title: example 1
- text: >-
    Elle souffre d'asthme mais n'a pas besoin d'Allegra
  example_title: example 2
---
# DrBERT-CASM2
## Model description
**DrBERT-CASM2** is a French Named Entity Recognition model fine-tuned from
[DrBERT](https://huggingface.co/Dr-BERT/DrBERT-4GB-CP-PubMedBERT), a pretrained model in French for the biomedical and clinical domains.
It was trained with the medkit Trainer to detect three entity types: **problem**, **treatment** and **test**.
- **Fine-tuned using** medkit [GitHub Repo](https://github.com/TeamHeka/medkit)
- **Developed by** @camila-ud, medkit, HeKA Research team
- **Dataset source**
Annotated version from @aneuraz called 'corpusCasM2: A corpus of annotated clinical texts'
- The annotation was performed collaboratively by master's students from Université Paris Cité.
- The corpus contains documents from CAS:
```
Natalia Grabar, Vincent Claveau, and Clément Dalloux. 2018. CAS: French Corpus with Clinical Cases.
In Proceedings of the Ninth International Workshop on Health Text Mining and Information Analysis,
pages 122–128, Brussels, Belgium. Association for Computational Linguistics.
```
# Intended uses & limitations
## Limitations and bias
This model was trained for **development and test phases**.
This model is limited by its training dataset, and it should be used with caution.
The results are not guaranteed, and the model should be used only in data exploration stages.
The model may be able to detect entities in the early stages of the analysis of medical documents in French.
The maximum sequence length was reduced to **128 tokens** to minimize training time.
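Because training used short sequences, one cautious option is to feed the model sentence-sized pieces rather than whole documents. The following is a minimal sketch of such a pre-processing step, not part of medkit: `split_into_chunks` is a hypothetical helper, the sentence splitter is a naive regex, and whitespace-separated words are only a crude stand-in for model tokens (a subword tokenizer usually produces more tokens than words).

```python
import re

MAX_WORDS = 128  # assumption: whitespace words as a rough proxy for model tokens


def split_into_chunks(text: str, max_words: int = MAX_WORDS) -> list[str]:
    """Split text into sentences, then cut any sentence longer than max_words."""
    # Naive sentence split on '.', '!' or '?' followed by whitespace (illustrative only).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks = []
    for sentence in sentences:
        words = sentence.split()
        # Slice over-long sentences into consecutive windows of max_words words.
        for start in range(0, len(words), max_words):
            chunks.append(" ".join(words[start:start + max_words]))
    return chunks


text = "Elle souffre d'asthme. La radiographie a montré des micronodules diffus."
chunks = split_into_chunks(text)
print(chunks)
```

Each chunk could then be wrapped in its own document before entity detection. For an exact token count, the model's own tokenizer would be the more reliable measure.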
# How to use
## Install medkit
First of all, please install medkit with the following command:
```
pip install 'medkit-lib[optional]'
```
Please check the [documentation](https://medkit.readthedocs.io/en/latest/user_guide/install.html) for more info and examples.
## Using the model
```python
from medkit.core.text import TextDocument
from medkit.text.ner.hf_entity_matcher import HFEntityMatcher

# Load the fine-tuned model from the Hugging Face Hub
matcher = HFEntityMatcher(model="medkit/DrBERT-CASM2")

test_doc = TextDocument("Elle souffre d'asthme mais n'a pas besoin d'Allegra")
# Run the matcher on the raw segment, which covers the full text of the document
detected_entities = matcher.run([test_doc.raw_segment])

# Show the label and text of each detected entity
msg = "|".join(f"'{entity.label}':{entity.text}" for entity in detected_entities)
print(f'Text: "{test_doc.text}"\n{msg}')
```
```
Text: "Elle souffre d'asthme mais n'a pas besoin d'Allegra"
'problem':asthme|'treatment':Allegra
```
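For downstream exploration it can be handy to collect the detected entities by label. The sketch below relies only on the `label` and `text` attributes used in the example above. `StubEntity` and `group_by_label` are hypothetical names introduced for illustration, and the stub list stands in for `detected_entities` so the snippet runs without downloading the model.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class StubEntity:
    """Hypothetical stand-in exposing the two attributes used in the example."""
    label: str
    text: str


def group_by_label(entities):
    """Return {label: [entity text, ...]}, preserving detection order."""
    grouped = defaultdict(list)
    for entity in entities:
        grouped[entity.label].append(entity.text)
    return dict(grouped)


# Stand-in for `detected_entities`, mirroring the output shown above.
stub_entities = [StubEntity("problem", "asthme"), StubEntity("treatment", "Allegra")]
grouped = group_by_label(stub_entities)
print(grouped)
```

With real results, `group_by_label(detected_entities)` should work the same way, since it only reads `label` and `text`.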
# Training data
This model was fine-tuned on **CASM2**, an internal corpus of clinical cases (in French) annotated by master's students.
The corpus contains more than 9,000 medkit documents (each roughly one sentence) with entities to detect.
**Number of documents (~ sentences) per split**
| Split | # medkit docs |
| ---------- | ------------- |
| Train | 5824 |
| Validation | 1457 |
| Test | 1821 |
**Number of examples per entity type**
| Split | treatment | test | problem |
| ---------- | --------- | ---- | ------- |
| Train | 3258 | 3990 | 6808 |
| Validation | 842 | 1007 | 1745 |
| Test | 994 | 1289 | 2113 |
## Training procedure
This model was fine-tuned on CPU using the medkit trainer; training takes about 3 hours.
# Model performance
Model performance computed on the CASM2 test dataset (using the medkit seqeval evaluator):
| Entity    | precision | recall | f1     |
| --------- | --------- | ------ | ------ |
| treatment | 0.7492    | 0.7666 | 0.7578 |
| test      | 0.7449    | 0.8240 | 0.7824 |
| problem   | 0.6884    | 0.7304 | 0.7088 |
| Overall   | 0.7188    | 0.7660 | 0.7416 |
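As a sanity check on how the columns relate, F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The sketch below recomputes it from the reported precision and recall. Small differences in the last digit are expected because the inputs are already rounded to four decimals.

```python
# (precision, recall, reported f1) copied from the table above.
scores = {
    "treatment": (0.7492, 0.7666, 0.7578),
    "test": (0.7449, 0.8240, 0.7824),
    "problem": (0.6884, 0.7304, 0.7088),
    "Overall": (0.7188, 0.7660, 0.7416),
}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


for entity, (precision, recall, reported) in scores.items():
    computed = f1_score(precision, recall)
    print(f"{entity}: computed F1 = {computed:.4f}, reported F1 = {reported:.4f}")
```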
## How to evaluate using medkit
```python
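from medkit.text.metrics.ner import SeqEvalEvaluator
from medkit.text.ner.hf_entity_matcher import HFEntityMatcher

# `test_documents` is assumed to be a list of TextDocument objects with reference entities
# load the matcher and get predicted entities by document
matcher = HFEntityMatcher(model="medkit/DrBERT-CASM2")
predicted_entities = [matcher.run([doc.raw_segment]) for doc in test_documents]

# evaluate character by character using the IOB2 tagging scheme
evaluator = SeqEvalEvaluator(tagging_scheme="iob2")
evaluator.compute(test_documents, predicted_entities=predicted_entities)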
```
To evaluate by tokens instead of characters, pass the Hugging Face tokenizer to the evaluator:
```python
from transformers import AutoTokenizer
tokenizer_drbert = AutoTokenizer.from_pretrained("medkit/DrBERT-CASM2", use_fast=True)
evaluator = SeqEvalEvaluator(tokenizer=tokenizer_drbert,tagging_scheme="iob2")
evaluator.compute(test_documents,predicted_entities=predicted_entities)
```
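To make the difference between character-level and token-level evaluation concrete, here is a minimal sketch of IOB2 tagging, in which the first unit of an entity gets `B-<label>`, the following units get `I-<label>`, and everything else gets `O`. This is an illustration only, not medkit's or seqeval's implementation: `iob2_tags` is a hypothetical helper, and whitespace splitting stands in for the DrBERT tokenizer.

```python
import re


def iob2_tags(unit_spans, entities):
    """Tag each (start, end) unit with IOB2 labels from (start, end, label) entities."""
    tags = ["O"] * len(unit_spans)
    for ent_start, ent_end, label in entities:
        first = True
        for i, (start, end) in enumerate(unit_spans):
            # A unit belongs to the entity if their character ranges overlap.
            if start < ent_end and end > ent_start:
                tags[i] = f"{'B' if first else 'I'}-{label}"
                first = False
    return tags


text = "Elle souffre d'asthme mais n'a pas besoin d'Allegra"

# Entity spans taken from the example output earlier in this card.
entities = []
for surface, label in [("asthme", "problem"), ("Allegra", "treatment")]:
    start = text.index(surface)
    entities.append((start, start + len(surface), label))

# Token level: whitespace-separated words (stand-in for a real tokenizer).
word_spans = [m.span() for m in re.finditer(r"\S+", text)]
word_tags = iob2_tags(word_spans, entities)

# Character level: every character is its own unit.
char_spans = [(i, i + 1) for i in range(len(text))]
char_tags = iob2_tags(char_spans, entities)

print(word_tags)
print(char_tags[text.index("asthme"):text.index("asthme") + 6])
```

At word level each entity here is a single `B-` tag, while at character level "asthme" becomes one `B-problem` followed by five `I-problem` tags, which is why scores can differ between the two modes.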
# Citation
```
@online{medkit-lib,
author={HeKA Research Team},
title={medkit, A Python library for a learning health system},
url={https://pypi.org/project/medkit-lib/},
urldate = {2023-07-24},
}
```
```
HeKA Research Team, “medkit, a Python library for a learning health system.” https://pypi.org/project/medkit-lib/ (accessed Jul. 24, 2023).
```