---
license: cc-by-sa-4.0
datasets:
- bigbio/cas
language:
- fr
metrics:
- f1
- precision
- recall
library_name: transformers
tags:
- biomedical
- clinical
- pytorch
- camembert
pipeline_tag: token-classification
inference: false
---
# Privacy-preserving mimic models for clinical named entity recognition in French
In this [paper](https://doi.org/10.1016/j.jbi.2022.104073), we propose a
Privacy-Preserving Mimic Models architecture that enables the generation of shareable models using the *mimic learning* approach.
The idea of mimic learning is to annotate unlabeled public data through a *private teacher model* trained on the original sensitive data.
The newly labeled public dataset is then used to train *student models*. These *student models* can be shared
without sharing the original sensitive data or exposing the *private teacher model* that was trained directly on it.
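The workflow above can be sketched with trivial stand-in models (the helper `train_model` below is purely illustrative; the paper's actual teacher and student are NLstruct-based NER models):

```python
# Toy illustration of mimic learning: a "teacher" trained on private data
# labels public data, and a "student" is trained on those silver labels.
# The models here are trivial stand-ins, not the paper's NER models.

def train_model(examples):
    # Stand-in "model": memorizes a token -> label mapping.
    mapping = {token: label for token, label in examples}
    return lambda token: mapping.get(token, "O")

# 1. Train the private teacher on sensitive, labeled data (never shared).
private_data = [("fièvre", "SIGN"), ("paracétamol", "DRUG")]
teacher = train_model(private_data)

# 2. The teacher annotates unlabeled *public* data, producing silver labels.
public_tokens = ["fièvre", "toux"]
silver = [(tok, teacher(tok)) for tok in public_tokens]

# 3. Train the shareable student on the silver-labeled public data only.
student = train_model(silver)
print(student("fièvre"))  # the student reproduces the teacher's behavior
```

Only the student (trained on public data plus silver labels) is ever released; the private data and the teacher stay behind.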
# CAS Privacy-Preserving Named Entity Recognition (NER) Mimic Model
To generate the CAS Privacy-Preserving Mimic Model, we used a *private teacher model* to annotate the unlabeled
[CAS clinical French corpus](https://aclanthology.org/W18-5614/). The *private teacher model* is an NER model trained on the
[MERLOT clinical corpus](https://link.springer.com/article/10.1007/s10579-017-9382-y) and cannot be shared. Using the produced
[silver annotations](https://zenodo.org/records/6451361), we trained the CAS *student model*, namely the CAS Privacy-Preserving NER Mimic Model.
This training can be viewed as a knowledge transfer process between the *teacher* and the *student* model, carried out in a privacy-preserving manner.
We share only the weights of the CAS *student model*, which is trained on silver-labeled publicly released data.
We argue that no potential attack could reveal information about sensitive private data using the silver annotations
generated by the *private teacher model* on publicly available non-sensitive data.
Our model is built on top of the [CamemBERT](https://huggingface.co/camembert) model using the Natural Language Structuring ([NLstruct](https://github.com/percevalw/nlstruct)) library, which
implements NER models that handle nested entities.
- **Paper:** [Privacy-preserving mimic models for clinical named entity recognition in French](https://doi.org/10.1016/j.jbi.2022.104073)
- **Produced gold and silver annotations for the [DEFT](https://deft.lisn.upsaclay.fr/2020/) and [CAS](https://aclanthology.org/W18-5614/) French clinical corpora:** https://zenodo.org/records/6451361
- **Developed by:** [Nesrine Bannour](https://github.com/NesrineBannour), [Perceval Wajsbürt](https://github.com/percevalw), [Bastien Rance](https://team.inria.fr/heka/fr/team-members/rance/), [Xavier Tannier](http://xavier.tannier.free.fr/) and [Aurélie Névéol](https://perso.limsi.fr/neveol/)
- **Language:** French
- **License:** cc-by-sa-4.0
# Download the CAS Privacy-Preserving NER Mimic Model
```python
import urllib.request

from huggingface_hub import hf_hub_url

# Download the fastText embeddings file into the current working directory
fasttext_url = hf_hub_url(repo_id="NesrineBannour/CAS-privacy-preserving-model", filename="CAS-privacy-preserving-model_fasttext.txt")
urllib.request.urlretrieve(fasttext_url, fasttext_url.split('/')[-1])

# Download the model checkpoint into a folder of your choice
model_url = hf_hub_url(repo_id="NesrineBannour/CAS-privacy-preserving-model", filename="CAS-privacy-preserving-model.ckpt")
path_checkpoint = "path/to/your/folder/" + model_url.split('/')[-1]
urllib.request.urlretrieve(model_url, path_checkpoint)
```
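Alternatively, `hf_hub_download` from the `huggingface_hub` library downloads a file once and returns a cached local path on later calls, which avoids re-downloading the checkpoint:

```python
from huggingface_hub import hf_hub_download, hf_hub_url

repo_id = "NesrineBannour/CAS-privacy-preserving-model"

# hf_hub_url only formats the resolve URL; no network access happens here
ckpt_url = hf_hub_url(repo_id=repo_id, filename="CAS-privacy-preserving-model.ckpt")
print(ckpt_url)

# hf_hub_download fetches the file on first use and returns the cached local
# path on subsequent calls (requires network access on the first call):
# path_checkpoint = hf_hub_download(repo_id=repo_id, filename="CAS-privacy-preserving-model.ckpt")
```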
## 1. Load and use the model using only NLstruct
[NLstruct](https://github.com/percevalw/nlstruct) is the Python library we used to generate our
CAS Privacy-Preserving NER Mimic Model; it natively handles nested entities.
### Install the NLstruct library
```
pip install nlstruct==0.1.0
```
### Use the model
```python
from nlstruct import load_pretrained
from nlstruct.datasets import load_from_brat, export_to_brat
ner_model = load_pretrained(path_checkpoint)
test_data = load_from_brat("path/to/brat/test")
test_predictions = ner_model.predict(test_data)
# Export the predictions into the BRAT standoff format
export_to_brat(test_predictions, filename_prefix="path/to/exported_brat")
```
## 2. Load the model using NLstruct and use it with the Medkit library
[Medkit](https://github.com/TeamHeka/medkit) is a Python library for facilitating the extraction of features from various modalities of patient data,
including textual data.
### Install the Medkit library
```
python -m pip install 'medkit-lib'
```
### Use the model
Our model can be wrapped as a Medkit operation module as follows:
```python
import os
import urllib.request

from huggingface_hub import hf_hub_url
from nlstruct import load_pretrained
from medkit.io.brat import BratInputConverter, BratOutputConverter
from medkit.core import Attribute
from medkit.core.text import NEROperation, Entity, Span, Segment, span_utils


class CASMatcher(NEROperation):
    def __init__(self):
        # Download the fastText file if it is not already present
        fasttext_url = hf_hub_url(repo_id="NesrineBannour/CAS-privacy-preserving-model", filename="CAS-privacy-preserving-model_fasttext.txt")
        if not os.path.exists("CAS-privacy-preserving-model_fasttext.txt"):
            urllib.request.urlretrieve(fasttext_url, fasttext_url.split('/')[-1])
        # Download the model checkpoint if it is not already present
        model_url = hf_hub_url(repo_id="NesrineBannour/CAS-privacy-preserving-model", filename="CAS-privacy-preserving-model.ckpt")
        path_checkpoint = "ner_model/" + model_url.split('/')[-1]
        if not os.path.exists(path_checkpoint):
            os.makedirs("ner_model", exist_ok=True)
            urllib.request.urlretrieve(model_url, path_checkpoint)
        self.model = load_pretrained(path_checkpoint)
        self.model.eval()

    def run(self, segments):
        """Return entities for each match in `segments`.

        Parameters
        ----------
        segments:
            List of segments in which to look for matches.

        Returns
        -------
        List[Entity]
            Entities found in `segments`.
        """
        entities = []
        for segment in segments:
            matches = self.model.predict({"doc_id": segment.uid, "text": segment.text})
            entities.extend(self._matches_to_entities(matches, segment))
        return entities

    def _matches_to_entities(self, matches, segment: Segment):
        for match in matches["entities"]:
            text_all, spans_all = [], []
            for fragment in match["fragments"]:
                text, spans = span_utils.extract(
                    segment.text, segment.spans, [(fragment["begin"], fragment["end"])]
                )
                text_all.append(text)
                spans_all.extend(spans)
            text_all = "".join(text_all)
            entity = Entity(
                label=match["label"],
                text=text_all,
                spans=spans_all,
            )
            # Attach the model confidence score as an entity attribute
            score_attr = Attribute(
                label="confidence",
                value=float(match["confidence"]),
            )
            entity.attrs.add(score_attr)
            yield entity


brat_converter = BratInputConverter()
docs = brat_converter.load("path/to/brat/test")
matcher = CASMatcher()
for doc in docs:
    entities = matcher.run([doc.raw_segment])
    for ent in entities:
        doc.anns.add(ent)

# Export the annotated documents back to the BRAT standoff format,
# keeping the same document names in the output folder
brat_output_converter = BratOutputConverter(attrs=[])
doc_names = [os.path.splitext(os.path.basename(doc.metadata["path_to_text"]))[0] for doc in docs]
brat_output_converter.save(docs, dir_path="path/to/exported_brat", doc_names=doc_names)
```
## Environmental Impact
Carbon emissions are estimated using the [Carbontracker](https://github.com/lfwa/carbontracker) tool.
The version of the tool used at the time of our experiments computed its estimates using the average carbon intensity of
the European Union in 2017 instead of the value for France (294.21 gCO2eq/kWh vs. 85 gCO2eq/kWh).
Therefore, our reported carbon footprint of training both the private model that generated the silver annotations
and the CAS student model is overestimated.
- **Hardware Type:** GPU NVIDIA GTX 1080 Ti
- **Compute Region:** Gif-sur-Yvette, Île-de-France, France
- **Carbon Emitted:** 292 gCO2eq
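As a rough back-of-the-envelope illustration, assuming emissions scale linearly with the grid carbon intensity, a France-based figure can be recovered from the reported estimate:

```python
# The reported 292 gCO2eq was computed with the EU-2017 average carbon
# intensity; rescaling it by the France value gives an illustrative
# corrected estimate (assumes a simple linear relationship).
eu_intensity_2017 = 294.21   # gCO2eq/kWh, EU average used by Carbontracker
fr_intensity = 85.0          # gCO2eq/kWh, France
reported_emissions = 292.0   # gCO2eq

energy_kwh = reported_emissions / eu_intensity_2017  # implied energy use
fr_emissions = energy_kwh * fr_intensity             # rescaled estimate
print(round(fr_emissions))  # → 84 gCO2eq
```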
## Acknowledgements
We thank the institutions and colleagues who made it possible to use the datasets described in this study:
the Biomedical Informatics Department at the Rouen University Hospital provided access to the LERUDI corpus,
and Dr. Grabar (Université de Lille, CNRS, STL) granted permission to use the DEFT/CAS corpus. We would also like to thank
the ITMO Cancer Aviesan for funding our research, and the [HeKA research team](https://team.inria.fr/heka/) for integrating our model
into their library [Medkit](https://github.com/TeamHeka/medkit).
## Citation
If you use this model in your research, please make sure to cite our paper:
```bibtex
@article{BANNOUR2022104073,
title = {Privacy-preserving mimic models for clinical named entity recognition in French},
journal = {Journal of Biomedical Informatics},
volume = {130},
pages = {104073},
year = {2022},
issn = {1532-0464},
doi = {10.1016/j.jbi.2022.104073},
url = {https://www.sciencedirect.com/science/article/pii/S1532046422000892}
}
```