NesrineBannour committed 67eeffe (parent: 374ad40): Update README.md

README.md CHANGED
@@ -24,7 +24,7 @@ In this [paper](https://doi.org/10.1016/j.jbi.2022.104073), we propose a
 Privacy-Preserving Mimic Models architecture that enables the generation of shareable models using the *mimic learning* approach.
 The idea of mimic learning is to annotate unlabeled public data through a *private teacher model* trained on the original sensitive data.
 The newly labeled public dataset is then used to train the *student models*. These generated *student models* could be shared
-without sharing the data itself or exposing the *private teacher model* that was
+without sharing the data itself or exposing the *private teacher model* that was directly built on this data.
 
 # CAS Privacy-Preserving Named Entity Recognition (NER) Mimic Model
 
@@ -36,11 +36,11 @@ To generate the CAS Privacy-Preserving Mimic Model, we used a *private teacher m
 [silver annotations](https://zenodo.org/records/6451361), we train the CAS *student model*, namely the CAS Privacy-Preserving NER Mimic Model.
 This model might be viewed as a knowledge transfer process between the *teacher* and the *student model* in a privacy-preserving manner.
 
-We share only the weights of the CAS *student model*, which is trained on
+We share only the weights of the CAS *student model*, which is trained on silver-labeled publicly released data.
 We argue that no potential attack could reveal information about sensitive private data using the silver annotations
 generated by the *private teacher model* on publicly available non-sensitive data.
 
-Our model is constructed based on [CamemBERT](https://huggingface.co/camembert) model using the Natural language
+Our model is constructed based on the [CamemBERT](https://huggingface.co/camembert) model using the Natural language structuring ([NLstruct](https://github.com/percevalw/nlstruct)) library that
 implements NER models that handle nested entities.
 
 - **Paper:** [Privacy-preserving mimic models for clinical named entity recognition in French](https://doi.org/10.1016/j.jbi.2022.104073)
@@ -79,7 +79,7 @@ implements NER models that handle nested entities.
 ```
 
 ## 1. Load and use the model using only NLstruct
-NLstruct
+[NLstruct](https://github.com/percevalw/nlstruct) is the Python library we used to generate our
 CAS privacy-preserving NER mimic model and that handles nested entities.
 
 ### Install the NLstruct library
@@ -100,8 +100,8 @@ CAS privacy-preserving NER mimic model and that handles nested entities.
 # Export the predictions into the BRAT standoff format
 export_to_brat(test_predictions, filename_prefix="path/to/exported_brat")
 ```
-## 2. Load the model using NLstruct and use it with the
-Medkit
+## 2. Load the model using NLstruct and use it with the Medkit library
+[Medkit](https://github.com/TeamHeka/medkit) is a Python library for facilitating the extraction of features from various modalities of patient data,
 including textual data.
 
 ### Install the Medkit library
@@ -124,12 +124,15 @@ from medkit.core.text import NEROperation,Entity,Span,Segment, span_utils
 
 class CAS_matcher(NEROperation):
     def __init__(self):
-        # Load the
+        # Load the fasttext file
         fasttext_url = hf_hub_url(repo_id="NesrineBannour/CAS-privacy-preserving-model", filename="CAS-privacy-preserving-model_fasttext.txt")
-
+        if not os.path.exists("CAS-privacy-preserving-model_fasttext.txt"):
+            urllib.request.urlretrieve(fasttext_url, fasttext_url.split('/')[-1])
+        # Load the model
         model_url = hf_hub_url(repo_id="NesrineBannour/CAS-privacy-preserving-model", filename="CAS-privacy-preserving-model.ckpt")
-
-
+        if not os.path.exists("ner_model/CAS-privacy-preserving-model.ckpt"):
+            urllib.request.urlretrieve(model_url, "ner_model/" + model_url.split('/')[-1])
+        path_checkpoint = "ner_model/" + model_url.split('/')[-1]
 
         self.model = load_pretrained(path_checkpoint)
         self.model.eval()
@@ -191,7 +194,7 @@ for doc in docs:
     doc.anns.add(ent)
 brat_output_converter = BratOutputConverter(attrs=[])
 # To keep the same document names in the output folder
-doc_names = [os.path.basename(doc.metadata["path_to_text"]) for doc in docs]
+doc_names = [os.path.splitext(os.path.basename(doc.metadata["path_to_text"]))[0] for doc in docs]
 brat_output_converter.save(docs, dir_path="Local/EMEA_out_brat", doc_names=doc_names)
 ```
 
@@ -232,7 +235,9 @@ and the CAS student model is overestimated.
 We thank the institutions and colleagues who made it possible to use the datasets described in this study:
 the Biomedical Informatics Department at the Rouen University Hospital provided access to the LERUDI corpus,
 and Dr. Grabar (Université de Lille, CNRS, STL) granted permission to use the DEFT/CAS corpus. We would also like to thank
-the ITMO Cancer Aviesan for funding our research.
+the ITMO Cancer Aviesan for funding our research, and the [HeKA research team](https://team.inria.fr/heka/) for integrating our model
+into their library [Medkit](https://github.com/TeamHeka/medkit).
+
 
 
 ## Citation
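The last code hunk above changes `doc_names` to wrap `os.path.basename` in `os.path.splitext(...)[0]`, so that the names passed to `BratOutputConverter.save` keep the document stem but drop the `.txt` extension. A minimal standalone sketch of that pattern (the example paths are hypothetical, not from the README):

```python
import os

def brat_doc_name(path_to_text: str) -> str:
    """Strip the directory and the file extension, keeping only the stem,
    so the BRAT .ann/.txt output files share the original document name."""
    return os.path.splitext(os.path.basename(path_to_text))[0]

# Hypothetical paths standing in for doc.metadata["path_to_text"]
paths = ["Local/EMEA/doc_001.txt", "Local/EMEA/doc_002.txt"]
doc_names = [brat_doc_name(p) for p in paths]
print(doc_names)  # ['doc_001', 'doc_002']
```

Without the `splitext` step, the saved BRAT files would end up with a doubled suffix such as `doc_001.txt.ann`, which is why the commit adds it.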