NesrineBannour committed on
Commit
67eeffe
1 Parent(s): 374ad40

Update README.md

Files changed (1)
  1. README.md +17 -12
README.md CHANGED
@@ -24,7 +24,7 @@ In this [paper](https://doi.org/10.1016/j.jbi.2022.104073), we propose a
24
  Privacy-Preserving Mimic Models architecture that enables the generation of shareable models using the *mimic learning* approach.
25
  The idea of mimic learning is to annotate unlabeled public data through a *private teacher model* trained on the original sensitive data.
26
  The newly labeled public dataset is then used to train the *student models*. These generated *student models* could be shared
27
- without sharing the data itself or exposing the *private teacher model* that was directlty built on this data.
28
 
29
  # CAS Privacy-Preserving Named Entity Recognition (NER) Mimic Model
30
 
@@ -36,11 +36,11 @@ To generate the CAS Privacy-Preserving Mimic Model, we used a *private teacher m
36
  [silver annotations](https://zenodo.org/records/6451361), we train the CAS *student model*, namely the CAS Privacy-Preserving NER Mimic Model.
37
  This model might be viewed as a knowledge transfer process between the *teacher* and the *student model* in a privacy-preserving manner.
38
 
39
- We share only the weights of the CAS *student model*, which is trained on a silver-labeled publicly released data.
40
  We argue that no potential attack could reveal information about sensitive private data using the silver annotations
41
  generated by the *private teacher model* on publicly available non-sensitive data.
42
 
43
- Our model is constructed based on [CamemBERT](https://huggingface.co/camembert) model using the Natural language struturing ([NLstruct](https://github.com/percevalw/nlstruct)) library that
44
  implements NER models that handle nested entities.
45
 
46
  - **Paper:** [Privacy-preserving mimic models for clinical named entity recognition in French](https://doi.org/10.1016/j.jbi.2022.104073)
@@ -79,7 +79,7 @@ implements NER models that handle nested entities.
79
  ```
80
 
81
  ## 1. Load and use the model using only NLstruct
82
- NLstruct (https://github.com/percevalw/nlstruct) is the Python library we used to generate our
83
  CAS privacy-preserving NER mimic model and that handles nested entities.
84
 
85
  ### Install the NLstruct library
@@ -100,8 +100,8 @@ CAS privacy-preserving NER mimic model and that handles nested entities.
100
  # Export the predictions into the BRAT standoff format
101
  export_to_brat(test_predictions, filename_prefix="path/to/exported_brat")
102
  ```
103
- ## 2. Load the model using NLstruct and use it with the medkit library
104
- Medkit (https://github.com/TeamHeka/medkit) is a Python library for facilitating the extraction of features from various modalities of patient data,
105
  including textual data.
106
 
107
  ### Install the Medkit library
@@ -124,12 +124,15 @@ from medkit.core.text import NEROperation,Entity,Span,Segment, span_utils
124
 
125
  class CAS_matcher(NEROperation):
126
  def __init__(self):
127
- # Load the model
128
  fasttext_url = hf_hub_url(repo_id="NesrineBannour/CAS-privacy-preserving-model", filename="CAS-privacy-preserving-model_fasttext.txt")
129
- urllib.request.urlretrieve(fasttext_url, fasttext_url.split('/')[-1])
 
 
130
  model_url = hf_hub_url(repo_id="NesrineBannour/CAS-privacy-preserving-model", filename="CAS-privacy-preserving-model.ckpt")
131
- urllib.request.urlretrieve(model_url, "path/to/your/folder/"+ model_url.split('/')[-1])
132
- path_checkpoint = "path/to/your/folder/"+ model_url.split('/')[-1]
 
133
 
134
  self.model = load_pretrained(path_checkpoint)
135
  self.model.eval()
@@ -191,7 +194,7 @@ for doc in docs:
191
  doc.anns.add(ent)
192
  brat_output_converter = BratOutputConverter(attrs=[])
193
  # To keep the same document names in the output folder
194
- doc_names = [os.path.basename(doc.metadata["path_to_text"]) for doc in docs]
195
  brat_output_converter.save(docs, dir_path="Local/EMEA_out_brat", doc_names=doc_names)
196
  ```
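
The `doc_names` line removed in this commit keeps each file's extension, because `os.path.basename` returns the full file name. The replacement line wraps it in `os.path.splitext`, presumably so that the exported BRAT files are not named after `name.txt`. A minimal sketch of the difference, using only the standard library; the helper name `doc_name_from_path` and the example paths are hypothetical:

```python
import os


def doc_name_from_path(path):
    """Return the file name of `path` without its directory or last extension."""
    return os.path.splitext(os.path.basename(path))[0]


# Hypothetical input paths, for illustration only
paths = ["Local/EMEA/doc_1.txt", "Local/EMEA/report.v2.txt"]

# basename alone keeps the extension
print([os.path.basename(p) for p in paths])    # ['doc_1.txt', 'report.v2.txt']

# splitext removes only the last extension
print([doc_name_from_path(p) for p in paths])  # ['doc_1', 'report.v2']
```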
197
 
@@ -232,7 +235,9 @@ and the CAS student model is overestimated.
232
  We thank the institutions and colleagues who made it possible to use the datasets described in this study:
233
  the Biomedical Informatics Department at the Rouen University Hospital provided access to the LERUDI corpus,
234
  and Dr. Grabar (Université de Lille, CNRS, STL) granted permission to use the DEFT/CAS corpus. We would also like to thank
235
- the ITMO Cancer Aviesan for funding our research.
 
 
236
 
237
 
238
  ## Citation
 
24
  Privacy-Preserving Mimic Models architecture that enables the generation of shareable models using the *mimic learning* approach.
25
  The idea of mimic learning is to annotate unlabeled public data through a *private teacher model* trained on the original sensitive data.
26
  The newly labeled public dataset is then used to train the *student models*. These generated *student models* could be shared
27
+ without sharing the data itself or exposing the *private teacher model* that was directly built on this data.
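
The three steps of mimic learning described above can be sketched with a toy example. This is only an illustration under strong assumptions: the "models" are simple token-lookup taggers rather than the CamemBERT-based NER models used in the paper, and the sentences, the `SYMPTOM` label, and the function name `train_lookup_tagger` are all hypothetical.

```python
from collections import Counter, defaultdict


def train_lookup_tagger(labeled_sentences):
    """Train a toy tagger that maps each token to its most frequent label."""
    counts = defaultdict(Counter)
    for tokens, labels in labeled_sentences:
        for token, label in zip(tokens, labels):
            counts[token.lower()][label] += 1
    table = {tok: c.most_common(1)[0][0] for tok, c in counts.items()}

    def tag(tokens):
        # Unknown tokens fall back to the "outside" label
        return [table.get(t.lower(), "O") for t in tokens]

    return tag


# 1. Train the private teacher on sensitive, gold-labeled data (never shared)
private_data = [
    (["patient", "has", "fever"], ["O", "O", "SYMPTOM"]),
    (["severe", "cough", "noted"], ["O", "SYMPTOM", "O"]),
]
teacher = train_lookup_tagger(private_data)

# 2. Use the teacher to produce silver annotations on unlabeled public data
public_sentences = [["fever", "and", "cough"], ["no", "fever", "today"]]
silver_data = [(sentence, teacher(sentence)) for sentence in public_sentences]

# 3. Train the student only on the silver-labeled public data
student = train_lookup_tagger(silver_data)

print(student(["cough", "without", "fever"]))  # ['SYMPTOM', 'O', 'SYMPTOM']
```

Only `student` would be shared in this setup; it has never seen `private_data` directly, only the labels the teacher assigned to public text.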
28
 
29
  # CAS Privacy-Preserving Named Entity Recognition (NER) Mimic Model
30
 
 
36
  [silver annotations](https://zenodo.org/records/6451361), we train the CAS *student model*, namely the CAS Privacy-Preserving NER Mimic Model.
37
  This model might be viewed as a knowledge transfer process between the *teacher* and the *student model* in a privacy-preserving manner.
38
 
39
+ We share only the weights of the CAS *student model*, which is trained on silver-labeled publicly released data.
40
  We argue that no potential attack could reveal information about sensitive private data using the silver annotations
41
  generated by the *private teacher model* on publicly available non-sensitive data.
42
 
43
+ Our model is constructed based on [CamemBERT](https://huggingface.co/camembert) model using the Natural language structuring ([NLstruct](https://github.com/percevalw/nlstruct)) library that
44
  implements NER models that handle nested entities.
45
 
46
  - **Paper:** [Privacy-preserving mimic models for clinical named entity recognition in French](https://doi.org/10.1016/j.jbi.2022.104073)
 
79
  ```
80
 
81
  ## 1. Load and use the model using only NLstruct
82
+ [NLstruct](https://github.com/percevalw/nlstruct) is the Python library we used to generate our
83
  CAS privacy-preserving NER mimic model and that handles nested entities.
84
 
85
  ### Install the NLstruct library
 
100
  # Export the predictions into the BRAT standoff format
101
  export_to_brat(test_predictions, filename_prefix="path/to/exported_brat")
102
  ```
103
+ ## 2. Load the model using NLstruct and use it with the Medkit library
104
+ [Medkit](https://github.com/TeamHeka/medkit) is a Python library for facilitating the extraction of features from various modalities of patient data,
105
  including textual data.
106
 
107
  ### Install the Medkit library
 
124
 
125
  class CAS_matcher(NEROperation):
126
  def __init__(self):
127
+ # Load the fasttext file
128
  fasttext_url = hf_hub_url(repo_id="NesrineBannour/CAS-privacy-preserving-model", filename="CAS-privacy-preserving-model_fasttext.txt")
129
+ if not os.path.exists("CAS-privacy-preserving-model_fasttext.txt"):
130
+ urllib.request.urlretrieve(fasttext_url, fasttext_url.split('/')[-1])
131
+ # Load the model
132
  model_url = hf_hub_url(repo_id="NesrineBannour/CAS-privacy-preserving-model", filename="CAS-privacy-preserving-model.ckpt")
133
+ if not os.path.exists("ner_model/CAS-privacy-preserving-model.ckpt"):
134
+ urllib.request.urlretrieve(model_url, "ner_model/"+ model_url.split('/')[-1])
135
+ path_checkpoint = "ner_model/"+ model_url.split('/')[-1]
136
 
137
  self.model = load_pretrained(path_checkpoint)
138
  self.model.eval()
 
194
  doc.anns.add(ent)
195
  brat_output_converter = BratOutputConverter(attrs=[])
196
  # To keep the same document names in the output folder
197
+ doc_names = [os.path.splitext(os.path.basename(doc.metadata["path_to_text"]))[0] for doc in docs]
198
  brat_output_converter.save(docs, dir_path="Local/EMEA_out_brat", doc_names=doc_names)
199
  ```
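
The updated `__init__` downloads each file only when it is not already on disk, and stores the checkpoint under `ner_model/`. Since `urllib.request.urlretrieve` does not create missing directories, that folder presumably has to exist beforehand. A minimal sketch of the same download-if-missing pattern that also creates the parent folder; the helper name `download_if_missing` and its `fetch` parameter are hypothetical and not part of the README:

```python
import os
import urllib.request


def download_if_missing(url, dest_path, fetch=urllib.request.urlretrieve):
    """Download `url` to `dest_path` unless the file already exists.

    `fetch` is any callable taking (url, path); it defaults to urlretrieve
    and can be swapped out, for example in tests.
    """
    if not os.path.exists(dest_path):
        parent = os.path.dirname(dest_path)
        if parent:
            # Create the target folder (e.g. "ner_model") if needed
            os.makedirs(parent, exist_ok=True)
        fetch(url, dest_path)
    return dest_path
```

With such a helper, the checkpoint step could read `path_checkpoint = download_if_missing(model_url, "ner_model/CAS-privacy-preserving-model.ckpt")`. The `hf_hub_download` function from `huggingface_hub` is another option, as it handles caching itself and returns the local path.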
200
 
 
235
  We thank the institutions and colleagues who made it possible to use the datasets described in this study:
236
  the Biomedical Informatics Department at the Rouen University Hospital provided access to the LERUDI corpus,
237
  and Dr. Grabar (Université de Lille, CNRS, STL) granted permission to use the DEFT/CAS corpus. We would also like to thank
238
+ the ITMO Cancer Aviesan for funding our research, and the [HeKA research team](https://team.inria.fr/heka/) for integrating our model
239
+ into their library [Medkit](https://github.com/TeamHeka/medkit).
240
+
241
 
242
 
243
  ## Citation