NesrineBannour committed 67eeffe (parent: 374ad40): Update README.md

README.md CHANGED
@@ -24,7 +24,7 @@ In this [paper](https://doi.org/10.1016/j.jbi.2022.104073), we propose a
 Privacy-Preserving Mimic Models architecture that enables the generation of shareable models using the *mimic learning* approach.
 The idea of mimic learning is to annotate unlabeled public data through a *private teacher model* trained on the original sensitive data.
 The newly labeled public dataset is then used to train the *student models*. These generated *student models* could be shared
-without sharing the data itself or exposing the *private teacher model* that was
+without sharing the data itself or exposing the *private teacher model* that was directly built on this data.
 
 # CAS Privacy-Preserving Named Entity Recognition (NER) Mimic Model
 
@@ -36,11 +36,11 @@ To generate the CAS Privacy-Preserving Mimic Model, we used a *private teacher m
 [silver annotations](https://zenodo.org/records/6451361), we train the CAS *student model*, namely the CAS Privacy-Preserving NER Mimic Model.
 This model might be viewed as a knowledge transfer process between the *teacher* and the *student model* in a privacy-preserving manner.
 
-We share only the weights of the CAS *student model*, which is trained on
+We share only the weights of the CAS *student model*, which is trained on silver-labeled publicly released data.
 We argue that no potential attack could reveal information about sensitive private data using the silver annotations
 generated by the *private teacher model* on publicly available non-sensitive data.
 
-Our model is constructed based on [CamemBERT](https://huggingface.co/camembert) model using the Natural language
+Our model is constructed based on the [CamemBERT](https://huggingface.co/camembert) model using the Natural language structuring ([NLstruct](https://github.com/percevalw/nlstruct)) library that
 implements NER models that handle nested entities.
 
 - **Paper:** [Privacy-preserving mimic models for clinical named entity recognition in French](https://doi.org/10.1016/j.jbi.2022.104073)
@@ -79,7 +79,7 @@ implements NER models that handle nested entities.
 ```
 
 ## 1. Load and use the model using only NLstruct
-NLstruct
+[NLstruct](https://github.com/percevalw/nlstruct) is the Python library we used to generate our
 CAS privacy-preserving NER mimic model and that handles nested entities.
 
 ### Install the NLstruct library
@@ -100,8 +100,8 @@ CAS privacy-preserving NER mimic model and that handles nested entities.
 # Export the predictions into the BRAT standoff format
 export_to_brat(test_predictions, filename_prefix="path/to/exported_brat")
 ```
-## 2. Load the model using NLstruct and use it with the
-Medkit
+## 2. Load the model using NLstruct and use it with the Medkit library
+[Medkit](https://github.com/TeamHeka/medkit) is a Python library for facilitating the extraction of features from various modalities of patient data,
 including textual data.
 
 ### Install the Medkit library
@@ -124,12 +124,15 @@ from medkit.core.text import NEROperation,Entity,Span,Segment, span_utils
 
 class CAS_matcher(NEROperation):
     def __init__(self):
-        # Load the
+        # Load the fasttext file
         fasttext_url = hf_hub_url(repo_id="NesrineBannour/CAS-privacy-preserving-model", filename="CAS-privacy-preserving-model_fasttext.txt")
-
+        if not os.path.exists("CAS-privacy-preserving-model_fasttext.txt"):
+            urllib.request.urlretrieve(fasttext_url, fasttext_url.split('/')[-1])
+        # Load the model
         model_url = hf_hub_url(repo_id="NesrineBannour/CAS-privacy-preserving-model", filename="CAS-privacy-preserving-model.ckpt")
-
-
+        if not os.path.exists("ner_model/CAS-privacy-preserving-model.ckpt"):
+            urllib.request.urlretrieve(model_url, "ner_model/" + model_url.split('/')[-1])
+        path_checkpoint = "ner_model/" + model_url.split('/')[-1]
 
         self.model = load_pretrained(path_checkpoint)
         self.model.eval()
@@ -191,7 +194,7 @@ for doc in docs:
     doc.anns.add(ent)
 brat_output_converter = BratOutputConverter(attrs=[])
 # To keep the same document names in the output folder
-doc_names = [os.path.basename(doc.metadata["path_to_text"]) for doc in docs]
+doc_names = [os.path.splitext(os.path.basename(doc.metadata["path_to_text"]))[0] for doc in docs]
 brat_output_converter.save(docs, dir_path="Local/EMEA_out_brat", doc_names=doc_names)
 ```
 
@@ -232,7 +235,9 @@ and the CAS student model is overestimated.
 We thank the institutions and colleagues who made it possible to use the datasets described in this study:
 the Biomedical Informatics Department at the Rouen University Hospital provided access to the LERUDI corpus,
 and Dr. Grabar (Université de Lille, CNRS, STL) granted permission to use the DEFT/CAS corpus. We would also like to thank
-the ITMO Cancer Aviesan for funding our research.
+the ITMO Cancer Aviesan for funding our research, and the [HeKA research team](https://team.inria.fr/heka/) for integrating our model
+into their library [Medkit](https://github.com/TeamHeka/medkit).
+
 
 
 ## Citation
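The last code hunk above changes `doc_names` to wrap `os.path.basename` in `os.path.splitext(...)[0]`, so that the names passed to `BratOutputConverter.save` keep the document stem but drop the `.txt` extension. A minimal standalone sketch of that pattern (the example paths are hypothetical, not from the README):

```python
import os

def brat_doc_name(path_to_text: str) -> str:
    """Strip the directory and the file extension, keeping only the stem,
    so the BRAT .ann/.txt output files share the original document name."""
    return os.path.splitext(os.path.basename(path_to_text))[0]

# Hypothetical paths standing in for doc.metadata["path_to_text"]
paths = ["Local/EMEA/doc_001.txt", "Local/EMEA/doc_002.txt"]
doc_names = [brat_doc_name(p) for p in paths]
print(doc_names)  # ['doc_001', 'doc_002']
```

Without the `splitext` step, the saved BRAT files would end up with a doubled suffix such as `doc_001.txt.ann`, which is why the commit adds it.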