Update README.md

The final solution could be integrated into various systems, including client support, legal, and general data anonymization tools, and enhance privacy. Success in this project will contribute to scaling privacy-conscious AI systems without compromising UX or operational performance.

## Getting Started

Create a `.env` file by copying the `.env.example` file and renaming it to `.env`, then fill in the required values:

```bash
cp .env.example .env
```

### Install the dependencies

```bash
pip install -r requirements.txt
```

## Set `PYTHONPATH` if needed

```bash
export PYTHONPATH="${PYTHONPATH}:$PWD"
```

## Inference

### Inference on the full dataset

You can run inference on the complete test dataset using the following command:

```bash
python inference.py -s ./dataset/test
```

### Inference on a small dataset

To perform inference on a small subset of the dataset, use the `--subsample` flag:

```bash
python inference.py -s ./dataset/test --subsample
```

## Run UI

To run the UI for interacting with the models and viewing results, use Streamlit:

```bash
streamlit run ui.py
```

## Run API

To start the API for the model, you'll need FastAPI. Run the following command:

```bash
fastapi run api.py
```

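Once the server is running (the FastAPI CLI serves on `http://127.0.0.1:8000` by default), you can query it over HTTP. The snippet below is a minimal sketch using a hypothetical `/anonymize` endpoint and payload; check `api.py` or the auto-generated interactive docs at `/docs` for the actual routes and request shape.

```python
import requests

# Hypothetical route and payload: verify the real endpoint names in api.py
# or at http://127.0.0.1:8000/docs before using this.
response = requests.post(
    "http://127.0.0.1:8000/anonymize",
    json={"text": "John Doe lives at 221B Baker Street."},
    timeout=30,
)
response.raise_for_status()
print(response.json())
```
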
## Experiments

This repository supports two main types of experiments:

1. Fine-tuning models from the BERT family.
2. Fine-tuning models from the GLiNER family.

Both experiment types are located in the `experiments/` folder, and each fine-tuning script allows you to pass specific arguments related to model choices, datasets, output directories, and optional alternative dataset columns.

### BERT Fine-Tuning

The BERT fine-tuning script enables you to fine-tune models from the BERT family on a specific dataset. Optionally, you can utilize alternative columns that are preprocessed during the data preparation phase.

```bash
python experiments/bert_finetune.py --dataset path/to/dataset --model model_name --output_dir /path/to/output [--alternative_columns]
```

#### Available BERT models

Here is a list of available BERT models that can be used for fine-tuning. Additional models based on the BERT tokenizer may also work with minimal modifications:

- BERT classic
  + `bert-base-uncased`, `bert-large-uncased`, `bert-base-cased`, `bert-large-cased`
- DistilBERT
  + `distilbert-base-uncased`, `distilbert-base-cased`
- RoBERTa
  + `roberta-base`, `roberta-large`
- ALBERT
  + `albert-base-v2`, `albert-large-v2`, `albert-xlarge-v2`, `albert-xxlarge-v2`
- ELECTRA
  + `google/electra-small-discriminator`, `google/electra-base-discriminator`, `google/electra-large-discriminator`
- DeBERTa
  + `microsoft/deberta-base`, `microsoft/deberta-large`

### GLiNER Fine-Tuning

The GLiNER models require an additional dataset preparation step before starting the fine-tuning process. The process happens in two stages:

1. Prepare the dataset for GLiNER models. Run the GLiNER dataset preparation script to pre-process your dataset:

   ```bash
   python experiments/gliner_prepare.py --dataset path/to/dataset
   ```

   This will create a new JSON-formatted dataset file with the same name in the specified output directory (a sketch of the typical record shape follows this list).

2. Fine-tune the GLiNER model. After the dataset preparation, run the GLiNER fine-tuning script:

   ```bash
   python experiments/gliner_finetune.py --dataset path/to/prepared/dataset.json --model model_name --output_dir /path/to/output [--alternative_columns]
   ```

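GLiNER-style training data is usually a JSON list of records containing pre-tokenized text and token-level span annotations. The exact fields written by `experiments/gliner_prepare.py` are not documented here, so treat the keys below (`tokenized_text` and `ner`, following the common GLiNER fine-tuning convention) as assumptions and verify them against a real record:

```python
import json

# Load the prepared dataset and inspect one record. The field names
# "tokenized_text" and "ner" are assumptions based on the common GLiNER
# fine-tuning format; confirm them against the generated file.
with open("path/to/prepared/dataset.json", encoding="utf-8") as f:
    records = json.load(f)

sample = records[0]
print(sample["tokenized_text"][:20])  # first tokens of the example
print(sample["ner"][:5])              # spans such as [start_token, end_token, label]
```
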
#### Available GLiNER models

You can use the following GLiNER models for fine-tuning, though additional compatible models may work similarly:

- `gliner-community/gliner_xxl-v2.5`
- `gliner-community/gliner_large-v2.5`
- `gliner-community/gliner_medium-v2.5`
- `gliner-community/gliner_small-v2.5`

## Results

A `results/` folder is available in the repository to store the outputs of the various experiments and their related metrics.

## Other Information

We also provide a solution to the issue reported in the [pii-masking-400k](https://huggingface.co/datasets/ai4privacy/pii-masking-400k/discussions/3) repository. We created a method to transform natural language text into a token-tag format that can be used to train a Named Entity Recognition (NER) model with the Hugging Face `AutoTrain` API.

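As an illustration of that transformation, here is a minimal sketch assuming records shaped like those in `pii-masking-400k`, with a `source_text` string and a `privacy_mask` list of `{label, start, end}` character spans (treat these field names as assumptions and adapt them to the actual schema). It turns each record into whitespace tokens with BIO tags:

```python
import re

def to_token_tags(source_text, privacy_mask):
    """Convert text plus character-level PII spans into BIO token/tag pairs.

    privacy_mask: list of {"label": str, "start": int, "end": int} dicts.
    The field names follow the pii-masking-400k convention but are
    assumptions; verify them against the actual dataset schema.
    """
    tokens, tags = [], []
    for match in re.finditer(r"\S+", source_text):  # whitespace tokens with offsets
        tok_start, tok_end = match.start(), match.end()
        tag = "O"
        for span in privacy_mask:
            if tok_start < span["end"] and tok_end > span["start"]:  # overlap
                # B- marks the first token of an entity, I- a continuation.
                prefix = "B" if tok_start <= span["start"] else "I"
                tag = f"{prefix}-{span['label']}"
                break
        tokens.append(match.group())
        tags.append(tag)
    return tokens, tags

tokens, tags = to_token_tags(
    "John Doe lives in Berlin.",
    [{"label": "NAME", "start": 0, "end": 8}, {"label": "CITY", "start": 18, "end": 24}],
)
print(list(zip(tokens, tags)))
# [('John', 'B-NAME'), ('Doe', 'I-NAME'), ('lives', 'O'), ('in', 'O'), ('Berlin.', 'B-CITY')]
```

The resulting token/tag pairs can then be written out in whatever column layout the `AutoTrain` token-classification task expects.
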
## Disclaimer