Token Classification · Transformers · Safetensors · distilbert · Inference Endpoints
RedHitMark committed · Commit a26cffe · verified · 1 Parent(s): 16cd4d1

Update README.md

Files changed (1)
  1. README.md +0 -136
README.md CHANGED
@@ -33,142 +33,6 @@ The final solution could be integrated into various systems and enhance privacy
including client support, legal, and general data anonymization tools. Success in this project will contribute to
scaling privacy-conscious AI systems without compromising the UX or operational performance.

## Getting Started

Create a `.env` file by copying `.env.example`, renaming the copy to `.env`, and filling in the required values.

```bash
cp .env.example .env
```

### Install the dependencies

```bash
pip install -r requirements.txt
```

### Set `PYTHONPATH` if needed

```bash
export PYTHONPATH="${PYTHONPATH}:$PWD"
```

## Inference

### Inference on the full dataset

You can run inference on the complete test dataset using the following command:

```bash
python inference.py -s ./dataset/test
```

### Inference on a small dataset

To perform inference on a small subset of the dataset, use the `--subsample` flag:

```bash
python inference.py -s ./dataset/test --subsample
```

## Run the UI

To run the UI for interacting with the models and viewing results, use Streamlit:

```bash
streamlit run ui.py
```

## Run the API

To start the API for the model, you'll need FastAPI. Run the following command:

```bash
fastapi run api.py
```
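
The README does not document the API's routes or payload schema, so the snippet below is only an illustrative client sketch: it assumes a hypothetical `POST /predict` endpoint on FastAPI's default port that accepts a JSON body with a `text` field. Check `api.py` for the actual routes before relying on it.

```python
# Illustrative client call; the /predict route and payload shape are assumptions,
# not documented in this README -- check api.py for the real endpoint.
import requests

response = requests.post(
    "http://localhost:8000/predict",  # hypothetical endpoint
    json={"text": "John Doe lives at 42 Main Street."},
    timeout=30,
)
response.raise_for_status()
print(response.json())  # e.g. the detected PII entities
```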

## Experiments

This repository supports two main types of experiments:

1. Fine-tuning models from the BERT family.
2. Fine-tuning models from the GLiNER family.

Both experiment types are located in the `experiments/` folder, and each fine-tuning script lets you pass arguments for the model choice, dataset, output directory, and optional alternative dataset columns.

### BERT Fine-Tuning

The BERT fine-tuning script fine-tunes models from the BERT family on a specific dataset. Optionally, you can use the alternative columns that are preprocessed during the data preparation phase.

```bash
python experiments/bert_finetune.py --dataset path/to/dataset --model model_name --output_dir /path/to/output [--alternative_columns]
```

#### Available BERT models

Here is a list of available BERT models that can be used for fine-tuning; additional models based on the BERT tokenizer may also work with minimal modifications (a short loading sketch follows the list):

- BERT classic
  + `bert-base-uncased`, `bert-large-uncased`, `bert-base-cased`, `bert-large-cased`
- DistilBERT
  + `distilbert-base-uncased`, `distilbert-base-cased`
- RoBERTa
  + `roberta-base`, `roberta-large`
- ALBERT
  + `albert-base-v2`, `albert-large-v2`, `albert-xlarge-v2`, `albert-xxlarge-v2`
- Electra
  + `google/electra-small-discriminator`, `google/electra-base-discriminator`, `google/electra-large-discriminator`
- DeBERTa
  + `microsoft/deberta-base`, `microsoft/deberta-large`
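
All of the checkpoints above load through the same `transformers` auto classes, which is why they can be swapped with little effort. The sketch below is only an illustration: the checkpoint name and `num_labels` are placeholders, not values taken from this repository's training script.

```python
# Minimal sketch: load one of the listed checkpoints for token classification.
# The checkpoint and num_labels are illustrative placeholders.
from transformers import AutoTokenizer, AutoModelForTokenClassification

checkpoint = "distilbert-base-uncased"  # any model from the list above
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(
    checkpoint,
    num_labels=9,  # placeholder: set this to the size of your PII tag set
)

inputs = tokenizer("John Doe lives at 42 Main Street.", return_tensors="pt")
logits = model(**inputs).logits
print(logits.shape)  # (batch_size, sequence_length, num_labels)
```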

### GLiNER Fine-Tuning

The GLiNER models require an additional dataset preparation step before fine-tuning. The process happens in two stages.

1. Step 1: Prepare the dataset for GLiNER models. Run the GLiNER dataset preparation script to pre-process your dataset:

```bash
python experiments/gliner_prepare.py --dataset path/to/dataset
```

This will create a new JSON-formatted dataset file with the same name in the specified output directory.
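
The exact schema that `gliner_prepare.py` emits is not shown in this README. As a rough orientation only, GLiNER-style training data is usually a JSON list of records holding pre-tokenized text plus token-index entity spans; the field names below are an assumption, not the script's documented output.

```python
# Hypothetical example of a single GLiNER-style training record; the field names
# ("tokenized_text", "ner") are assumptions, not taken from gliner_prepare.py.
import json

record = {
    "tokenized_text": ["John", "Doe", "lives", "at", "42", "Main", "Street", "."],
    # Each span is [start_token_index, end_token_index, label], indices inclusive.
    "ner": [[0, 1, "person"], [4, 6, "street address"]],
}
print(json.dumps([record], indent=2))  # the prepared dataset is a list of such records
```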

2. Step 2: Fine-tune the GLiNER model. After the dataset preparation, run the GLiNER fine-tuning script:

```bash
python experiments/gliner_finetune.py --dataset path/to/prepared/dataset.json --model model_name --output_dir /path/to/output [--alternative_columns]
```

#### Available GLiNER models

You can use the following GLiNER models for fine-tuning, though additional compatible models may work similarly (a brief usage sketch follows the list):

- `gliner-community/gliner_xxl-v2.5`
- `gliner-community/gliner_large-v2.5`
- `gliner-community/gliner_medium-v2.5`
- `gliner-community/gliner_small-v2.5`
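
Note that these checkpoints are loaded through the `gliner` Python package rather than the plain `transformers` classes. The snippet below is a small zero-shot usage sketch; the entity labels are placeholders rather than the label set used in this project's experiments.

```python
# Minimal GLiNER usage sketch; the labels are illustrative placeholders.
from gliner import GLiNER

model = GLiNER.from_pretrained("gliner-community/gliner_small-v2.5")

text = "John Doe lives at 42 Main Street and his email is john@example.com."
labels = ["person", "street address", "email"]  # placeholder PII label set

for entity in model.predict_entities(text, labels, threshold=0.5):
    print(entity["text"], "->", entity["label"])
```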

## Results

A results folder is available in the repository to store the outputs of the various experiments and the related metrics.

## Other Information

We also provide a solution to the issue reported in the [pii-masking-400k](https://huggingface.co/datasets/ai4privacy/pii-masking-400k/discussions/3) dataset repository. We created a method to transform the natural-language text into a token-tag format that can be used to train a Named Entity Recognition (NER) model with the Hugging Face `AutoTrain` API.
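
The conversion method itself is not included in this README, so the sketch below only illustrates the general idea: turning text plus character-level PII spans into token/tag pairs with a BIO scheme. The span fields (`start`, `end`, `label`) and the whitespace tokenizer are assumptions made for the example, not the repository's actual implementation.

```python
# Illustrative sketch of a text -> token/tag (BIO) conversion.
# The span fields ("start", "end", "label") are assumptions for this example,
# not the actual schema used by the repository's conversion method.
def tokenize_with_offsets(text: str):
    """Whitespace tokenizer that also yields each token's character offset."""
    offset = 0
    for token in text.split():
        start = text.index(token, offset)
        offset = start + len(token)
        yield token, start

def to_token_tags(text: str, spans: list[dict]) -> list[tuple[str, str]]:
    """Assign a BIO tag to every token based on the character span it falls in."""
    pairs = []
    for token, start in tokenize_with_offsets(text):
        end = start + len(token)
        tag = "O"
        for span in spans:
            if start >= span["start"] and end <= span["end"]:
                prefix = "B" if start == span["start"] else "I"
                tag = f"{prefix}-{span['label']}"
                break
        pairs.append((token, tag))
    return pairs

print(to_token_tags(
    "John Doe lives in Rome",
    [{"start": 0, "end": 8, "label": "PERSON"}, {"start": 18, "end": 22, "label": "CITY"}],
))
# [('John', 'B-PERSON'), ('Doe', 'I-PERSON'), ('lives', 'O'), ('in', 'O'), ('Rome', 'B-CITY')]
```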

## Disclaimer