Update README.md

The final solution could be integrated into various systems, including client support, legal, and general data anonymization tools, and enhance privacy. Success in this project will contribute to scaling privacy-conscious AI systems without compromising UX or operational performance.

## Getting Started

Create a `.env` file by copying the `.env.example` file and renaming it to `.env`, then fill in the required values:

```bash
cp .env.example .env
```

### Install the dependencies

```bash
pip install -r requirements.txt
```

## Set `PYTHONPATH` if needed

```bash
export PYTHONPATH="${PYTHONPATH}:$PWD"
```

## Inference

### Inference on the full dataset

You can run inference on the complete test dataset using the following command:

```bash
python inference.py -s ./dataset/test
```

### Inference on a small dataset

To perform inference on a small subset of the dataset, use the `--subsample` flag:

```bash
python inference.py -s ./dataset/test --subsample
```

## Run UI

To run the UI for interacting with the models and viewing results, use Streamlit:

```bash
streamlit run ui.py
```

## Run API

To start the API for the model, you'll need FastAPI. Run the following command:

```bash
fastapi run api.py
```

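Once the server is running (the FastAPI CLI serves on `http://127.0.0.1:8000` by default), you can query it over HTTP. The snippet below is a minimal sketch using a hypothetical `/anonymize` endpoint and payload; check `api.py` or the auto-generated interactive docs at `/docs` for the actual routes and request shape.

```python
import requests

# Hypothetical route and payload: verify the real endpoint names in api.py
# or at http://127.0.0.1:8000/docs before using this.
response = requests.post(
    "http://127.0.0.1:8000/anonymize",
    json={"text": "John Doe lives at 221B Baker Street."},
    timeout=30,
)
response.raise_for_status()
print(response.json())
```
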
## Experiments

This repository supports two main types of experiments:

1. Fine-tuning models from the BERT family.
2. Fine-tuning models from the GLiNER family.

Both experiment types are located in the `experiments/` folder, and each fine-tuning script allows you to pass specific arguments related to model choices, datasets, output directories, and optional alternative dataset columns.

### BERT Fine-Tuning

The BERT fine-tuning script enables you to fine-tune models from the BERT family on a specific dataset. Optionally, you can utilize alternative columns that are preprocessed during the data preparation phase.

```bash
python experiments/bert_finetune.py --dataset path/to/dataset --model model_name --output_dir /path/to/output [--alternative_columns]
```

#### Available BERT models

Here is a list of available BERT models that can be used for fine-tuning. Additional models based on the BERT tokenizer may also work with minimal modifications:

- BERT classic
  + `bert-base-uncased`, `bert-large-uncased`, `bert-base-cased`, `bert-large-cased`
- DistilBERT
  + `distilbert-base-uncased`, `distilbert-base-cased`
- RoBERTa
  + `roberta-base`, `roberta-large`
- ALBERT
  + `albert-base-v2`, `albert-large-v2`, `albert-xlarge-v2`, `albert-xxlarge-v2`
- ELECTRA
  + `google/electra-small-discriminator`, `google/electra-base-discriminator`, `google/electra-large-discriminator`
- DeBERTa
  + `microsoft/deberta-base`, `microsoft/deberta-large`

### GLiNER Fine-Tuning

The GLiNER models require an additional dataset preparation step before starting the fine-tuning process. The process happens in two stages:

1. Prepare the dataset for GLiNER models. Run the GLiNER dataset preparation script to pre-process your dataset:

   ```bash
   python experiments/gliner_prepare.py --dataset path/to/dataset
   ```

   This will create a new JSON-formatted dataset file with the same name in the specified output directory (a sketch of the typical record shape follows this list).

2. Fine-tune the GLiNER model. After the dataset preparation, run the GLiNER fine-tuning script:

   ```bash
   python experiments/gliner_finetune.py --dataset path/to/prepared/dataset.json --model model_name --output_dir /path/to/output [--alternative_columns]
   ```

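GLiNER-style training data is usually a JSON list of records containing pre-tokenized text and token-level span annotations. The exact fields written by `experiments/gliner_prepare.py` are not documented here, so treat the keys below (`tokenized_text` and `ner`, following the common GLiNER fine-tuning convention) as assumptions and verify them against a real record:

```python
import json

# Load the prepared dataset and inspect one record. The field names
# "tokenized_text" and "ner" are assumptions based on the common GLiNER
# fine-tuning format; confirm them against the generated file.
with open("path/to/prepared/dataset.json", encoding="utf-8") as f:
    records = json.load(f)

sample = records[0]
print(sample["tokenized_text"][:20])  # first tokens of the example
print(sample["ner"][:5])              # spans such as [start_token, end_token, label]
```
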
#### Available GLiNER models

You can use the following GLiNER models for fine-tuning, though additional compatible models may work similarly:

- `gliner-community/gliner_xxl-v2.5`
- `gliner-community/gliner_large-v2.5`
- `gliner-community/gliner_medium-v2.5`
- `gliner-community/gliner_small-v2.5`

## Results

A `results/` folder is available in the repository to store the outputs of the various experiments and their related metrics.

## Other Information

We also provide a solution to the issue reported in the [pii-masking-400k](https://huggingface.co/datasets/ai4privacy/pii-masking-400k/discussions/3) repository. We created a method to transform natural language text into a token-tag format that can be used to train a Named Entity Recognition (NER) model with the Hugging Face `AutoTrain` API.

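As an illustration of that transformation, here is a minimal sketch assuming records shaped like those in `pii-masking-400k`, with a `source_text` string and a `privacy_mask` list of `{label, start, end}` character spans (treat these field names as assumptions and adapt them to the actual schema). It turns each record into whitespace tokens with BIO tags:

```python
import re

def to_token_tags(source_text, privacy_mask):
    """Convert text plus character-level PII spans into BIO token/tag pairs.

    privacy_mask: list of {"label": str, "start": int, "end": int} dicts.
    The field names follow the pii-masking-400k convention but are
    assumptions; verify them against the actual dataset schema.
    """
    tokens, tags = [], []
    for match in re.finditer(r"\S+", source_text):  # whitespace tokens with offsets
        tok_start, tok_end = match.start(), match.end()
        tag = "O"
        for span in privacy_mask:
            if tok_start < span["end"] and tok_end > span["start"]:  # overlap
                # B- marks the first token of an entity, I- a continuation.
                prefix = "B" if tok_start <= span["start"] else "I"
                tag = f"{prefix}-{span['label']}"
                break
        tokens.append(match.group())
        tags.append(tag)
    return tokens, tags

tokens, tags = to_token_tags(
    "John Doe lives in Berlin.",
    [{"label": "NAME", "start": 0, "end": 8}, {"label": "CITY", "start": 18, "end": 24}],
)
print(list(zip(tokens, tags)))
# [('John', 'B-NAME'), ('Doe', 'I-NAME'), ('lives', 'O'), ('in', 'O'), ('Berlin.', 'B-CITY')]
```

The resulting token/tag pairs can then be written out in whatever column layout the `AutoTrain` token-classification task expects.
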
## Disclaimer