PII Detection Model - Phi3 Mini Fine-Tuned
This repository contains a fine-tuned version of the Phi3 Mini model for detecting personally identifiable information (PII). The model has been trained to recognize the PII entities listed below, which makes it useful for data redaction, privacy protection, and compliance with data protection regulations.
Model Overview
Model Architecture
- Base Model: Phi3 Mini
- Fine-Tuned For: PII detection
- Framework: Hugging Face Transformers
Detected PII Entities
The model is capable of detecting the following PII entities:
Personal Information:
firstname
middlename
lastname
sex
dob (Date of Birth)
age
gender
height
eyecolor
Contact Information:
email
phonenumber
url
username
useragent
Address Information:
street
city
state
county
zipcode
country
secondaryaddress
buildingnumber
ordinaldirection
Geographical Information:
nearbygpscoordinate
Organizational Information:
companyname
jobtitle
jobarea
jobtype
Financial Information:
accountname
accountnumber
creditcardnumber
creditcardcvv
creditcardissuer
iban
bic
currency
currencyname
currencysymbol
currencycode
amount
Unique Identifiers:
pin
ssn
imei (Phone IMEI)
mac (MAC Address)
vehiclevin (Vehicle VIN)
vehiclevrm (Vehicle VRM)
Cryptocurrency Information:
bitcoinaddress
litecoinaddress
ethereumaddress
Other Information:
ip (IP Address)
ipv4
ipv6
maskednumber
password
time
prefix
Prompt Format
### Instruction:
Identify and extract the following PII entities from the text, if present: companyname, pin, currencyname, email, phoneimei, litecoinaddress, currency, eyecolor, street, mac, state, time, vehiclevin, jobarea, date, bic, currencysymbol, currencycode, age, nearbygpscoordinate, amount, ssn, ethereumaddress, zipcode, buildingnumber, dob, firstname, middlename, ordinaldirection, jobtitle, bitcoinaddress, jobtype, phonenumber, height, password, ip, useragent, accountname, city, gender, secondaryaddress, iban, sex, prefix, ipv4, maskednumber, url, username, lastname, creditcardcvv, county, vehiclevrm, ipv6, creditcardissuer, accountnumber, creditcardnumber. Return the output in JSON format.
### Input:
Greetings, Mason! Let's celebrate another year of wellness on 14/01/1977. Don't miss the event at 176,Apt. 388.
### Output:
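The template above can be filled in programmatically. The following is a minimal sketch, not part of the original model card: `PII_ENTITIES` and `build_prompt` are hypothetical names, the entity list is copied from the instruction line above, and the trailing space after "### Output:" follows the inference example in the Usage section.

```python
# Entity names copied from the instruction line of the prompt format.
PII_ENTITIES = [name.strip() for name in """
companyname, pin, currencyname, email, phoneimei, litecoinaddress, currency,
eyecolor, street, mac, state, time, vehiclevin, jobarea, date, bic,
currencysymbol, currencycode, age, nearbygpscoordinate, amount, ssn,
ethereumaddress, zipcode, buildingnumber, dob, firstname, middlename,
ordinaldirection, jobtitle, bitcoinaddress, jobtype, phonenumber, height,
password, ip, useragent, accountname, city, gender, secondaryaddress, iban,
sex, prefix, ipv4, maskednumber, url, username, lastname, creditcardcvv,
county, vehiclevrm, ipv6, creditcardissuer, accountnumber, creditcardnumber
""".split(",")]


def build_prompt(input_text, entities=PII_ENTITIES):
    """Hypothetical helper: fill the documented prompt template."""
    entity_list = ", ".join(entities)
    return (
        "### Instruction:\n"
        "Identify and extract the following PII entities from the text, "
        f"if present: {entity_list}. Return the output in JSON format.\n"
        "### Input:\n"
        f"{input_text}\n"
        "### Output: "
    )
```

Passing a shorter `entities` list is possible with this helper, but whether the model behaves well on an instruction it was not fine-tuned on is an open question, so the full list is the safer default.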
Usage
Installation
To use this model, you'll need the transformers library installed, along with PyTorch, since the inference example below works with PyTorch tensors:
pip install transformers torch
Run Inference
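import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
# Use a GPU when one is available
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load the tokenizer and model (generate() needs a causal LM class)
tokenizer = AutoTokenizer.from_pretrained("ab-ai/PII-Model-Phi3-Mini")
model = AutoModelForCausalLM.from_pretrained("ab-ai/PII-Model-Phi3-Mini").to(device)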
input_text = "Hi Abner, just a reminder that your next primary care appointment is on 23/03/1926. Please confirm by replying to this email Nathen15@hotmail.com."
model_prompt = f"""### Instruction:
Identify and extract the following PII entities from the text, if present: companyname, pin, currencyname, email, phoneimei, litecoinaddress, currency, eyecolor, street, mac, state, time, vehiclevin, jobarea, date, bic, currencysymbol, currencycode, age, nearbygpscoordinate, amount, ssn, ethereumaddress, zipcode, buildingnumber, dob, firstname, middlename, ordinaldirection, jobtitle, bitcoinaddress, jobtype, phonenumber, height, password, ip, useragent, accountname, city, gender, secondaryaddress, iban, sex, prefix, ipv4, maskednumber, url, username, lastname, creditcardcvv, county, vehiclevrm, ipv6, creditcardissuer, accountnumber, creditcardnumber. Return the output in JSON format.
### Input:
{input_text}
### Output: """
inputs = tokenizer(model_prompt, return_tensors="pt").to(device)
# adjust max_new_tokens according to your need
outputs = model.generate(**inputs, do_sample=True, max_new_tokens=120)
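# Decode only the newly generated tokens so the echoed prompt is not printed
prompt_length = inputs["input_ids"].shape[1]
response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
# Sampling is enabled, so the exact output can vary between runs
print(response)  # e.g. {'middlename': ['Abner'], 'dob': ['23/03/1926'], 'email': ['Nathen15@hotmail.com']}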
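Since the model is intended for tasks such as data redaction, the printed response can be turned into a dictionary and applied back to the input text. The following is a minimal sketch under stated assumptions: `parse_entities` and `redact` are hypothetical helpers, not part of the model or of transformers, and the response is assumed to contain one dictionary mapping entity names to lists of strings, as in the example output above. Because that example uses single quotes, the sketch falls back to `ast.literal_eval` when strict JSON parsing fails.

```python
import ast
import json


def parse_entities(response):
    """Hypothetical helper: extract the entity dict from the model response."""
    start, end = response.find("{"), response.rfind("}")
    if start == -1 or end <= start:
        return {}
    snippet = response[start:end + 1]
    try:
        parsed = json.loads(snippet)
    except json.JSONDecodeError:
        try:
            # The example output is a Python-style dict with single quotes
            parsed = ast.literal_eval(snippet)
        except (ValueError, SyntaxError):
            return {}
    return parsed if isinstance(parsed, dict) else {}


def redact(text, entities):
    """Hypothetical helper: replace each detected value with a [LABEL] tag."""
    pairs = [
        (value, label)
        for label, values in entities.items()
        if isinstance(values, list)
        for value in values
        if isinstance(value, str) and value
    ]
    # Replace longer values first so a short value inside a longer one
    # does not break the longer match
    for value, label in sorted(pairs, key=lambda p: len(p[0]), reverse=True):
        text = text.replace(value, f"[{label.upper()}]")
    return text
```

With the variables from the inference example, usage would look like `redact(input_text, parse_entities(response))`. Plain string replacement redacts every occurrence of a detected value, which is usually the desired behavior for redaction but may over-match very short values.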