🌟 Buying me coffee is a direct way to show support for this project.

distilbert_finetuned_ai4privacy_v2

This model is a fine-tuned version of distilbert-base-uncased on the English subset of the ai4privacy/pii-masking-200k dataset.

Usage

GitHub Implementation: Ai4Privacy
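A minimal usage sketch, assuming the model loads from the Hugging Face Hub under the repository name Isotonic/distilbert_finetuned_ai4privacy_v2 with the standard transformers token-classification pipeline. The helper names `load_pii_pipeline`, `mask_pii` and `redact` are illustrative and not part of any released API.

```python
from typing import Dict, List, Optional


def load_pii_pipeline(model_id: str = "Isotonic/distilbert_finetuned_ai4privacy_v2"):
    """Build a token-classification pipeline (downloads the model on first use)."""
    # Imported lazily so mask_pii can be used without transformers installed.
    from transformers import pipeline

    # aggregation_strategy="simple" merges word pieces into entity spans that
    # carry "entity_group", "start" and "end" keys.
    return pipeline(
        "token-classification", model=model_id, aggregation_strategy="simple"
    )


def mask_pii(text: str, entities: List[Dict]) -> str:
    """Replace each detected character span with a [LABEL] placeholder."""
    # Work right to left so earlier character offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[: ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"]:]
    return text


def redact(text: str, pii: Optional[object] = None) -> str:
    """Detect PII with the model and return the masked text."""
    pii = pii or load_pii_pipeline()
    return mask_pii(text, pii(text))
```

Calling `redact("My name is Sarah and my email is sarah@example.com.")` downloads the model on first use. The placeholders that appear depend on the labels the model predicts.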

Model description

This model has been fine-tuned on the world's largest open-source privacy dataset.

The model is intended to remove personally identifiable information (PII) from text, especially in the context of AI assistants and LLMs.

The example texts cover 54 PII classes (types of sensitive data) and 229 discussion subjects or use cases across business, education, psychology and legal fields, written in 5 interaction styles (e.g. casual conversation, formal document, email).

See the GitHub implementation for further details on the research.

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-05
train_batch_size: 8
eval_batch_size: 8
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: cosine_with_restarts
lr_scheduler_warmup_ratio: 0.2
num_epochs: 5
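To make the scheduler settings concrete, here is a sketch of the learning rate over training: linear warmup followed by cosine decay with hard restarts. It assumes 5440 total optimizer steps (the final step count in the training results table) and a single cycle, which is the default in transformers' `get_cosine_with_hard_restarts_schedule_with_warmup`. The card does not state the number of cycles, and `lr_at` is a hypothetical helper.

```python
import math

BASE_LR = 5e-05      # learning_rate from the list above
WARMUP_RATIO = 0.2   # lr_scheduler_warmup_ratio from the list above
TOTAL_STEPS = 5440   # assumption: final step count in the training results table
NUM_CYCLES = 1       # assumption: not stated in the card


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step under the assumptions above."""
    warmup_steps = math.ceil(TOTAL_STEPS * WARMUP_RATIO)
    if step < warmup_steps:
        # Linear ramp from 0 up to the base learning rate.
        return BASE_LR * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, TOTAL_STEPS - warmup_steps)
    if progress >= 1.0:
        return 0.0
    # Cosine decay that restarts at the start of each cycle.
    cycle_pos = (NUM_CYCLES * progress) % 1.0
    return BASE_LR * max(0.0, 0.5 * (1.0 + math.cos(math.pi * cycle_pos)))
```

Under these assumptions the warmup spans the first 1088 steps, which is exactly the first epoch in the training results table. With a single cycle the curve is identical to a plain cosine schedule.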

Class-wise metrics

The model achieves the following results on the evaluation set:

Loss: 0.0451
Overall Precision: 0.9438
Overall Recall: 0.9663
Overall F1: 0.9549
Overall Accuracy: 0.9838
Accountname F1: 0.9946
Accountnumber F1: 0.9940
Age F1: 0.9624
Amount F1: 0.9643
Bic F1: 0.9929
Bitcoinaddress F1: 0.9948
Buildingnumber F1: 0.9845
City F1: 0.9955
Companyname F1: 0.9962
County F1: 0.9877
Creditcardcvv F1: 0.9643
Creditcardissuer F1: 0.9953
Creditcardnumber F1: 0.9793
Currency F1: 0.7811
Currencycode F1: 0.8850
Currencyname F1: 0.2281
Currencysymbol F1: 0.9562
Date F1: 0.9061
Dob F1: 0.7914
Email F1: 1.0
Ethereumaddress F1: 1.0
Eyecolor F1: 0.9837
Firstname F1: 0.9846
Gender F1: 0.9971
Height F1: 0.9910
Iban F1: 0.9906
Ip F1: 0.4349
Ipv4 F1: 0.8126
Ipv6 F1: 0.7679
Jobarea F1: 0.9880
Jobtitle F1: 0.9991
Jobtype F1: 0.9777
Lastname F1: 0.9684
Litecoinaddress F1: 0.9721
Mac F1: 1.0
Maskednumber F1: 0.9635
Middlename F1: 0.9330
Nearbygpscoordinate F1: 1.0
Ordinaldirection F1: 0.9910
Password F1: 1.0
Phoneimei F1: 0.9918
Phonenumber F1: 0.9962
Pin F1: 0.9477
Prefix F1: 0.9546
Secondaryaddress F1: 0.9892
Sex F1: 0.9876
Ssn F1: 0.9976
State F1: 0.9893
Street F1: 0.9873
Time F1: 0.9889
Url F1: 1.0
Useragent F1: 0.9953
Username F1: 0.9975
Vehiclevin F1: 1.0
Vehiclevrm F1: 1.0
Zipcode F1: 0.9873
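Because the per-class scores vary widely, it can help to check them programmatically. The sketch below recomputes the overall F1 from precision and recall and lists classes whose F1 falls below a threshold. The 0.90 cutoff and the helper names are illustrative choices, and only a subset of the lines above is copied in for brevity.

```python
# A subset of the metric lines above, copied verbatim.
REPORT = """\
Overall Precision: 0.9438
Overall Recall: 0.9663
Currency F1: 0.7811
Currencycode F1: 0.8850
Currencyname F1: 0.2281
Date F1: 0.9061
Dob F1: 0.7914
Email F1: 1.0
Ip F1: 0.4349
Ipv4 F1: 0.8126
Ipv6 F1: 0.7679
Zipcode F1: 0.9873
"""


def parse_report(report: str) -> dict:
    """Turn 'Name: value' lines into a {name: float} mapping."""
    metrics = {}
    for line in report.splitlines():
        name, _, value = line.partition(":")
        if value.strip():
            metrics[name.strip()] = float(value)
    return metrics


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def weak_classes(metrics: dict, threshold: float = 0.90) -> list:
    """Class names with F1 below the threshold, worst first."""
    scores = {
        name[: -len(" F1")]: value
        for name, value in metrics.items()
        if name.endswith(" F1") and not name.startswith("Overall")
    }
    return sorted((n for n, v in scores.items() if v < threshold), key=scores.get)
```

Here `f1_score(0.9438, 0.9663)` rounds to the reported 0.9549. Run against the full list, the same check flags seven classes below 0.90: Currencyname, Ip, Ipv6, Currency, Dob, Ipv4 and Currencycode, so redaction of currency, IP address and date-of-birth values is worth verifying separately.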

Training results

Training Loss	Epoch	Step	Validation Loss	Overall Precision	Overall Recall	Overall F1	Overall Accuracy	Accountname F1	Accountnumber F1	Age F1	Amount F1	Bic F1	Bitcoinaddress F1	Buildingnumber F1	City F1	Companyname F1	County F1	Creditcardcvv F1	Creditcardissuer F1	Creditcardnumber F1	Currency F1	Currencycode F1	Currencyname F1	Currencysymbol F1	Date F1	Dob F1	Email F1	Ethereumaddress F1	Eyecolor F1	Firstname F1	Gender F1	Height F1	Iban F1	Ip F1	Ipv4 F1	Ipv6 F1	Jobarea F1	Jobtitle F1	Jobtype F1	Lastname F1	Litecoinaddress F1	Mac F1	Maskednumber F1	Middlename F1	Nearbygpscoordinate F1	Ordinaldirection F1	Password F1	Phoneimei F1	Phonenumber F1	Pin F1	Prefix F1	Secondaryaddress F1	Sex F1	Ssn F1	State F1	Street F1	Time F1	Url F1	Useragent F1	Username F1	Vehiclevin F1	Vehiclevrm F1	Zipcode F1
0.6445	1.0	1088	0.3322	0.6449	0.7003	0.6714	0.8900	0.7607	0.8733	0.6576	0.1766	0.25	0.6783	0.3621	0.6005	0.6909	0.5586	0.0	0.2449	0.7095	0.2889	0.0	0.0	0.3902	0.7720	0.0	0.9862	0.8011	0.5088	0.7740	0.7118	0.5434	0.8088	0.0	0.8303	0.7562	0.5318	0.7294	0.4681	0.6779	0.0	0.8909	0.0	0.0107	0.9985	0.4000	0.7307	0.9057	0.8618	0.0	0.9127	0.8235	0.9211	0.8026	0.4656	0.6390	0.9383	0.9775	0.8868	0.8201	0.4526	0.0550	0.5368
0.222	2.0	2176	0.1259	0.8170	0.8747	0.8449	0.9478	0.9708	0.9813	0.7638	0.7427	0.7837	0.8908	0.8833	0.8747	0.9814	0.8749	0.7601	0.9777	0.8834	0.5372	0.4828	0.0056	0.7785	0.8149	0.3140	0.9956	0.9935	0.9101	0.9270	0.9450	0.9853	0.9253	0.0650	0.0084	0.7962	0.9013	0.9446	0.9203	0.8555	0.6885	1.0	0.7152	0.6442	1.0	0.9623	0.9349	0.9905	0.9782	0.7656	0.9324	0.9903	0.9736	0.9274	0.8520	0.9138	0.9678	0.9922	0.9893	0.9804	0.9646	0.8556	0.8385
0.1331	3.0	3264	0.0773	0.9133	0.9371	0.9250	0.9654	0.9822	0.9815	0.9196	0.8852	0.9718	0.9785	0.9215	0.9757	0.9935	0.9651	0.8742	0.9921	0.9438	0.7568	0.7710	0.0	0.8998	0.7895	0.6578	0.9994	1.0	0.9554	0.9525	0.9823	0.9910	0.9866	0.0435	0.8293	0.7824	0.9671	0.9794	0.9571	0.9447	0.9141	1.0	0.8825	0.7988	1.0	0.9797	0.9921	0.9932	0.9943	0.8726	0.9401	0.9860	0.9792	0.9928	0.9740	0.9604	0.9730	0.9983	0.9964	0.9959	0.9890	0.9774	0.9247
0.0847	4.0	4352	0.0503	0.9368	0.9614	0.9489	0.9789	0.9955	0.9949	0.9573	0.9480	0.9929	0.9846	0.9808	0.9927	0.9962	0.9811	0.9436	0.9953	0.9695	0.7826	0.8713	0.1653	0.9458	0.8782	0.7996	1.0	1.0	0.9809	0.9816	0.9941	0.9910	0.9906	0.3389	0.8364	0.7066	0.9862	1.0	0.9795	0.9637	0.9429	1.0	0.9438	0.9165	1.0	0.9864	1.0	0.9932	0.9962	0.9352	0.9483	0.9860	0.9866	0.9976	0.9884	0.9827	0.9881	1.0	0.9953	0.9975	0.9945	0.9915	0.9841
0.0557	5.0	5440	0.0451	0.9438	0.9663	0.9549	0.9838	0.9946	0.9940	0.9624	0.9643	0.9929	0.9948	0.9845	0.9955	0.9962	0.9877	0.9643	0.9953	0.9793	0.7811	0.8850	0.2281	0.9562	0.9061	0.7914	1.0	1.0	0.9837	0.9846	0.9971	0.9910	0.9906	0.4349	0.8126	0.7679	0.9880	0.9991	0.9777	0.9684	0.9721	1.0	0.9635	0.9330	1.0	0.9910	1.0	0.9918	0.9962	0.9477	0.9546	0.9892	0.9876	0.9976	0.9893	0.9873	0.9889	1.0	0.9953	0.9975	1.0	1.0	0.9873

Framework versions

Transformers 4.35.0
Pytorch 2.0.0
Datasets 2.1.0
Tokenizers 0.14.1
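To compare a local environment against these versions, a small sketch using the standard library's `importlib.metadata`. It assumes the usual PyPI distribution names for these libraries (`transformers`, `torch`, `datasets`, `tokenizers`). `version_mismatches` is a hypothetical helper.

```python
from importlib import metadata

# Versions listed above, keyed by assumed PyPI distribution name.
EXPECTED = {
    "transformers": "4.35.0",
    "torch": "2.0.0",
    "datasets": "2.1.0",
    "tokenizers": "0.14.1",
}


def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_mismatches(expected=EXPECTED, lookup=installed_version) -> dict:
    """Map each mismatching package to (expected, found)."""
    report = {}
    for package, wanted in expected.items():
        found = lookup(package)
        # torch builds often carry a local suffix such as "+cu117"; ignore it.
        if found is None or found.split("+")[0] != wanted:
            report[package] = (wanted, found)
    return report
```

An empty result from `version_mismatches()` means the environment matches the listed versions. Newer releases may also work, but the card only reports these.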
