UlrikKoren committed on
Commit
e52fc98
1 Parent(s): 269937b

Create README.md

  1. README.md +98 -0
README.md ADDED
@@ -0,0 +1,98 @@
1
+
2
+ # PIIMask-EN Model
3
+
4
+ The PIIMask-EN model is a specialized language model fine-tuned for Personally Identifiable Information (PII) redaction. It is based on the "google/gemma-1.1-2b-it" model and is trained to identify and redact various types of PII in text while preserving the grammatical structure of each sentence.
5
+
6
+ ## Model Description
7
+
8
+ - **Model Name:** PIIMask-EN
9
+ - **Base Model:** [google/gemma-1.1-2b-it](https://huggingface.co/google/gemma-1.1-2b-it)
10
+ - **Fine-tuning Dataset:** [ai4privacy/pii-masking-65k](https://huggingface.co/datasets/ai4privacy/pii-masking-65k) (specifically `english_balanced_10k.jsonl` subset)
11
+ - **Quantization:** 4-bit quantization using NF4 with double quantization and float16 compute dtype.
12
+ - **Training Steps:** Checkpoints are available after 1, 2, 3, and 4 epochs of training.
13
+
14
+ ## Methodology
15
+
16
+ The PIIMask-EN model was fine-tuned on the ai4privacy/pii-masking-65k dataset, whose text entries are annotated with different types of PII. Training ran for four epochs, with a checkpoint saved after each, to improve the model's ability to redact PII accurately. The 4-bit NF4 quantization configuration reduces the model's memory footprint, which makes it more efficient to deploy.
17
+
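To give a rough sense of what the 4-bit NF4 setting with double quantization saves, the sketch below estimates storage cost per parameter. The block sizes (64 weights per block, 256 blocks per second-level group) and the 2.5 billion parameter count are assumptions chosen for illustration, not values stated in this README, and layers that are not quantized are ignored.

```python
def nf4_bits_per_param(block_size=64, double_quant=True, second_level_block=256):
    """Approximate bits stored per parameter for 4-bit block-wise quantization.

    block_size and second_level_block are assumed values, not read from the model.
    """
    weight_bits = 4.0
    if double_quant:
        # One 8-bit quantized scale per block, plus one 32-bit constant
        # shared by each group of `second_level_block` blocks.
        overhead = 8.0 / block_size + 32.0 / (block_size * second_level_block)
    else:
        # One 32-bit scale per block.
        overhead = 32.0 / block_size
    return weight_bits + overhead


def weight_memory_gib(num_params, bits_per_param):
    """Convert a parameter count and per-parameter cost into GiB."""
    return num_params * bits_per_param / 8 / 2**30


ASSUMED_NUM_PARAMS = 2.5e9  # hypothetical round figure for a "2b" model

for label, bits in [
    ("float16", 16.0),
    ("nf4", nf4_bits_per_param(double_quant=False)),
    ("nf4 + double quant", nf4_bits_per_param(double_quant=True)),
]:
    print(f"{label}: {weight_memory_gib(ASSUMED_NUM_PARAMS, bits):.2f} GiB")
```

This is only a back-of-the-envelope estimate of weight storage; activations, the KV cache, and any adapter weights add to the real footprint.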
18
+ ## Usage
19
+
20
+ ### Installation
21
+
22
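+ To use the PIIMask-EN model, install `transformers`, `peft`, `bitsandbytes`, `accelerate`, and `torch`. The example below also requests FlashAttention 2, which additionally requires the `flash-attn` package and a supported GPU. You can install the core libraries using pip:
23
+
24
+ ```bash
25
+ pip install transformers peft bitsandbytes accelerate torch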
26
+ ```
27
+
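Before running the loading example, it can help to confirm that the required packages are importable. The following is a minimal sketch; the module list is an assumption inferred from what the example imports and from the quantization and adapter options it passes, and `missing_modules` is a hypothetical helper name.

```python
import importlib.util

# Assumed module list, inferred from the loading example in this README.
REQUIRED_MODULES = ["torch", "transformers", "peft", "bitsandbytes", "accelerate"]


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

The check only verifies that the modules can be located; it does not validate versions or GPU support.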
28
+ ### Code Example
29
+
30
+ Here is a code example to load and use the PIIMask-EN model for PII redaction:
31
+
32
+ ```python
33
+ from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, BitsAndBytesConfig
34
+ import torch
35
+
36
+ # Quantization configuration
37
+ bnb_config = BitsAndBytesConfig(
38
+     load_in_4bit=True,
39
+     bnb_4bit_quant_type="nf4",
40
+     bnb_4bit_compute_dtype=torch.float16,
41
+     bnb_4bit_use_double_quant=True,
42
+ )
43
+
44
+ # System instructions for PII redaction
45
+ system_instructions = """Replace the following types of personal information in the text below with '[REDACTED]': [DATE_x], [MASKEDNUMBER_x], [STREETADDRESS_x]. Ensure that each type of information is replaced in a way that maintains the grammatical structure of the sentence. You should only return the new text with the relevant replacements made, without the original text or any additional annotations.
46
+ Input:"""
47
+
48
+ example_prompt = "My name is Clara and I live in Berkeley, California."
49
+
50
+ # Load model function
51
+ def load_model(repo, step):
52
+     model = AutoModelForCausalLM.from_pretrained(repo,
53
+         device_map="cuda:0",
54
+         trust_remote_code=True,
55
+         quantization_config=bnb_config,
56
+         adapter_kwargs={"subfolder": f"checkpoint-{step}"},
57
+         attn_implementation="flash_attention_2")
58
+     return model
59
+
60
+ # Initialize tokenizer and model
61
+ device = "cuda"
62
+ tokenizer = AutoTokenizer.from_pretrained("google/gemma-1.1-2b-it", use_fast=True)
63
+
64
+ # Apply chat template for input
65
+ chat = [
66
+     # Gemma's chat template does not accept a system role, so the instructions are prepended to the user turn
67
+     {"role": "user", "content": f"{system_instructions}\n{example_prompt}"},
68
+ ]
69
+
70
+ inputs = tokenizer.apply_chat_template(chat, tokenize=True, add_generation_prompt=True, return_dict=True, return_tensors="pt")
71
+ model = load_model("UlrikKoren/PIIMask-EN", step=579)  # checkpoint-579 is the epoch 1 checkpoint
72
+ outputs = model.generate(input_ids=inputs['input_ids'].to(device), max_new_tokens=2048)
73
+ decoded_outputs = [tokenizer.decode(output, skip_special_tokens=False) for output in outputs]
74
+ print(decoded_outputs[0])
75
+ ```
76
+
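Because the example decodes with `skip_special_tokens=False`, the printed string contains the prompt and Gemma's turn markers as well as the redacted text. The helper below is a minimal sketch of how the model's reply could be isolated, assuming the decoded text uses the `<start_of_turn>` and `<end_of_turn>` markers; `extract_model_reply` is a hypothetical name, and the sample string is hand-written for illustration rather than real model output.

```python
START_OF_MODEL_TURN = "<start_of_turn>model\n"
END_OF_TURN = "<end_of_turn>"


def extract_model_reply(decoded: str) -> str:
    """Return only the text of the last model turn in a decoded Gemma chat string."""
    _, marker, reply = decoded.rpartition(START_OF_MODEL_TURN)
    if not marker:
        # No model turn marker found: fall back to the whole string.
        return decoded.strip()
    # Keep everything up to the end-of-turn marker, if one was generated.
    reply = reply.split(END_OF_TURN, 1)[0]
    # Drop a trailing end-of-sequence token when generation stopped without <end_of_turn>.
    reply = reply.replace("<eos>", "")
    return reply.strip()


# Hand-written sample for illustration only.
sample = (
    "<bos><start_of_turn>user\nMy name is Clara.<end_of_turn>\n"
    "<start_of_turn>model\nMy name is [REDACTED].<end_of_turn><eos>"
)
print(extract_model_reply(sample))
```

In the example above, this would be applied as `extract_model_reply(decoded_outputs[0])`.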
77
+ ### Checkpoints
78
+
79
+ The model checkpoints for different training epochs can be accessed as follows:
80
+ - **Epoch 1:** `UlrikKoren/PIIMask-EN/checkpoint-579`
81
+ - **Epoch 2:** `UlrikKoren/PIIMask-EN/checkpoint-1159`
82
+ - **Epoch 3:** `UlrikKoren/PIIMask-EN/checkpoint-1739`
83
+ - **Epoch 4:** `UlrikKoren/PIIMask-EN/checkpoint-2316`
84
+
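Since the checkpoints are named by training step rather than by epoch, a small lookup can make checkpoint selection less error-prone. This is a sketch using the step numbers listed above; `checkpoint_subfolder` is a hypothetical helper, not part of the repository.

```python
# Epoch -> training step, taken from the checkpoint list in this README.
CHECKPOINT_STEPS = {1: 579, 2: 1159, 3: 1739, 4: 2316}


def checkpoint_subfolder(epoch: int) -> str:
    """Map an epoch number to the matching checkpoint subfolder name."""
    try:
        step = CHECKPOINT_STEPS[epoch]
    except KeyError:
        valid = ", ".join(str(e) for e in sorted(CHECKPOINT_STEPS))
        raise ValueError(f"No checkpoint for epoch {epoch}; choose one of: {valid}") from None
    return f"checkpoint-{step}"


print(checkpoint_subfolder(2))
```

The returned name can be passed as the `subfolder` entry of `adapter_kwargs` when loading the model.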
85
+
86
+ ## Compliance with Gemma Terms of Use
87
+
88
+ This model is a derivative of the "google/gemma-1.1-2b-it" model and complies with the Gemma Terms of Use:
89
+
90
+ - **Distribution:** Any distribution of this model or its derivatives must include the use restrictions specified in the Gemma Terms of Use and provide notice to subsequent users.
91
+ - **Notices:** The model is distributed with the following notice: “Gemma is provided under and subject to the Gemma Terms of Use found at ai.google.dev/gemma/terms”.
92
+ - **Modifications:** Any modified files carry prominent notices stating the modifications made.
93
+ - **Prohibited Uses:** The use of this model is subject to the restrictions outlined in the Gemma Prohibited Use Policy.
94
+ - **Trademarks:** This distribution does not grant any rights to use Google’s trademarks, trade names, or logos.
95
+
96
+ ## License
97
+
98
+ The PIIMask-EN model is distributed under the same terms as the base model. For more details, please refer to the [Gemma Terms of Use](https://ai.google.dev/gemma/terms).