metrics:
- recall
base_model:
- microsoft/mdeberta-v3-base
---
# Model Card for ai4privacy-mdeberta-v3-base-general-preprocessed

This model detects PII (Personally Identifiable Information). It was trained by "The Last Ones" team at the [NeuralWave](https://neuralwave.ch/#/) hackathon.

## Model Details

This model was fine-tuned from microsoft/mdeberta-v3-base on the ai4privacy/pii-masking-400k dataset.

We used the following hyperparameters for fine-tuning (see the configuration sketch after this list):
- learning_rate=3e-5
- per_device_train_batch_size=58
- per_device_eval_batch_size=58
- num_train_epochs=3
- weight_decay=0.01
- bf16=True
- seed=42

All other `TrainingArguments` hyperparameters were left at their defaults.
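
A minimal sketch of that configuration with the Hugging Face `Trainer` API (the `output_dir` value is an illustrative placeholder, not from the original run):

```python
from transformers import TrainingArguments

# Hyperparameters listed above; everything else stays at library defaults.
training_args = TrainingArguments(
    output_dir="mdeberta-v3-base-pii",  # assumed placeholder path
    learning_rate=3e-5,
    per_device_train_batch_size=58,
    per_device_eval_batch_size=58,
    num_train_epochs=3,
    weight_decay=0.01,
    bf16=True,
    seed=42,
)
```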

## Training Data

[ai4privacy/pii-masking-400k](https://huggingface.co/datasets/ai4privacy/pii-masking-400k)

## Preprocessing

```python
import re

# `label_set` (defined elsewhere) is the set of PII label names used in the
# dataset's privacy masks.
def generate_sequence_labels(text, privacy_mask):
    # sort spans by start position, last span first, so replacements
    # do not shift the offsets of spans that come before them
    privacy_mask = sorted(privacy_mask, key=lambda x: x["start"], reverse=True)

    # replace each sensitive span with one placeholder label per word
    for item in privacy_mask:
        label = item["label"]
        start = item["start"]
        end = item["end"]
        value = item["value"]
        # count the number of words in the masked value
        word_count = len(value.split())
        # replace the span with `word_count` copies of the label
        replacement = " ".join([label] * word_count)
        text = text[:start] + replacement + text[end:]

    # assign a label to each whitespace-separated word
    words = text.split()
    labels = []
    for word in words:
        match = re.search(r"(\w+)", word)  # strip surrounding punctuation
        if match and match.group(1) in label_set:
            labels.append(match.group(1))
        else:
            # any other word is labeled as "O"
            labels.append("O")
    return labels
```
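
A toy illustration of the expected `privacy_mask` format (labels and offsets here are made up, not actual dataset labels):

```python
label_set = {"NAME", "CITY"}  # hypothetical label names

text = "Contact Jane Doe at Zurich."
privacy_mask = [
    {"label": "NAME", "start": 8, "end": 16, "value": "Jane Doe"},
    {"label": "CITY", "start": 20, "end": 26, "value": "Zurich"},
]
print(generate_sequence_labels(text, privacy_mask))
# ['O', 'NAME', 'NAME', 'O', 'CITY']
```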

```python
# `tokenizer` and `label2id` are defined elsewhere in the training script;
# `k` counts examples whose labels could not be aligned.
k = 0

def tokenize_and_align_labels(examples):
    words = [t.split() for t in examples["source_text"]]
    tokenized_inputs = tokenizer(
        words, truncation=True, is_split_into_words=True, max_length=512
    )
    source_labels = [
        generate_sequence_labels(text, mask)
        for text, mask in zip(examples["source_text"], examples["privacy_mask"])
    ]

    labels = []
    for i, label in enumerate(source_labels):
        word_ids = tokenized_inputs.word_ids(batch_index=i)  # map tokens to their words
        previous_label = None
        label_ids = [-100]  # special tokens are ignored by the loss
        try:
            for word_idx in word_ids:
                if word_idx is None:
                    continue
                elif label[word_idx] == "O":
                    label_ids.append(label2id["O"])
                    continue
                elif previous_label == label[word_idx]:
                    # continuation of the current entity
                    label_ids.append(label2id[f"I-{label[word_idx]}"])
                else:
                    # first word of a new entity
                    label_ids.append(label2id[f"B-{label[word_idx]}"])
                previous_label = label[word_idx]
            # truncate to the model's max length and mask the final special token
            label_ids = label_ids[:511] + [-100]
            labels.append(label_ids)
        except IndexError:
            # word-level labels and tokenization disagree; mask the whole example
            global k
            k += 1
            labels.append([-100] * len(tokenized_inputs["input_ids"][i]))

    tokenized_inputs["labels"] = labels
    return tokenized_inputs
```
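
Both functions rely on a `label2id` mapping with `"O"` plus `B-`/`I-` variants of each label. The original mapping is not shown; a plausible construction would be:

```python
# Assumed construction of the BIO tag vocabulary from the raw label names.
bio_labels = ["O"] + [f"{p}-{l}" for l in sorted(label_set) for p in ("B", "I")]
label2id = {l: i for i, l in enumerate(bio_labels)}
id2label = {i: l for l, i in label2id.items()}
```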
We use these two functions to generate source-text-level labels and then align tokens with token-level labels, so that any model and tokenizer can be trained on [ai4privacy/pii-masking-400k](https://huggingface.co/datasets/ai4privacy/pii-masking-400k), as in the usage sketch below.
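
A usage sketch with `datasets.map` (the `batched` call and column handling are assumptions, not taken from the original script):

```python
from datasets import load_dataset

dataset = load_dataset("ai4privacy/pii-masking-400k")
# Tokenize and align labels for every split; drop the original columns so
# only input_ids, attention_mask, and labels remain.
tokenized_dataset = dataset.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=dataset["train"].column_names,
)
```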

## Evaluation

![image/png](https://cdn-uploads.huggingface.co/production/uploads/671e31b377035878c5f4082a/kzlMRqXBz80y63CmqDWDx.png)

The table above reports evaluation results on the validation set; this model corresponds to model 2.

## Disclaimer of Non-Affiliation

The publisher of this repository is not affiliated with Ai4Privacy or Ai Suisse SA.

@NeuralWave 2024 - *The Last Ones* Team.