---
license: cc-by-nc-4.0
datasets:
- ai4privacy/pii-masking-400k
metrics:
- accuracy
- f1
- precision
- recall
base_model:
- microsoft/mdeberta-v3-base
---
# Model Card for ai4privacy-mdeberta-v3-base-general-preprocessed
This model aims to detect PII (Personally Identifiable Information). It was trained by "The Last Ones" team at the [NeuralWave](https://neuralwave.ch/#/) Hackathon.
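A minimal usage sketch with the `transformers` token-classification pipeline is shown below; the repo id is a placeholder, since the exact hosted path is not specified in this card.
```python
from transformers import pipeline

# "<user>/ai4privacy-mdeberta-v3-base-general-preprocessed" is a placeholder repo id.
pii_detector = pipeline(
    "token-classification",
    model="<user>/ai4privacy-mdeberta-v3-base-general-preprocessed",
    aggregation_strategy="simple",
)
print(pii_detector("My name is John Smith and my email is john@example.com."))
```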
## Model Details
This model was fine-tuned from microsoft/mdeberta-v3-base on the ai4privacy/pii-masking-400k dataset.
We used the following `TrainingArguments` values for fine-tuning:
- learning_rate=3e-5,
- per_device_train_batch_size=58,
- per_device_eval_batch_size=58,
- num_train_epochs=3,
- weight_decay=0.01,
- bf16=True,
- seed=42
and left all other `TrainingArguments` hyperparameters at their defaults (see the sketch below).
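A minimal sketch of this configuration with the Hugging Face `Trainer` API; the `output_dir` name is illustrative.
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="ai4privacy-mdeberta-v3-base-general-preprocessed",  # illustrative name
    learning_rate=3e-5,
    per_device_train_batch_size=58,
    per_device_eval_batch_size=58,
    num_train_epochs=3,
    weight_decay=0.01,
    bf16=True,
    seed=42,
)
```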
## Training Data
[ai4privacy/pii-masking-400k](https://huggingface.co/datasets/ai4privacy/pii-masking-400k)
## Preprocessing
```python
import re

# `label_set` is assumed to be the set of PII entity labels used in the dataset's privacy_mask.
def generate_sequence_labels(text, privacy_mask):
    # sort the privacy mask by start position, from the end of the text backwards,
    # so earlier offsets stay valid while spans are replaced
    privacy_mask = sorted(privacy_mask, key=lambda x: x["start"], reverse=True)

    # replace each sensitive span with one placeholder label per word
    for item in privacy_mask:
        label = item["label"]
        start = item["start"]
        end = item["end"]
        value = item["value"]
        # count the number of words in the value
        word_count = len(value.split())
        # replace the sensitive information with the appropriate number of [label] placeholders
        replacement = " ".join([f"{label}" for _ in range(word_count)])
        text = text[:start] + replacement + text[end:]

    words = text.split()
    # assign a label to each word
    labels = []
    for word in words:
        match = re.search(r"(\w+)", word)  # match any word character
        if match:
            label = match.group(1)
            if label in label_set:
                labels.append(label)
            else:
                # any other word is labeled as "O"
                labels.append("O")
        else:
            labels.append("O")
    return labels
```
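A small illustrative call, assuming `label_set` contains the dataset's entity labels (the labels below are a hypothetical subset, not the full ai4privacy label inventory):
```python
label_set = {"GIVENNAME", "SURNAME", "EMAIL"}  # hypothetical subset for illustration

text = "Contact John Smith at john@example.com"
mask = [
    {"label": "GIVENNAME", "start": 8, "end": 12, "value": "John"},
    {"label": "SURNAME", "start": 13, "end": 18, "value": "Smith"},
    {"label": "EMAIL", "start": 22, "end": 38, "value": "john@example.com"},
]

print(generate_sequence_labels(text, mask))
# ['O', 'GIVENNAME', 'SURNAME', 'O', 'EMAIL']
```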
```python
# `tokenizer` and `label2id` are assumed to be defined (the mDeBERTa tokenizer and the
# mapping from BIO label strings to ids). `k` counts examples whose labels could not be
# aligned and are therefore masked out entirely.
k = 0

def tokenize_and_align_labels(examples):
    words = [t.split() for t in examples["source_text"]]
    tokenized_inputs = tokenizer(words, truncation=True, is_split_into_words=True, max_length=512)
    source_labels = [
        generate_sequence_labels(text, mask)
        for text, mask in zip(examples["source_text"], examples["privacy_mask"])
    ]

    labels = []
    for i, label in enumerate(source_labels):
        word_ids = tokenized_inputs.word_ids(batch_index=i)  # map tokens to their respective word
        previous_label = None
        label_ids = [-100]  # the leading special token is ignored by the loss
        try:
            for word_idx in word_ids:
                if word_idx is None:
                    continue
                elif label[word_idx] == "O":
                    label_ids.append(label2id["O"])
                    continue
                elif previous_label == label[word_idx]:
                    # same entity as the previous word: inside tag
                    label_ids.append(label2id[f"I-{label[word_idx]}"])
                else:
                    # new entity: beginning tag
                    label_ids.append(label2id[f"B-{label[word_idx]}"])
                previous_label = label[word_idx]
            # truncate to the model's maximum length and mask the final special token
            label_ids = label_ids[:511] + [-100]
            labels.append(label_ids)
        except Exception:
            # alignment failed for this example: mask every token with -100
            global k
            k += 1
            labels.append([-100] * len(tokenized_inputs["input_ids"][i]))

    tokenized_inputs["labels"] = labels
    return tokenized_inputs
```
We use these two functions to generate source-text-level labels and then align the tokens with token-level labels, so that any model and tokenizer can be trained on
[ai4privacy/pii-masking-400k](https://huggingface.co/datasets/ai4privacy/pii-masking-400k). A sketch of how the preprocessing can be applied is shown below.
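For example, the preprocessing could be applied with `datasets.map` roughly as follows (a sketch, assuming `tokenizer`, `label2id`, and `label_set` are defined as above):
```python
from datasets import load_dataset

dataset = load_dataset("ai4privacy/pii-masking-400k")
tokenized_dataset = dataset.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=dataset["train"].column_names,
)
```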
## Evaluation
![image/png](https://cdn-uploads.huggingface.co/production/uploads/671e31b377035878c5f4082a/kzlMRqXBz80y63CmqDWDx.png)
Evaluation results of this model (model 2) on the validation set are shown in the figure above.
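The exact evaluation script is not included in this card; a hedged sketch of how token-classification precision, recall, F1, and accuracy could be computed with `seqeval` (assuming an `id2label` mapping from ids back to BIO label strings) is:
```python
import evaluate

seqeval = evaluate.load("seqeval")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = predictions.argmax(axis=-1)
    # drop the -100 positions used for special tokens before scoring
    true_labels = [
        [id2label[l] for l in label if l != -100] for label in labels
    ]
    true_predictions = [
        [id2label[p] for p, l in zip(pred, label) if l != -100]
        for pred, label in zip(predictions, labels)
    ]
    results = seqeval.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
```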
## Disclaimer of Non-Affiliation
The publisher of this repository is not affiliated with Ai4Privacy or Ai Suisse SA.
@NeuralWave 2024 - *The Last Ones* Team.