bobocn
/

ai4privacy-mdeberta-v3-base-general-preprocessed

bobocn committed on Oct 27, 2024

Commit

152ce66

1 Parent(s): a140157

Update README.md

README.md +121 -1

@@ -9,4 +9,124 @@ metrics:
 - recall
 base_model:
 - microsoft/mdeberta-v3-base
----

+---
+# Model Card for ai4privacy-mdeberta-v3-base-general-preprocessed
+This model detects PII (Personally Identifiable Information). It was trained by the "The Last Ones" team at the [NeuralWave](https://neuralwave.ch/#/) hackathon.
+## Model Details
+This model was fine-tuned from microsoft/mdeberta-v3-base on the ai4privacy/pii-masking-400k dataset.
+We used the following training arguments for fine-tuning:
+- learning_rate=3e-5
+- per_device_train_batch_size=58
+- per_device_eval_batch_size=58
+- num_train_epochs=3
+- weight_decay=0.01
+- bf16=True
+- seed=42
+
+All other hyperparameters use the TrainingArguments defaults.
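The hyperparameters above can be collected into keyword arguments for `transformers.TrainingArguments`. The following is a minimal sketch rather than the team's actual training script: the helper name `build_training_arguments` and the `output_dir` path are hypothetical, and everything not listed above is left at its library default.

```python
# Keyword arguments mirroring the hyperparameters listed in this section.
training_kwargs = {
    "learning_rate": 3e-5,
    "per_device_train_batch_size": 58,
    "per_device_eval_batch_size": 58,
    "num_train_epochs": 3,
    "weight_decay": 0.01,
    "bf16": True,  # requires hardware with bfloat16 support
    "seed": 42,
}


def build_training_arguments(output_dir="./pii-mdeberta"):
    """Build TrainingArguments; `output_dir` is a hypothetical path."""
    # Imported lazily so the dictionary above can be inspected
    # without transformers installed.
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **training_kwargs)
```

Calling `build_training_arguments()` would return an object that can be passed to a `Trainer` together with the model, tokenizer, and tokenized dataset.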
+## Training Data
+[ai4privacy/pii-masking-400k](https://huggingface.co/datasets/ai4privacy/pii-masking-400k)
+## Preprocessing
+```python
+import re
+
+# `label_set` (the set of PII label names in the dataset) is assumed to be defined elsewhere.
+def generate_sequence_labels(text, privacy_mask):
+    # process spans from last to first so that earlier character offsets stay valid
+    privacy_mask = sorted(privacy_mask, key=lambda x: x['start'], reverse=True)
+    # replace each sensitive span with its label, repeated once per word in the span
+    for item in privacy_mask:
+        label = item['label']
+        start = item['start']
+        end = item['end']
+        value = item['value']
+        # count the number of words in the value
+        word_count = len(value.split())
+        replacement = " ".join([label] * word_count)
+        text = text[:start] + replacement + text[end:]
+    words = text.split()
+    # assign a label to each word
+    labels = []
+    for word in words:
+        match = re.search(r"(\w+)", word)  # first run of word characters, ignoring punctuation
+        if match and match.group(1) in label_set:
+            labels.append(match.group(1))
+        else:
+            # any other word is labeled as "O"
+            labels.append("O")
+    return labels
+```
+```python
+# `tokenizer` and `label2id` are assumed to be defined elsewhere.
+k = 0  # number of examples whose labels could not be aligned
+
+def tokenize_and_align_labels(examples):
+    global k
+    words = [t.split() for t in examples["source_text"]]
+    tokenized_inputs = tokenizer(words, truncation=True, is_split_into_words=True, max_length=512)
+    source_labels = [
+        generate_sequence_labels(text, mask)
+        for text, mask in zip(examples["source_text"], examples["privacy_mask"])
+    ]
+    labels = []
+    for i, label in enumerate(source_labels):
+        word_ids = tokenized_inputs.word_ids(batch_index=i)  # map tokens to their respective word.
+        previous_label = None
+        label_ids = [-100]  # the leading special token is ignored by the loss
+        try:
+            for word_idx in word_ids:
+                if word_idx is None:
+                    continue
+                elif label[word_idx] == "O":
+                    # previous_label is not reset here, so an entity that follows "O" words
+                    # and repeats the last entity's label is tagged I- rather than B-
+                    label_ids.append(label2id["O"])
+                    continue
+                elif previous_label == label[word_idx]:
+                    label_ids.append(label2id[f"I-{label[word_idx]}"])
+                else:
+                    label_ids.append(label2id[f"B-{label[word_idx]}"])
+                previous_label = label[word_idx]
+            # keep at most 511 labels, then ignore the trailing special token
+            label_ids = label_ids[:511] + [-100]
+            labels.append(label_ids)
+        except (IndexError, KeyError):
+            # alignment failed (e.g. the word count differs from the label count):
+            # mask the whole example so it does not contribute to the loss
+            k += 1
+            labels.append([-100] * len(tokenized_inputs["input_ids"][i]))
+    tokenized_inputs["labels"] = labels
+    return tokenized_inputs
+```
+We use these two functions to generate word-level labels from the source text and then align them with the tokenizer's tokens, so that
+any model and tokenizer can be trained on [ai4privacy/pii-masking-400k](https://huggingface.co/datasets/ai4privacy/pii-masking-400k).
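The two functions above rely on `label_set` and `label2id`, which the snippets do not define. The sketch below shows one way they might be built for a B-/I-/O tagging scheme, and how word-level labels expand into tags. It is an illustration under assumptions: the helper names `build_label_maps` and `to_bio` and the two entity types are hypothetical, the real label list would come from the dataset, and `to_bio` works per word whereas `tokenize_and_align_labels` works per token.

```python
def build_label_maps(entity_types):
    """Return (label_set, label2id, id2label) for a BIO tagging scheme."""
    label_set = set(entity_types)
    names = ["O"]
    for entity in sorted(label_set):
        names.extend([f"B-{entity}", f"I-{entity}"])
    label2id = {name: idx for idx, name in enumerate(names)}
    id2label = {idx: name for name, idx in label2id.items()}
    return label_set, label2id, id2label


def to_bio(word_labels):
    """Expand word-level labels into B-/I- tags, one tag per word."""
    tags, previous = [], None
    for label in word_labels:
        if label == "O":
            tags.append("O")
            previous = None  # this sketch resets on "O"; the function above does not
        elif label == previous:
            tags.append(f"I-{label}")
        else:
            tags.append(f"B-{label}")
            previous = label
    return tags


# Hypothetical entity types, for illustration only.
label_set, label2id, id2label = build_label_maps(["GIVENNAME", "EMAIL"])
print(to_bio(["O", "GIVENNAME", "GIVENNAME", "O", "EMAIL"]))
```

The `id2label` mapping is the inverse that a token-classification model config typically needs to turn predicted ids back into tag names.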
+## Evaluation
+![image/png](https://cdn-uploads.huggingface.co/production/uploads/671e31b377035878c5f4082a/kzlMRqXBz80y63CmqDWDx.png)
+The table above shows some evaluation results for this model (model 2) on the validation set.
+## Disclaimer of Non-Affiliation
+The publisher of this repository is not affiliated with Ai4Privacy or Ai Suisse SA.
+@NeuralWave 2024 - *The Last Ones* Team.