---
license: cc-by-nc-4.0
datasets:
- ai4privacy/pii-masking-400k
metrics:
- accuracy
- f1
- precision
- recall
base_model:
- microsoft/mdeberta-v3-base
---
# Model Card for ai4privacy-mdeberta-v3-base-general-preprocessed

This model detects PII (Personally Identifiable Information) in text. It was trained by "The Last Ones" team at the [NeuralWave](https://neuralwave.ch/#/) Hackathon.
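
A minimal usage sketch (the repository id below is taken from this card's title and may need a namespace prefix; replace it with the actual model id):

```python
from transformers import pipeline

# Assumed repository id from this card's title; adjust to the real one.
pii_detector = pipeline(
    "token-classification",
    model="ai4privacy-mdeberta-v3-base-general-preprocessed",
    aggregation_strategy="simple",  # merge B-/I- subword predictions into entity spans
)

print(pii_detector("My name is John Smith and I live in Zurich."))
```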


## Model Details

This model was fine-tuned from microsoft/mdeberta-v3-base on the ai4privacy/pii-masking-400k dataset.

We used the following hyperparameters for fine-tuning:
- learning_rate=3e-5,
- per_device_train_batch_size=58,
- per_device_eval_batch_size=58,
- num_train_epochs=3,
- weight_decay=0.01,
- bf16=True,
- seed=42

keeping all other `TrainingArguments` hyperparameters at their defaults.
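
A sketch of the corresponding `Trainer` setup (the output directory, data-collator wiring, and split names are assumptions; only the listed hyperparameter values come from this card):

```python
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("microsoft/mdeberta-v3-base")
model = AutoModelForTokenClassification.from_pretrained(
    "microsoft/mdeberta-v3-base",
    num_labels=len(label2id),  # label2id / id2label are built from the dataset's PII labels
    id2label=id2label,
    label2id=label2id,
)

training_args = TrainingArguments(
    output_dir="mdeberta-pii",  # assumed; not stated in this card
    learning_rate=3e-5,
    per_device_train_batch_size=58,
    per_device_eval_batch_size=58,
    num_train_epochs=3,
    weight_decay=0.01,
    bf16=True,
    seed=42,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],       # produced by the preprocessing below
    eval_dataset=tokenized_dataset["validation"],   # assumed split name
    tokenizer=tokenizer,
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```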

## Training Data

[ai4privacy/pii-masking-400k](https://huggingface.co/datasets/ai4privacy/pii-masking-400k)
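
The dataset can be loaded with the `datasets` library:

```python
from datasets import load_dataset

dataset = load_dataset("ai4privacy/pii-masking-400k")
print(dataset)  # inspect the splits and columns (e.g. source_text, privacy_mask)
```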

## Preprocessing

```python
import re

# `label_set` (defined elsewhere) holds the PII label names used in `privacy_mask`.

def generate_sequence_labels(text, privacy_mask):
    # sort the privacy mask by start position, last span first, so earlier
    # replacements do not shift the offsets of spans still to be replaced
    privacy_mask = sorted(privacy_mask, key=lambda x: x["start"], reverse=True)

    # replace each sensitive span with one placeholder per word it contains
    for item in privacy_mask:
        label = item["label"]
        start = item["start"]
        end = item["end"]
        value = item["value"]
        # count the number of words in the masked value
        word_count = len(value.split())

        # replace the sensitive span with `word_count` copies of its label
        replacement = " ".join([label] * word_count)
        text = text[:start] + replacement + text[end:]

    words = text.split()
    # assign a label to each word
    labels = []
    for word in words:
        match = re.search(r"(\w+)", word)  # strip surrounding punctuation
        if match and match.group(1) in label_set:
            labels.append(match.group(1))
        else:
            # any non-placeholder word is labeled "O"
            labels.append("O")
    return labels
```
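
For example (the label names here are illustrative, not necessarily the dataset's exact label set):

```python
label_set = {"GIVENNAME", "CITY"}  # illustrative subset of the dataset's labels

text = "My name is John Smith and I live in Zurich."
privacy_mask = [
    {"label": "GIVENNAME", "start": 11, "end": 21, "value": "John Smith"},
    {"label": "CITY", "start": 36, "end": 42, "value": "Zurich"},
]

print(generate_sequence_labels(text, privacy_mask))
# ['O', 'O', 'O', 'GIVENNAME', 'GIVENNAME', 'O', 'O', 'O', 'O', 'CITY']
```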

```python
k = 0  # counts examples whose labels could not be aligned (see `except` below)

def tokenize_and_align_labels(examples):
    # tokenize the pre-split words so that word_ids() can map tokens back to words
    words = [t.split() for t in examples["source_text"]]
    tokenized_inputs = tokenizer(words, truncation=True, is_split_into_words=True, max_length=512)
    source_labels = [
        generate_sequence_labels(text, mask)
        for text, mask in zip(examples["source_text"], examples["privacy_mask"])
    ]

    labels = []
    for i, label in enumerate(source_labels):
        word_ids = tokenized_inputs.word_ids(batch_index=i)  # map tokens to their words
        previous_label = None
        label_ids = [-100]  # leading special token is ignored by the loss
        try:
            for word_idx in word_ids:
                if word_idx is None:
                    continue
                elif label[word_idx] == "O":
                    label_ids.append(label2id["O"])
                    continue
                elif previous_label == label[word_idx]:
                    # same entity continues: inside tag
                    label_ids.append(label2id[f"I-{label[word_idx]}"])
                else:
                    # new entity starts: beginning tag
                    label_ids.append(label2id[f"B-{label[word_idx]}"])
                previous_label = label[word_idx]
            # truncate to the 512-token window and mask the trailing special token
            label_ids = label_ids[:511] + [-100]
            labels.append(label_ids)
        except (IndexError, KeyError):
            # word and label counts can disagree when masking changed the word count;
            # mask the whole example out of the loss and count it
            global k
            k += 1
            labels.append([-100] * len(tokenized_inputs["input_ids"][i]))

    tokenized_inputs["labels"] = labels
    return tokenized_inputs
```
We use these two functions to generate source-text-level labels and then align them with the tokenizer's token-level labels, so that any model and tokenizer can be trained on [ai4privacy/pii-masking-400k](https://huggingface.co/datasets/ai4privacy/pii-masking-400k).
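
A sketch of applying the preprocessing over the whole dataset (the `remove_columns` choice is an assumption; keep any columns you still need):

```python
# batched map, so `examples` contains lists of texts and masks as the function expects
tokenized_dataset = dataset.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=dataset["train"].column_names,
)
print(f"misaligned examples skipped: {k}")
```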


## Evaluation

![image/png](https://cdn-uploads.huggingface.co/production/uploads/671e31b377035878c5f4082a/kzlMRqXBz80y63CmqDWDx.png)

Evaluation results for this model (model 2) on the validation set are shown in the table above.
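
The reported accuracy, F1, precision, and recall are standard token-classification metrics; a sketch of computing them with `seqeval` (the label sequences below are placeholders):

```python
from seqeval.metrics import accuracy_score, f1_score, precision_score, recall_score

# y_true / y_pred are lists of per-word BIO label sequences, e.g. from the validation set
y_true = [["O", "B-GIVENNAME", "I-GIVENNAME", "O"]]
y_pred = [["O", "B-GIVENNAME", "I-GIVENNAME", "O"]]

print("accuracy: ", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("f1:       ", f1_score(y_true, y_pred))
```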


## Disclaimer of Non-Affiliation

The publisher of this repository is not affiliated with Ai4Privacy or Ai Suisse SA.


@NeuralWave 2024 - *The Last Ones* Team.