inference: false
license: apache-2.0
datasets:
- metricspace/AnonymeData
pipeline_tag: text2text-generation
EntityAnonymization-3B-V0.9
EntityAnonymization identifies entities in texts and replaces them with randomised versions.
In a first pass, the entities are recognised and a dictionary with similar but randomised variants is created.
In a second run, the original text and the dictionary are provided and the paraphrased variant is generated.
The two-step approach allows the dictionary to be cached and later used to restore the original entities in an anonymised text that has been further processed.
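As a rough illustration of why caching the dictionary is useful, the sketch below inverts a cached mapping and applies it with plain string replacement. This is not the model's own method (the examples further down use the model itself for the reverse step); `invert_mapping`, `apply_mapping`, and the sample data are hypothetical names chosen for illustration. Exact-match replacement only catches full strings such as "Mr. Edward Martin", so partial mentions like "Mr. Martin" would still need the model's rephrase step.

```python
import re
from typing import Dict


def invert_mapping(mapping: Dict[str, str]) -> Dict[str, str]:
    """Swap keys and values so pseudonyms map back to the original entities."""
    return {pseudonym: original for original, pseudonym in mapping.items()}


def apply_mapping(text: str, mapping: Dict[str, str]) -> str:
    """Replace every exact occurrence of a key with its value in one pass.

    A single regex pass avoids chained replacements, and sorting keys by
    length makes longer entities win over shorter ones that overlap.
    """
    if not mapping:
        return text
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    return pattern.sub(lambda match: mapping[match.group(0)], text)


# Hypothetical cached dictionary from the first pass (original -> pseudonym).
cached = {
    "Mr. Benjamin Mitchell": "Mr. Edward Martin",
    "Employee Assistance Program (EAP)": "Personal Assistance Program (PAP)",
}

processed = "Dear Mr. Edward Martin, please contact the Personal Assistance Program (PAP)."
restored = apply_mapping(processed, invert_mapping(cached))
print(restored)
```

If two originals map to the same pseudonym, inverting the dictionary keeps only one of them, so a cached mapping should be checked for duplicate values before relying on it.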
License
This Natural Language Processing (NLP) model is made available under the Apache License, Version 2.0. You are free to use, modify, and distribute this software according to the terms and conditions of the Apache 2.0 License. For the full license text, please refer to the Apache 2.0 License.
Usage and Specific Capabilities
Text Length Limitation
The model is optimized to analyze texts containing up to 2048 tokens. If your text exceeds this limit, we recommend splitting it into smaller chunks, each containing no more than 2048 tokens. Each chunk can then be processed separately.
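One possible way to do the splitting is sketched below: lines are packed greedily into chunks that stay within a token budget. The function name `split_into_chunks` and the `count_tokens` callback are assumptions made for this sketch, not part of the model's API. The demo uses a whitespace word count as a stand-in tokenizer so that it runs without downloading the model.

```python
from typing import Callable, List


def split_into_chunks(
    text: str,
    count_tokens: Callable[[str], int],
    max_tokens: int = 2048,
) -> List[str]:
    """Greedily pack lines into chunks of at most max_tokens tokens."""
    chunks: List[str] = []
    current = ""
    for paragraph in text.split("\n"):
        candidate = f"{current}\n{paragraph}" if current else paragraph
        if count_tokens(candidate) <= max_tokens:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single line longer than the budget is kept whole here;
            # it would need a finer split (for example by sentence).
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


# Demo with a stand-in token counter (whitespace-separated words).
demo = "one two three\nfour five\nsix seven eight nine"
chunks = split_into_chunks(demo, lambda s: len(s.split()), max_tokens=5)
print(chunks)
```

With the real tokenizer loaded as in the usage example, the counter could be `lambda s: len(tokenizer(s).input_ids)`. Choosing a budget somewhat below 2048 to leave room for the prompt wrapper is an assumption you may want to test for your own texts.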
Supported Languages
Bulgarian, Chinese, Czech, Dutch, English, Estonian, Finnish, French, German, Greek, Indonesian, Italian, Japanese, Korean, Lithuanian, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Spanish, Swedish, Turkish
Use Cases
Entity Resampling and Anonymization
This model extracts entities from sensitive text and anonymizes them. It is designed to identify and protect confidential information, helping organizations comply with data privacy regulations and reducing the risk of inadvertently disclosing classified data and trade secrets.
Example Usage
!pip install sentencepiece
!pip install transformers
import torch
import json
import re
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("metricspace/EntityAnonymization-3B-V0.9")
model = AutoModelForCausalLM.from_pretrained("metricspace/EntityAnonymization-3B-V0.9", torch_dtype=torch.bfloat16)
model.to("cuda:0")
def extract_last_assistant_response(input_text):
    # Find the last occurrence of "ASSISTANT:" in the input text
    marker = "ASSISTANT:"
    index = input_text.rfind(marker)
    if index == -1:
        # No marker found: return the text unchanged
        return input_text.strip()
    # Return everything after the last "ASSISTANT:"
    return input_text[index + len(marker):].strip()
# Input example
text_to_anonymize = '''Subject: HR Incident Report: Speculation of Drug Misuse by Mr. Benjamin Mitchell
Dear Mrs. Alice Williams,
I trust you're well. I wish to bring to your attention a concerning matter involving one of our esteemed employees, Mr. Benjamin Mitchell.
Employee Details:
Name: Benjamin Mitchell
Position: Senior Marketing Creative
Department: Marketing
Date of Joining: January 15, 2020
Reporting Manager: Mrs. Jane Fitzgerald
Incident Details:
Date: October 25, 2023
Location: Restroom, 4th Floor
Time: 11:45 AM
Description of Incident:
On the date specified, a few colleagues reported unusual behavior exhibited by Mr. Mitchell, which raised concerns about potential drug misuse. Witnesses mentioned that Benjamin appeared disoriented and was found in the restroom for an extended period. Some employees also discovered unidentified pills in close proximity to his chair.
Witness Accounts:
Ms. Emily Clark: "Benjamin seemed distracted and not his usual self today. He's been taking frequent breaks and appears a bit disoriented."
Mr. Robert Taylor: "I found some pills near his chair on the floor. It's concerning, and I felt it necessary to report."
Immediate Actions Taken:
Mr. Benjamin Mitchell was approached by HR for a preliminary conversation to understand the situation.
Mrs. Jane Fitzgerald, his reporting manager, was made aware of the concerns.
Recommendations:
It's crucial to have a private and supportive conversation with Mr. Mitchell to understand if there's an underlying issue.
Consider referring Benjamin to our Employee Assistance Program (EAP) for counseling or support.
It may be beneficial to organize a session on drug awareness and workplace safety for all employees.
It's of utmost importance to handle this situation with sensitivity and discretion, ensuring the wellbeing of Mr. Mitchell and maintaining the integrity of our workplace environment. This email serves as a formal documentation of the incident. We'll determine the subsequent course of action based on your guidance and the recommendations provided.
Looking forward to your direction on this matter.
'''
print(text_to_anonymize)
# Step 1: Extracting entities from text
prompt = f'USER: Resample the entities: {text_to_anonymize}\n\nASSISTANT:'
inputs = tokenizer(prompt, return_tensors='pt').to('cuda:0')
output_entities = model.generate(inputs.input_ids, max_new_tokens=300, do_sample=False, temperature=0.8, penalty_alpha=1.3, top_k=180, num_beams=5, repetition_penalty=2.3)
raw_output_entities_text = tokenizer.decode(output_entities[0])
entities = extract_last_assistant_response(raw_output_entities_text)
print('-----------Entities----------------')
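import ast  # literal_eval parses the dictionary without executing arbitrary code
try:
    entities = re.search(r"\{.*?\}", entities, re.DOTALL).group(0)
    data_dict = ast.literal_eval(entities)
    formatted_json = json.dumps(data_dict, indent=4)
    print(formatted_json)
except (AttributeError, ValueError, SyntaxError, TypeError):
    # No dictionary found, or the model produced a badly formatted one
    print(entities)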
#output
'''
{
"Mr. Benjamin Mitchell": "Mr. Edward Martin",
"Mrs. Alice Williams": "Mrs. Charlotte Johnson",
"January 15, 2020": "January 15, 2020",
"Mrs. Jane Fitzgerald": "Mrs. Jane Anderson",
"October 25, 2023": "October 25, 2023",
"4th Floor": "topmost floor",
"11:45 AM": "midday",
"Emily Clark": "Marie Foster",
"Employee Assistance Program (EAP)": "Personal Assistance Program (PAP)",
"Robert Taylor": "Benjamin Adams",
}
'''
# Step 2: Use entities to resample the original text
prompt_2 = f"USER: Rephrase with {entities}: {text_to_anonymize}\n\nASSISTANT:"
inputs = tokenizer(prompt_2, return_tensors='pt').to('cuda:0')
output_resampled = model.generate(inputs.input_ids, max_length=2048)
raw_output_resampled_text = tokenizer.decode(output_resampled[0])
resampled_text = extract_last_assistant_response(raw_output_resampled_text)
print('---------Anonymized Version--------')
print(resampled_text)
#output:
'''
Subject: HR Incident Report: Speculation of Drug Misuse by Mr. Edward Martin
Dear Mrs. Charlotte Johnson,
I trust you're well. I wish to bring to your attention a concerning matter involving one of our esteemed employees, Mr. Edward Martin.
Employee Details:
Name: Edward Martin
Position: Senior Marketing Creative
Department: Marketing
Date of Joining: January 15, 2020
Reporting Manager: Mrs. Jane Anderson
Incident Details:
Date: October 25, 2023
Location: Restroom, topmost floor
Time: midday
Description of Incident:
On the date specified, a few colleagues reported unusual behavior exhibited by Mr. Martin, which raised concerns about potential drug misuse. Witnesses mentioned that Edward appeared disoriented and was found in the restroom for an extended period. Some employees also discovered unidentified pills in close proximity to his chair.
Witness Accounts:
Ms. Marie Foster: "Edward seemed distracted and not his usual self today. He's been taking frequent breaks and appears a bit disoriented."
Mr. Benjamin Adams: "I found some pills near his chair on the floor. It's concerning, and I felt it necessary to report."
Immediate Actions Taken:
Mr. Edward Martin was approached by People Management for a preliminary conversation to understand the situation.
Mrs. Jane Anderson, his reporting manager, was made aware of the concerns.
Recommendations:
It's crucial to have a private and supportive conversation with Mr. Martin to understand if there's an underlying issue.
Consider referring Edward to our Personal Assistance Program (PAP) for counseling or support.
It may be beneficial to organize a session on drug awareness and workplace safety for all employees.
It's of utmost importance to handle this situation with sensitivity and discretion, ensuring the wellbeing of Mr. Martin and maintaining the integrity of our workplace environment. This email serves as a formal documentation of the incident. We'll determine the subsequent course of action based on your guidance and the recommendations provided.
Looking forward to your direction on this matter.
'''
Example: Process the anonymized version with GPT-4 and restore the original entities
import torch
import json
import re
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("metricspace/EntityAnonymization-3B-V0.9")
model = AutoModelForCausalLM.from_pretrained("metricspace/EntityAnonymization-3B-V0.9", torch_dtype=torch.bfloat16)
model.to("cuda:0")
# Anonymized input
anonymized_text = '''Subject: HR Incident Report: Speculation of Drug Misuse by Mr. Edward Martin
Dear Mrs. Charlotte Johnson,
I trust you're well. I wish to bring to your attention a concerning matter involving one of our esteemed employees, Mr. Edward Martin.
Employee Details:
Name: Edward Martin
Position: Senior Marketing Creative
Department: Marketing
Date of Joining: January 15, 2020
Reporting Manager: Mrs. Jane Anderson
Incident Details:
Date: October 25, 2023
Location: Restroom, topmost floor
Time: midday
Description of Incident:
On the date specified, a few colleagues reported unusual behavior exhibited by Mr. Martin, which raised concerns about potential drug misuse. Witnesses mentioned that Edward appeared disoriented and was found in the restroom for an extended period. Some employees also discovered unidentified pills in close proximity to his chair.
Witness Accounts:
Ms. Marie Foster: "Edward seemed distracted and not his usual self today. He's been taking frequent breaks and appears a bit disoriented."
Mr. Benjamin Adams: "I found some pills near his chair on the floor. It's concerning, and I felt it necessary to report."
Immediate Actions Taken:
Mr. Edward Martin was approached by People Management for a preliminary conversation to understand the situation.
Mrs. Jane Anderson, his reporting manager, was made aware of the concerns.
Recommendations:
It's crucial to have a private and supportive conversation with Mr. Martin to understand if there's an underlying issue.
Consider referring Edward to our Personal Assistance Program (PAP) for counseling or support.
It may be beneficial to organize a session on drug awareness and workplace safety for all employees.
It's of utmost importance to handle this situation with sensitivity and discretion, ensuring the wellbeing of Mr. Martin and maintaining the integrity of our workplace environment. This email serves as a formal documentation of the incident. We'll determine the subsequent course of action based on your guidance and the recommendations provided.
Looking forward to your direction on this matter.
'''
# Entities map
entities_map = '''
{
"Mr. Benjamin Mitchell": "Mr. Edward Martin",
"Mrs. Alice Williams": "Mrs. Charlotte Johnson",
"January 15, 2020": "January 15, 2020",
"Mrs. Jane Fitzgerald": "Mrs. Jane Anderson",
"October 25, 2023": "October 25, 2023",
"4th Floor": "topmost floor",
"11:45 AM": "midday",
"Emily Clark": "Marie Foster",
"Employee Assistance Program (EAP)": "Personal Assistance Program (PAP)",
"Robert Taylor": "Benjamin Adams",
}
'''
# Step 1: Processing anonymized text with GPT-4
import openai
# Note: openai.ChatCompletion is the legacy interface and requires openai<1.0
openai.api_key = "<API_KEY>"
completion = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "user", "content": f"Write an official warning letter to the employee, that we do not tolerate it and with the next incident he gets fired in the name of Mrs. Charlotte Johnson Human Resources Manager. Here is the report with information. {anonymized_text}"}
    ]
)
print(completion.choices[0].message.content)
#output
'''
Subject: Official Warning – Substance Misuse Policy Violation
Dear Mr. Edward Martin,
We hope this letter finds you well. This letter serves as an official warning in regards to an incident that occurred on October 25, 2023, in which signs of drug misuse were reported. This alleged conduct is highly alarming and contrary to our company’s policies and guidelines.
The incident detailed allegations of unusual behavior indicative of possible substance abuse. Colleagues reported that you appeared disoriented and were found in the restroom for an extended period. Additionally, unidentified pills were discovered near your workspace.
Our company is committed to providing a safe and non-detrimental work environment for all its workforce. This commitment is compromised when any type of drug misuse occurs. We draw your attention to our Employee Handbook, specifically 'Section 5: Substance Misuse', that states any illegal drug use, substance misuse or distribution thereof is strictly prohibited and could be subject to severe disciplinary action, including termination.
This is an official warning that such behavior misaligns with our workplace norms and cannot be tolerated. Another incident like this or similar breach of company guidelines may lead to further disciplinary action, up to and including termination of employment.
Please note that this is not an assumption of your guilt but an assertion to remain vigilant against any practices that could harm you or the workplace environment. We encourage you to utilize our Personal Assistance Program (PAP) as a tool for counseling and support, if needed.
We believe in your potential to rectify this situation and to maintain the high standards we are all accustomed to in our organization.
Should you need assistance or if you wish to discuss this matter further, please feel free to reach out to me. We appreciate your immediate attention to this important issue.
Yours sincerely,
Mrs. Charlotte Johnson
Human Resources Manager
'''
# Step 2: Restore the original entities in the text processed by GPT-4.
import ast
def swap_keys_and_values_in_string(input_str):
    # Convert the input string to a dictionary
    input_dict = ast.literal_eval(input_str)
    # Swap the keys and values
    swapped_dict = {v: k for k, v in input_dict.items()}
    # Convert the swapped dictionary back to a string
    swapped_str = str(swapped_dict)
    return swapped_str
gpt_response = completion.choices[0].message.content
entities_map = swap_keys_and_values_in_string(entities_map)
prompt = f"USER: Rephrase with {entities_map}: {gpt_response}\n\nASSISTANT:"
inputs = tokenizer(prompt, return_tensors='pt').to('cuda:0')
outputs = model.generate(inputs.input_ids, max_new_tokens=2048)
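output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
# Keep only the text after the last "ASSISTANT:" marker and print it
print(output_text.rsplit("ASSISTANT:", 1)[-1].strip())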
#output:
'''
Subject: Official Warning – Substance Misuse Policy Violation
Dear Mr. Benjamin Mitchell,
We hope this letter finds you well. This letter serves as an official warning in regards to an incident that occurred on January 15,
2020, in which signs of drug misuse were reported. This alleged conduct is highly alarming and contrary to our company’s policies and guidelines.
The incident detailed allegations of unusual behavior indicative of possible substance abuse. Colleagues reported that you appeared disoriented and
were found in the restroom for an extended period. Additionally, unidentified pills were discovered near your workspace.
Our company is committed to providing a safe and non-detrimental work environment for all its workforce. This commitment is compromised when any
type of drug misuse occurs. We draw your attention to our Employee Handbook, specifically 'Section 5: Substance Misuse', that states any illegal
drug use, substance misuse or distribution thereof is strictly prohibited and could be subject to severe disciplinary action, including termination.
This is an official warning that such behavior misaligns with our workplace norms and cannot be tolerated. Another incident like this or similar breach
of company guidelines may lead to further disciplinary action, up to and including termination of employment.
Please note that this is not an assumption of your guilt but an assertion to remain vigilant against any practices that could harm you or the workplace
environment. We encourage you to utilize our Employee Assistance Program (EAP) as a tool for counseling and support, if needed.
We believe in your potential to rectify this situation and to maintain the high standards we are all accustomed to in our organization.
Should you need assistance or if you wish to discuss this matter further, please feel free to reach out to me. We appreciate your immediate attention
to this important issue.
Yours sincerely,
Mrs. Alice Williams,
Human Resources Manager.
'''
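Because the restored text is generated by the model rather than produced by exact substitution, it can be worth checking the result before using it. The sketch below is one simple check under that assumption: it lists pseudonyms from the mapping that still appear in the restored text. The helper name `find_leftover_pseudonyms` and the sample strings are hypothetical and only serve the illustration.

```python
from typing import Dict, List


def find_leftover_pseudonyms(text: str, mapping: Dict[str, str]) -> List[str]:
    """Return pseudonyms (values of original -> pseudonym) still found in text.

    Entries that map to themselves are skipped, since they are expected
    to appear unchanged.
    """
    return [
        pseudonym
        for original, pseudonym in mapping.items()
        if pseudonym != original and pseudonym in text
    ]


# Hypothetical mapping and restored text for demonstration.
mapping = {
    "Mr. Benjamin Mitchell": "Mr. Edward Martin",
    "October 25, 2023": "October 25, 2023",
    "Employee Assistance Program (EAP)": "Personal Assistance Program (PAP)",
}
restored = "Dear Mr. Benjamin Mitchell, please use our Personal Assistance Program (PAP)."
leftovers = find_leftover_pseudonyms(restored, mapping)
print(leftovers)
```

An empty list does not prove that the restoration is correct (for example, a date may have been swapped for another valid date), so a manual review of names and dates is still advisable.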
…
Dataset and Training Documentation for Audit
If you require the original dataset used for training this model, or further documentation related to its training and architecture for audit purposes, you can request this information by contacting us.
Further Tuning Services for Custom Use Cases
For specialized needs or custom use cases, we offer further tuning services to adapt the model to your specific requirements. To inquire about these services, please reach out to us at:
📧 Email: info@metric-space.ai
Please note that the availability of the dataset, additional documentation, and tuning services may be subject to certain conditions and limitations.