Wachu2005 committed
Commit ddb2e7d · 1 Parent(s): 570b4ab

Update README.md

Files changed (1): README.md +9 -13
README.md CHANGED
@@ -11,34 +11,30 @@ This Natural Language Processing (NLP) model is made available under the Apache
The model is optimized to analyze texts containing up to 512 tokens. If your text exceeds this limit, we recommend splitting it into smaller chunks, each containing no more than 512 tokens. Each chunk can then be processed separately.
## Supported Languages
Bulgarian, Chinese, Czech, Dutch, English, Estonian, Finnish, French, German, Greek, Indonesian, Italian, Japanese, Korean, Lithuanian, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Spanish, Swedish, Turkish
- ## Example Usage

+
+ # Use Cases
+ ## Data Privacy and Compliance
+ This model is designed to screen for sensitive data and "Geschäftsgeheimnisse" (trade secrets) in text. By doing so, it helps organizations remain compliant with data privacy laws and reduces the risk of accidental exposure of confidential information.
+ # Example Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("metricspace/DataPrivacyComplianceCheck-3B-V0.9")
model = AutoModelForCausalLM.from_pretrained("metricspace/DataPrivacyComplianceCheck-3B-V0.9", torch_dtype=torch.bfloat16)

- prompt = f"Check for sensitive information: John, our patient, felt a throbbing headache and dizziness for two weeks. He was immediately..."
+ text_to_check = 'John, our patient, felt a throbbing headache and dizziness for two weeks. He was immediately...'
+
+ prompt = f"Check for sensitive information: {text_to_check}"
inputs = tokenizer(prompt, return_tensors='pt').to('cuda')

max_length = 512
- inputs_length = inputs.input_ids.shape[1]
- max_new_tokens_value = max_length - inputs_length
-
- outputs = model.generate(inputs.input_ids, max_new_tokens=max_new_tokens_value, do_sample=False, top_k=50, top_p=0.98)
+ outputs = model.generate(inputs.input_ids, max_new_tokens=max_length)

result = tokenizer.batch_decode(outputs, skip_special_tokens=False)[0]

print(result)
```
-
- # Use Cases
- ## Data Privacy and Compliance
- This model is designed to screen for sensitive data and "Geschäftsgeheimnisse" (trade secrets) in text. By doing so, it helps organizations remain compliant with data privacy laws and reduces the risk of accidental exposure of confidential information.
- # Example Usage
- the prefix for the prompt is "Check for sensitive information: "
- "Check for sensitive information: <your text here>""

# Dataset and Training Documentation for Audit
If you require the original dataset used for training this model, or further documentation related to its training and architecture for audit purposes, you can request this information by contacting us.
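Note that the committed example moves the tokenized inputs to `'cuda'` but leaves the model wherever `from_pretrained` loaded it, so the two can end up on different devices. A minimal runnable sketch of the same flow; the device-selection line is our addition, not part of the commit:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("metricspace/DataPrivacyComplianceCheck-3B-V0.9")
model = AutoModelForCausalLM.from_pretrained(
    "metricspace/DataPrivacyComplianceCheck-3B-V0.9", torch_dtype=torch.bfloat16
)

# Keep model and inputs on the same device (our addition: the README snippet
# moves only the inputs to 'cuda').
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

text_to_check = "John, our patient, felt a throbbing headache and dizziness for two weeks. He was immediately..."
prompt = f"Check for sensitive information: {text_to_check}"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Generate up to 512 new tokens, as in the README example.
outputs = model.generate(inputs.input_ids, max_new_tokens=512)
result = tokenizer.batch_decode(outputs, skip_special_tokens=False)[0]
print(result)
```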
 
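The README recommends splitting texts longer than 512 tokens into chunks and processing each one separately. A minimal sketch of that chunking step; the `chunk_text` helper name and the 480-token budget (leaving headroom for the `Check for sensitive information: ` prefix) are our assumptions:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("metricspace/DataPrivacyComplianceCheck-3B-V0.9")

def chunk_text(text: str, max_tokens: int = 480) -> list[str]:
    """Split `text` into pieces of at most `max_tokens` tokens each.

    480 leaves headroom under the 512-token limit for the
    "Check for sensitive information: " prefix (our assumption).
    """
    token_ids = tokenizer(text, add_special_tokens=False).input_ids
    return [
        tokenizer.decode(token_ids[i : i + max_tokens])
        for i in range(0, len(token_ids), max_tokens)
    ]

# Each chunk is then checked with the same prompt template:
long_text = "..."  # any text longer than 512 tokens
for chunk in chunk_text(long_text):
    prompt = f"Check for sensitive information: {chunk}"
    # ...tokenize, generate, and decode as in the example above
```

Cutting on raw token boundaries can split a sentence midway; chunking on sentence boundaries before counting tokens would be a reasonable refinement.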