Update README.md
README.md (changed)
@@ -11,34 +11,30 @@ This Natural Language Processing (NLP) model is made available under the Apache
The model is optimized to analyze texts containing up to 512 tokens. If your text exceeds this limit, we recommend splitting it into smaller chunks, each containing no more than 512 tokens. Each chunk can then be processed separately.
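
One way to do this splitting is sketched below. This is a minimal sketch rather than part of the official example: the `split_into_chunks` helper is hypothetical, and the chunk size of 480 tokens is our assumption, chosen to leave headroom for the "Check for sensitive information: " prefix used in the Example Usage section further down.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("metricspace/DataPrivacyComplianceCheck-3B-V0.9")

def split_into_chunks(text, chunk_size=480):
    # Tokenize without special tokens, then regroup the ids into windows of at most chunk_size tokens.
    token_ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    return [
        tokenizer.decode(token_ids[start:start + chunk_size])
        for start in range(0, len(token_ids), chunk_size)
    ]
```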
## Supported Languages
Bulgarian, Chinese, Czech, Dutch, English, Estonian, Finnish, French, German, Greek, Indonesian, Italian, Japanese, Korean, Lithuanian, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Spanish, Swedish, Turkish
# Use Cases
## Data Privacy and Compliance
This model is designed to screen for sensitive data and "Geschäftsgeheimnisse" (trade secrets) in text. By doing so, it helps organizations remain compliant with data privacy laws and reduces the risk of accidental exposure of confidential information.
# Example Usage

The prefix for the prompt is "Check for sensitive information: ", i.e. the model expects input of the form "Check for sensitive information: <your text here>".

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("metricspace/DataPrivacyComplianceCheck-3B-V0.9")
# Load the model in bfloat16 and move it to the GPU so it matches the inputs below.
model = AutoModelForCausalLM.from_pretrained(
    "metricspace/DataPrivacyComplianceCheck-3B-V0.9", torch_dtype=torch.bfloat16
).to('cuda')

text_to_check = 'John, our patient, felt a throbbing headache and dizziness for two weeks. He was immediately...'

# Prepend the expected prompt prefix.
prompt = f"Check for sensitive information: {text_to_check}"
inputs = tokenizer(prompt, return_tensors='pt').to('cuda')

# Generate up to 512 new tokens (greedy decoding by default).
max_length = 512
outputs = model.generate(inputs.input_ids, max_new_tokens=max_length)

result = tokenizer.batch_decode(outputs, skip_special_tokens=False)[0]
print(result)
```
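
For documents longer than 512 tokens, each chunk can be checked separately. The loop below is a minimal sketch that continues the snippet above, reusing `model` and `tokenizer` together with the hypothetical `split_into_chunks` helper sketched near the top of this README.

```python
long_text = "..."  # placeholder: any document longer than 512 tokens

for chunk in split_into_chunks(long_text):
    # Each chunk gets the same prompt prefix and is checked on its own.
    prompt = f"Check for sensitive information: {chunk}"
    inputs = tokenizer(prompt, return_tensors='pt').to('cuda')
    outputs = model.generate(inputs.input_ids, max_new_tokens=512)
    print(tokenizer.batch_decode(outputs, skip_special_tokens=False)[0])
```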
…
# Dataset and Training Documentation for Audit
If you require the original dataset used for training this model, or further documentation related to its training and architecture for audit purposes, you can request this information by contacting us.