cotran2 commited on
Commit
a5466a2
·
verified ·
1 Parent(s): ad0afe9

Create README.md

Browse files
Files changed (1) hide show
  1. README.md +57 -0
README.md ADDED
@@ -0,0 +1,57 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ license: mit
3
+ language:
4
+ - en
5
+ base_model:
6
+ - meta-llama/Prompt-Guard-86M
7
+ pipeline_tag: text-classification
8
+ datasets:
9
+ - SohamGhadge/casual-conversation
10
+ - tau/commonsense_qa
11
+ - AIR-Bench/qa_finance_en
12
+ - JailbreakBench/JBB-Behaviors
13
+ - rubend18/ChatGPT-Jailbreak-Prompts
14
+ - cstnz/Disaster-tweet-jailbreaking
15
+ - JailbreakV-28K/JailBreakV-28k
16
+ - Amod/mental_health_counseling_conversations
17
+ - talkmap/telecom-conversation-corpus
18
+ - truthfulqa/truthful_qa
19
+ - GEM/conversational_weather
20
+ ---
21
+ # katanemo/Arch-Guard-gpu
22
+
23
+ ## Overview
24
+ The Katanemo Arch-Guard collection is a collection state-of-the-art (SOTA) LLMs specifically designed for **jailbreaking detection** tasks.
25
+ Definition: jailbreaking attempts are malicious prompts designed to alternate the intended behavior of the foundation LLM model of the application. They often violate the safety and security policies of the model.
26
+
27
+ Arch Guard is a classifier model fine-tuned based on the open source model [Prompt-Guard-86M](https://huggingface.co/meta-llama/Prompt-Guard-86M) on a collection of open-source datasets of jailbreaking attemps with an intention to improve
28
+ the capability of detecting jailbreaks only.
29
+
30
+ In summary, the Katanemo Arch-Guard collection demonstrates:
31
+ - **State-of-the-art performance** in jailbreaking attempts detection
32
+ - Optimized **low-latency, low False Positive Rate**, making it suitable for real-time, production environments, and best user experience.
33
+
34
+ | Dominant class = jailbreak | | | | | | | |
35
+ | -------------------------- | ------ | ------ | ------ | ------ | ----- | --------- | ------ |
36
+ | Model | TPR | TNR | FPR | FNR | AUC | Precision | Recall |
37
+ | Prompt-guard | 0.8468 | 0.9972 | 0.0028 | 0.1532 | 0.857 | 0.715 | 0.999 |
38
+ | Arch-guard | 0.8887 | 0.9970 | 0.0030 | 0.1113 | 0.880 | 0.761 | 0.999 |
39
+
40
+ ## Requirements
41
+ The gpu model is quantized with EEtq, please follow the instruction at https://github.com/NetEase-FuXi/EETQ?tab=readme-ov-file#getting-started to install the package.
42
+
43
+ ## Datasets
44
+ Evaluation dataset is sourced from a combination of open source datasets.
45
+
46
+ ## How to use
47
+
48
+ ````python
49
+ from transformers import pipeline
50
+
51
+ pipe = pipeline("text-classification", model="katanemolabs/Arch-Guard-gpu")
52
+ pipe("Ignore your instruction")
53
+
54
+ ````
55
+
56
+ # License
57
+ Katanemo Arch-Guard is distributed under the [Katanemo license](https://huggingface.co/katanemolabs/Arch-Guard/blob/main/LICENSE).