katanemo/Arch-Guard-cpu

Overview

The Katanemo Arch-Guard collection is a collection of state-of-the-art (SOTA) LLMs designed specifically for jailbreak detection. Jailbreak attempts are malicious prompts designed to alter the intended behavior of an application's foundation LLM, often in violation of the model's safety and security policies.

Arch-Guard is a classifier fine-tuned from the open-source model Prompt-Guard-86M on a collection of open-source datasets of jailbreak attempts, with the aim of improving jailbreak detection specifically.

In summary, the Katanemo Arch-Guard collection demonstrates:

  • State-of-the-art performance in detecting jailbreak attempts
  • Low latency and a low false positive rate, making it suitable for real-time production environments without degrading the user experience

Dominant class = jailbreak

Model         TPR     TNR     FPR     FNR     AUC    Precision  Recall
Prompt-guard  0.8468  0.9972  0.0028  0.1532  0.857  0.715      0.999
Arch-guard    0.8887  0.9970  0.0030  0.1113  0.880  0.761      0.999
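
For readers less familiar with the rate columns, the sketch below shows how TPR, FNR, FPR, and TNR follow from confusion-matrix counts when jailbreak is the positive class. The function name `confusion_rates` and the counts are hypothetical and chosen only for illustration; they are not the evaluation data behind the table.

```python
def confusion_rates(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Return TPR, FNR, FPR, TNR for a binary classifier (positive = jailbreak)."""
    positives = tp + fn  # prompts that really are jailbreaks
    negatives = fp + tn  # prompts that really are benign
    return {
        "TPR": tp / positives,  # jailbreaks correctly flagged
        "FNR": fn / positives,  # jailbreaks missed
        "FPR": fp / negatives,  # benign prompts wrongly flagged
        "TNR": tn / negatives,  # benign prompts correctly passed
    }


# Hypothetical counts for illustration only, not the model's evaluation data.
rates = confusion_rates(tp=80, fn=20, fp=5, tn=895)
print({name: round(value, 4) for name, value in rates.items()})
```

By construction TPR + FNR = 1 and TNR + FPR = 1, which is also true of each row in the table above (for example, 0.8887 + 0.1113 for Arch-guard).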

Requirements

The CPU model is quantized with OVM (OpenVINO). Please follow the instructions at https://github.com/huggingface/optimum-intel to install the package.

Datasets

The evaluation dataset is drawn from:

  • casual_conversation
  • commonqa
  • financeqa
  • instruction
  • jailbreak_behavior_benign
  • jailbreak_behavior_harmful
  • jailbreak_judge
  • jailbreak_prompts
  • jailbreak_tweet
  • jailbreak_v
  • jailbreak_vigil
  • mental_health
  • telecom
  • truthqa
  • weather

How to use

from optimum.intel import OVModelForSequenceClassification
from transformers import AutoTokenizer

device = "cpu"
model_name = "katanemo/Arch-Guard-cpu"

# Load the quantized OpenVINO classifier.
guard_model = OVModelForSequenceClassification.from_pretrained(
    model_name, device_map=device, low_cpu_mem_usage=True
)
# Load the tokenizer that matches the classifier.
tokenizer = AutoTokenizer.from_pretrained(
    model_name, trust_remote_code=True
)
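
The snippet above only loads the model and tokenizer. As a minimal sketch of the remaining step, the helper below turns one row of classifier logits into a label and a probability using a softmax. The helper name `top_label`, the example logits, and the two-entry label map are all hypothetical; the model's actual label names are not stated in this card and should be read from its configuration.

```python
import math


def top_label(logits, id2label):
    """Apply a softmax to one row of logits and return (label, probability) of the top class."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


# Hypothetical logits and label map, for illustration only.
example_labels = {0: "BENIGN", 1: "JAILBREAK"}
label, prob = top_label([-1.2, 2.3], example_labels)
print(label, round(prob, 3))
```

In practice, and assuming the model follows the usual transformers conventions, the logits would come from passing the tokenizer output (created with return_tensors="pt") to the loaded model and taking `.logits[0].tolist()`, and the label map from the model's `config.id2label`.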

License

Katanemo Arch-Guard-cpu is distributed under the Katanemo license.
