Update index.html
index.html (+3 -3)
@@ -61,9 +61,9 @@ Exploring Refusal Loss Landscapes </title>
 Human Feedback (RLHF). However, recent studies have highlighted the vulnerability of LLMs to adversarial
 jailbreak attempts aiming at subverting the embedded safety guardrails. To address this challenge,
 we define and investigate the <strong>Refusal Loss</strong> of LLMs and then propose a method called <strong>Gradient Cuff</strong> to
-detect jailbreak attempts. In this demonstration, we first introduce the concept of "Jailbreak". Then we present the
-
-methods and show the defense performance.
+detect jailbreak attempts. In this demonstration, we first introduce the concept of "Jailbreak". Then we present the 2-d Refusal Loss
+Landscape and propose Gradient Cuff based on the characteristics of this landscape. Lastly, we compare Gradient Cuff with other jailbreak defense
+methods and show the defense performance against several Jailbreak attack methods.
 </p>

 <h2 id="what-is-jailbreak">What is Jailbreak?</h2>