gregH committed on
Commit da4ab75 · verified · 1 Parent(s): 82981eb

Update index.html

Files changed (1)
  1. index.html +4 -6
index.html CHANGED
@@ -62,16 +62,14 @@ Exploring Refusal Loss Landscapes </title>
  jailbreak attempts aiming at subverting the embedded safety guardrails. To address this challenge,
  we define and investigate the \textbf{Refusal Loss} of LLMs and then propose a method called \textbf{Gradient Cuff} to
  detect jailbreak attempts. In this demonstration, we first introduce the concept of "Jailbreak". Then we present the refusal loss
- landscape and based on the characteristics of this landscape to propose the Gradient Cuff. Lastly, we compare it with other jailbreak defense
+ landscape and propose the Gradient Cuff based on the characteristics of this landscape. Lastly, we compare Gradient Cuff with other jailbreak defense
  methods and show the defense performance.
  </p>

  <h2 id="what-is-jailbreak">What is Jailbreak?</h2>
- <p>Neural Network Calibration seeks to make model prediction align with its true correctness likelihood.
- A well-calibrated model should provide accurate predictions and reliable confidence when making inferences. On the
- contrary, a poor calibration model would have a wide gap between its accuracy and average confidence level.
- This phenomenon could hamper scenarios requiring accurate uncertainty estimation, such as safety-related tasks
- (e.g., autonomous driving systems, medical diagnosis, etc.).</p>
+ <p>Jailbreak attacks involve maliciously inserting or replacing tokens in the user instruction or rewriting it to bypass and circumvent
+ the safety guardrails of aligned LLMs. A notable example is that a jailbroken LLM would be tricked into
+ generating hate speech targeting certain groups of people, as demonstrated below.</p>

  <div class="container">
  <div id="jailbreak-intro" class="row align-items-center jailbreak-intro-sec">
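As context for the updated description (a refusal loss landscape and a Gradient Cuff defense built on it), the following is a minimal, hypothetical sketch of the detection idea rather than the demo's or the paper's implementation: estimate the refusal loss by sampling whether the model refuses a query, then flag queries whose loss is already low or whose loss landscape looks sharp under small perturbations. The helpers `respond`, `is_refusal`, and `perturb`, as well as both thresholds, are placeholders introduced only for illustration.

```python
import random
from typing import Callable, Optional

def sample_refusal_loss(query: str,
                        respond: Callable[[str], str],
                        is_refusal: Callable[[str], bool],
                        n_samples: int = 8) -> float:
    """Monte-Carlo estimate of a 'refusal loss': 1 minus the fraction of
    sampled responses that are refusals. `respond` and `is_refusal` are
    hypothetical stand-ins for querying the LLM and classifying its output."""
    refusals = sum(is_refusal(respond(query)) for _ in range(n_samples))
    return 1.0 - refusals / n_samples

def perturb(query: str, rng: random.Random) -> str:
    """Toy textual perturbation (drop one random word). A real detector would
    perturb in embedding space; this keeps the sketch self-contained."""
    words = query.split()
    if len(words) > 1:
        words.pop(rng.randrange(len(words)))
    return " ".join(words)

def estimate_loss_sharpness(query: str,
                            respond: Callable[[str], str],
                            is_refusal: Callable[[str], bool],
                            n_directions: int = 4,
                            rng: Optional[random.Random] = None) -> float:
    """Zeroth-order proxy for the gradient norm of the refusal loss:
    average |loss(perturbed query) - loss(original query)|."""
    rng = rng or random.Random(0)
    base = sample_refusal_loss(query, respond, is_refusal)
    diffs = [abs(sample_refusal_loss(perturb(query, rng), respond, is_refusal) - base)
             for _ in range(n_directions)]
    return sum(diffs) / n_directions

def flag_as_jailbreak(query: str,
                      respond: Callable[[str], str],
                      is_refusal: Callable[[str], bool],
                      loss_threshold: float = 0.5,
                      sharpness_threshold: float = 0.2) -> bool:
    """Two-step check in the spirit of the described defense: flag the query
    if the model already refuses it most of the time (low refusal loss), or
    if the refusal-loss landscape around it looks steep."""
    if sample_refusal_loss(query, respond, is_refusal) < loss_threshold:
        return True
    return estimate_loss_sharpness(query, respond, is_refusal) > sharpness_threshold
```

With stubbed `respond` and `is_refusal` callables this runs end to end; in practice the perturbation scheme and both thresholds would have to be calibrated on benign and jailbroken queries, which is beyond the scope of this sketch.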