TrustSafeAI/GradientCuff-Jailbreak-Defense
gregH committed on Feb 27

Commit 2654ca5 (1 parent: c9fcacb)

Update index.html

Files changed (1): index.html

@@ -55,15 +55,18 @@ Exploring Refusal Loss Landscapes </title>
     <main id="content" class="main-content" role="main">
       <h2 id="introduction">Introduction</h2>
-<p>Neural network calibration is an essential task in deep learning to ensure consistency
-between the confidence of model prediction and the true correctness likelihood. In this
-demonstration, we first visualize the idea of neural network calibration on a binary
-classifier and show model features that represent its calibration. Second, we introduce
-our proposed framework <strong>Neural Clamping</strong>, which employs a simple joint input-output
-transformation on a pre-trained classifier. We also provide other calibration approaches
-(e.g., temperature scaling) to compare with Neural Clamping.</p>
-<h2 id="what-is-jailbreak">What is Calibration?</h2>
+<p>Large Language Models (LLMs) are becoming a prominent generative AI tool: the user enters a
+  query and the LLM generates an answer. To reduce harm and misuse, efforts have been made to align
+  these LLMs with human values using advanced training techniques such as Reinforcement Learning from
+  Human Feedback (RLHF). However, recent studies have highlighted the vulnerability of LLMs to adversarial
+  jailbreak attempts that aim to subvert the embedded safety guardrails. To address this challenge,
+  we define and investigate the <strong>Refusal Loss</strong> of LLMs and then propose a method called <strong>Gradient Cuff</strong> to
+  detect jailbreak attempts. In this demonstration, we first introduce the concept of "Jailbreak". We then present the refusal loss
+  landscape and propose Gradient Cuff based on the characteristics of this landscape. Lastly, we compare Gradient Cuff with other jailbreak defense
+  methods and report its defense performance.
+</p>
+<h2 id="what-is-jailbreak">What is Jailbreak?</h2>
 <p>Neural Network Calibration seeks to make model prediction align with its true correctness likelihood.
 A well-calibrated model should provide accurate predictions and reliable confidence when making inferences. On the
 contrary, a poor calibration model would have a wide gap between its accuracy and average confidence level.