gregH committed on
Commit da4ab75 · verified · 1 Parent(s): 82981eb

Update index.html

Files changed (1)
  1. index.html +4 -6
index.html CHANGED
@@ -62,16 +62,14 @@ Exploring Refusal Loss Landscapes </title>
  jailbreak attempts aiming at subverting the embedded safety guardrails. To address this challenge,
  we define and investigate the \textbf{Refusal Loss} of LLMs and then propose a method called \textbf{Gradient Cuff} to
  detect jailbreak attempts. In this demonstration, we first introduce the concept of "Jailbreak". Then we present the refusal loss
- landscape and based on the characteristics of this landscape to propose the Gradient Cuff. Lastly, we compare it with other jailbreak defense
+ landscape and propose the Gradient Cuff based on the characteristics of this landscape. Lastly, we compare Gradient Cuff with other jailbreak defense
  methods and show the defense performance.
  </p>

  <h2 id="what-is-jailbreak">What is Jailbreak?</h2>
- <p>Neural Network Calibration seeks to make model prediction align with its true correctness likelihood.
- A well-calibrated model should provide accurate predictions and reliable confidence when making inferences. On the
- contrary, a poor calibration model would have a wide gap between its accuracy and average confidence level.
- This phenomenon could hamper scenarios requiring accurate uncertainty estimation, such as safety-related tasks
- (e.g., autonomous driving systems, medical diagnosis, etc.).</p>
+ <p>Jailbreak attacks involve maliciously inserting or replacing tokens in the user instruction or rewriting it to bypass and circumvent
+ the safety guardrails of aligned LLMs. A notable example is that a jailbroken LLM would be tricked into
+ generating hate speech targeting certain groups of people, as demonstrated below.</p>

  <div class="container">
  <div id="jailbreak-intro" class="row align-items-center jailbreak-intro-sec">
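As context for the updated description (a refusal loss landscape and a Gradient Cuff defense built on it), the following is a minimal, hypothetical sketch of the detection idea rather than the demo's or the paper's implementation: estimate the refusal loss by sampling whether the model refuses a query, then flag queries whose loss is already low or whose loss landscape looks sharp under small perturbations. The helpers `respond`, `is_refusal`, and `perturb`, as well as both thresholds, are placeholders introduced only for illustration.

```python
import random
from typing import Callable, Optional

def sample_refusal_loss(query: str,
                        respond: Callable[[str], str],
                        is_refusal: Callable[[str], bool],
                        n_samples: int = 8) -> float:
    """Monte-Carlo estimate of a 'refusal loss': 1 minus the fraction of
    sampled responses that are refusals. `respond` and `is_refusal` are
    hypothetical stand-ins for querying the LLM and classifying its output."""
    refusals = sum(is_refusal(respond(query)) for _ in range(n_samples))
    return 1.0 - refusals / n_samples

def perturb(query: str, rng: random.Random) -> str:
    """Toy textual perturbation (drop one random word). A real detector would
    perturb in embedding space; this keeps the sketch self-contained."""
    words = query.split()
    if len(words) > 1:
        words.pop(rng.randrange(len(words)))
    return " ".join(words)

def estimate_loss_sharpness(query: str,
                            respond: Callable[[str], str],
                            is_refusal: Callable[[str], bool],
                            n_directions: int = 4,
                            rng: Optional[random.Random] = None) -> float:
    """Zeroth-order proxy for the gradient norm of the refusal loss:
    average |loss(perturbed query) - loss(original query)|."""
    rng = rng or random.Random(0)
    base = sample_refusal_loss(query, respond, is_refusal)
    diffs = [abs(sample_refusal_loss(perturb(query, rng), respond, is_refusal) - base)
             for _ in range(n_directions)]
    return sum(diffs) / n_directions

def flag_as_jailbreak(query: str,
                      respond: Callable[[str], str],
                      is_refusal: Callable[[str], bool],
                      loss_threshold: float = 0.5,
                      sharpness_threshold: float = 0.2) -> bool:
    """Two-step check in the spirit of the described defense: flag the query
    if the model already refuses it most of the time (low refusal loss), or
    if the refusal-loss landscape around it looks steep."""
    if sample_refusal_loss(query, respond, is_refusal) < loss_threshold:
        return True
    return estimate_loss_sharpness(query, respond, is_refusal) > sharpness_threshold
```

With stubbed `respond` and `is_refusal` callables this runs end to end; in practice the perturbation scheme and both thresholds would have to be calibrated on benign and jailbroken queries, which is beyond the scope of this sketch.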