Update index.html
index.html +12 -9
index.html
CHANGED
@@ -55,15 +55,18 @@ Exploring Refusal Loss Landscapes </title>
 <main id="content" class="main-content" role="main">
 <h2 id="introduction">Introduction</h2>
 
-<p>
-
-
-
-
-
-
-
-
+<p>Large Language Models (LLMs) are becoming a prominent generative AI tool: the user enters a
+query and the LLM generates an answer. To reduce harm and misuse, efforts have been made to align
+these LLMs with human values using advanced training techniques such as Reinforcement Learning from
+Human Feedback (RLHF). However, recent studies have highlighted the vulnerability of LLMs to adversarial
+jailbreak attempts that aim to subvert the embedded safety guardrails. To address this challenge,
+we define and investigate the <strong>Refusal Loss</strong> of LLMs and then propose a method called <strong>Gradient Cuff</strong> to
+detect jailbreak attempts. In this demonstration, we first introduce the concept of "jailbreak". Then we present the refusal loss
+landscape and, based on the characteristics of this landscape, propose Gradient Cuff. Lastly, we compare it with other jailbreak defense
+methods and show its defense performance.
+</p>
+
+<h2 id="what-is-jailbreak">What is Jailbreak?</h2>
 <p>Neural network calibration seeks to make a model's prediction confidence align with its true correctness likelihood.
 A well-calibrated model should provide accurate predictions and reliable confidence when making inferences. In
 contrast, a poorly calibrated model has a wide gap between its accuracy and its average confidence level.