Update index.html
index.html +3 -2
index.html
CHANGED
@@ -77,11 +77,12 @@ Exploring Refusal Loss Landscapes </title>
 </div>
 </div>
 
-<h3 id="refusal-loss">Refusal Loss</h3>
+<h3 id="refusal-loss">Refusal Loss Landscape Exploration</h3>
 <p>Current transformer-based LLMs will return different responses to the same query due to the randomness of
 autoregressive sampling-based generation. With this randomness, it is an
 interesting phenomenon that a malicious user query will sometimes be rejected by the target LLM, but
-sometimes be able to bypass the safety guardrail. Based on this observation, we propose a new concept called Refusal Loss
+sometimes be able to bypass the safety guardrail. Based on this observation, we propose a new concept called Refusal Loss to represent the probability with which
+the LLM won't reject the input user query and visualize its 2-d
 landscape below:
 </p>
 
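The added text describes Refusal Loss as the probability with which the LLM won't reject the input query. One way to read that definition is as a quantity that can be estimated by sampling: generate several responses to the same query and count the fraction that are not refusals. The sketch below illustrates that reading only. The keyword list, `is_refusal`, `estimate_refusal_loss`, and the toy sampler are all hypothetical names introduced for illustration, and the keyword check is an assumed stand-in for whatever refusal judge the project actually uses.

```python
import random
from typing import Callable, Sequence

# Hypothetical refusal markers (assumption); a real judge would be more robust.
REFUSAL_MARKERS: Sequence[str] = ("i cannot", "i can't", "i'm sorry", "i am sorry")


def is_refusal(response: str) -> bool:
    """Crude keyword check standing in for a proper refusal judge (assumption)."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def estimate_refusal_loss(
    query: str,
    sample_response: Callable[[str], str],
    num_samples: int = 20,
) -> float:
    """Monte Carlo estimate: fraction of sampled responses that do NOT refuse."""
    if num_samples <= 0:
        raise ValueError("num_samples must be positive")
    not_refused = sum(
        not is_refusal(sample_response(query)) for _ in range(num_samples)
    )
    return not_refused / num_samples


def make_toy_sampler(refusal_rate: float, seed: int = 0) -> Callable[[str], str]:
    """Toy stand-in for sampling from an LLM; refuses with the given probability."""
    rng = random.Random(seed)

    def sample(query: str) -> str:
        if rng.random() < refusal_rate:
            return "I'm sorry, but I can't help with that."
        return "Sure, here is a response."

    return sample


if __name__ == "__main__":
    sampler = make_toy_sampler(refusal_rate=0.7)
    print(estimate_refusal_loss("example query", sampler, num_samples=100))
```

Under this reading, a value near 0 means the query is almost always rejected, and a value near 1 means it almost always gets through. In practice `sample_response` would wrap a call to the target model with sampling enabled.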