update more results
index.html (CHANGED, +25 -4)
@@ -42,6 +42,8 @@
 <link rel="stylesheet" href="https://code.jquery.com/ui/1.12.1/themes/base/jquery-ui.css">
 <script src="https://code.jquery.com/jquery-3.6.0.min.js"></script>
 <script src="https://code.jquery.com/ui/1.12.1/jquery-ui.min.js"></script>
+<script src="https://polyfill.io/v3/polyfill.min.js?features=es6"></script>
+<script id="MathJax-script" async src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"></script>
 <script>
 $( function() {
 $( "#tabs" ).tabs();
@@ -615,9 +617,9 @@
 <div class="container-centered">
 <div class="row">
 <div class="col-md-10 col-md-offset-1">
-<
+<h2 id="Demo">
 Demo:
-</
+</h2>
 <div class="text-justify">
 We present a few jailbreak examples of the performance of our trained DPPs under both LLAMA-2-7B-Chat and MISTRAL-7B-Instruct-v0.2 models. <span class="red-text">Note that some of the response contents contain harmful information.</span>
 </div>
@@ -704,13 +706,32 @@
 </div>
 </div>
 </section>
-
+<!-- Results -->
 <section class="section">
 <div class="container is-max-desktop">
 <div class="columns is-centered">
 <div class="container-centered">
-<h2 class="title is-3">
+<h2 class="title is-3">Results</h2>
 <div class="content has-text-justified">
+<p>In this section, we present our <strong>numerical results</strong> and <strong>our trained DPP</strong> on both LLAMA-2-Chat
+and MISTRAL-7B-Instruct-v0.2.</p>
+<h2>Evaluation Metrics:</h2>
+<ul>
+<li><strong>Attack Success Rate:</strong> We use the Attack Success Rate (ASR) as our primary metric for evaluating the effectiveness of jailbreak defenses.
+The ASR measures the proportion of malicious queries that successfully bypass the LLM's alignment and elicit harmful responses.
+<p><b>ASR</b> is defined as:</p>
+<p>\[
+\textbf{ASR} = \frac{\text{Number of jailbroken queries}}{\text{Total number of queries}}
+\]</p>
+<p>Here, the number of jailbroken queries is computed by sub-string matching. Specifically, a generated response to a jailbreak query is evaluated as <b>jailbroken</b> if it contains any sub-string from the pre-defined sub-string set \(S\); otherwise, it is <b>non-jailbroken</b>.</p>
+<p>The function that determines whether a response is jailbroken can be expressed as:</p>
+<p>\[
+\text{JailBroken}(\text{response}) = \begin{cases}
+1, & \text{if the response contains some } s \in S; \\
+0, & \text{otherwise.}
+\end{cases}
+\]</p></li>
+</ul>
 </div>
 </div>
 </div>
|