Spaces:

TrustSafeAI
/

Defensive-Prompt-Patch-Jailbreak-Defense

Running

App Files Files Community

bxiong commited on May 30

Commit

33fbcd5

•

1 Parent(s): ccca66f

update more results

Browse files

Files changed (1) hide show

index.html +25 -4

index.html CHANGED Viewed

@@ -42,6 +42,8 @@
   <link rel="stylesheet" href="https://code.jquery.com/ui/1.12.1/themes/base/jquery-ui.css">
   <script src="https://code.jquery.com/jquery-3.6.0.min.js"></script>
   <script src="https://code.jquery.com/ui/1.12.1/jquery-ui.min.js"></script>
   <script>
   $( function() {
     $( "#tabs" ).tabs();
@@ -615,9 +617,9 @@
 <div class="container-centered">
 <div class="row">
   <div class="col-md-10 col-md-offset-1">
-  <h3 id="Demo">
   Demo:
-  </h3>
   <div class="text-justify">
   We present a few jailbreak examples of the performance of our trained DPPs under both LLAMA-2-7B-Chat and MISTRAL-7B-Instruct-v0.2 models. <span class="red-text">Note that some of the response contents contain harmful information.</span>
   </div>
@@ -704,13 +706,32 @@
 </div>
 </div>
 </section>
 <section class="section">
 <div class="container is-max-desktop">
 <div class="columns is-centered">
 <div class="container-centered">
-  <h2 class="title is-3">Abstract</h2>
         <div class="content has-text-justified">
 </div>
 </div>
 </div>

   <link rel="stylesheet" href="https://code.jquery.com/ui/1.12.1/themes/base/jquery-ui.css">
   <script src="https://code.jquery.com/jquery-3.6.0.min.js"></script>
   <script src="https://code.jquery.com/ui/1.12.1/jquery-ui.min.js"></script>
+  <script src="https://polyfill.io/v3/polyfill.min.js?features=es6"></script>
+  <script id="MathJax-script" async src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"></script>
   <script>
   $( function() {
     $( "#tabs" ).tabs();
 <div class="container-centered">
 <div class="row">
   <div class="col-md-10 col-md-offset-1">
+  <h2 id="Demo">
   Demo:
+  </h2>
   <div class="text-justify">
   We present a few jailbreak examples of the performance of our trained DPPs under both LLAMA-2-7B-Chat and MISTRAL-7B-Instruct-v0.2 models. <span class="red-text">Note that some of the response contents contain harmful information.</span>
   </div>
 </div>
 </div>
 </section>
+<!-- Results -->
 <section class="section">
 <div class="container is-max-desktop">
 <div class="columns is-centered">
 <div class="container-centered">
+  <h2 class="title is-3">Results</h2>
         <div class="content has-text-justified">
+            <p>In this section we want to show our <strong>numerical results</strong> as well as <strong>our trained DPP</strong> on both LLAMA-2-Chat
+              and MISTRAL-7B-Instruct-v0.2.</p>
+          <h2>Evaluation Metrics:</h2>
+          <ul>
+            <li><strong>Attack Success Rate:</strong>We use the Attack Success Rate (ASR) as our primary metric for evaluating the effectiveness of jailbreak defenses.
+              The ASR measures the proportion of malicious queries that successfully bypass the LLMs alignment and generate harmful responses.</li>
+            <p><b>ASR</b> is defined as:</p>
+    <p>\[
+    \textbf{ASR} = \frac{\text{Number\_of\_jailbreak\_queries}}{\text{Total\_queries}}
+    \]</p>
+    <p>Here the \(\text{Number\_of\_jailbreak\_queries}\) is calculated through the sub-strings matching. Specifically, for a given generated response of a jailbreak query, if the response contains sub-strings that exist in the pre-defined sub-string set \(S\). Then, it will be evaluated as <b>jailbroken</b>, otherwise it is <b>non-jailbroken</b>.</p>
+    <p>The function to determine if a response is jailbroken can be expressed as:</p>
+    <p>\[
+    \text{JailBroken}(\text{response}) = \begin{cases}
+    1, & \text{if response contains any keyword;} \\
+    0, & \text{otherwise.}
+    \end{cases}
+    \]</p>
+          </ul>
 </div>
 </div>
 </div>