File size: 11,382 Bytes
9fade2a bfc5ccd 7261a26 bfc5ccd 7261a26 bfc5ccd 7261a26 bfc5ccd 7261a26 bfc5ccd 2654ca5 bfc5ccd 2654ca5 bfc5ccd b929465 bfc5ccd cf0d3f3 bfc5ccd b929465 bfc5ccd cf0d3f3 6d807bc 0c73edf bfc5ccd cf0d3f3 67c5ec1 ca768c1 0b1f8c9 bfc5ccd f420005 bfc5ccd f420005 bfc5ccd 4b4042c bfc5ccd fd620cd bfc5ccd 6799636 232a1d9 0730956 1166ace 0730956 bfc5ccd c9fcacb 7ef77c5 c9fcacb bfc5ccd c9fcacb bfc5ccd ab9235e 0730956 bfc5ccd ab9235e 19975bf bfc5ccd 7261a26 bfc5ccd 9fade2a |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 |
<!DOCTYPE html>
<html lang="en-US">
<head>
<meta charset="UTF-8">
<!-- Begin Jekyll SEO tag v2.8.0 -->
<title>Gradient Cuff | Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by
Exploring Refusal Loss Landscapes </title>
<meta property="og:title" content="Gradient Cuff" />
<meta property="og:locale" content="en_US" />
<meta name="description" content="Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes" />
<meta property="og:description" content="Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes" />
<script type="application/ld+json">
{"@context":"https://schema.org","@type":"WebSite","description":"Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes","headline":"Gradient Cuff","name":"Gradient Cuff","url":"https://huggingface.co/spaces/gregH/Gradient Cuff"}</script>
<!-- End Jekyll SEO tag -->
<link rel="preconnect" href="https://fonts.gstatic.com">
<link rel="preload" href="https://fonts.googleapis.com/css?family=Open+Sans:400,700&display=swap" as="style" type="text/css" crossorigin>
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="theme-color" content="#157878">
<meta name="apple-mobile-web-app-status-bar-style" content="black-translucent">
<link rel="stylesheet" href="assets/css/bootstrap/bootstrap.min.css?v=90447f115a006bc45b738d9592069468b20e2551">
<link rel="stylesheet" href="assets/css/style.css?v=90447f115a006bc45b738d9592069468b20e2551">
<!-- start custom head snippets, customize with your own _includes/head-custom.html file -->
<link rel="stylesheet" href="assets/css/custom_style.css?v=90447f115a006bc45b738d9592069468b20e2551">
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
<link rel="stylesheet" href="https://ajax.googleapis.com/ajax/libs/jqueryui/1.12.1/themes/smoothness/jquery-ui.css">
<script src="https://ajax.googleapis.com/ajax/libs/jqueryui/1.12.1/jquery-ui.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/Chart.js/2.9.4/Chart.js"></script>
<script src="assets/js/calibration.js?v=90447f115a006bc45b738d9592069468b20e2551"></script>
<!-- for mathjax support -->
<script src="https://polyfill.io/v3/polyfill.min.js?features=es6"></script>
<script id="MathJax-script" async src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"></script>
<!-- end custom head snippets -->
</head>
<body>
<a id="skip-to-content" href="#content">Skip to the content.</a>
<header class="page-header" role="banner">
<h1 class="project-name">Gradient Cuff</h1>
<h2 class="project-tagline">Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes</h2>
</header>
<main id="content" class="main-content" role="main">
<h2 id="introduction">Introduction</h2>
<p>Large Language Models (LLMs) are becoming a prominent generative AI tool, where the user enters a
query and the LLM generates an answer. To reduce harm and misuse, efforts have been made to align
these LLMs to human values using advanced training techniques such as Reinforcement Learning from
Human Feedback (RLHF). However, recent studies have highlighted the vulnerability of LLMs to adversarial
jailbreak attempts aiming at subverting the embedded safety guardrails. To address this challenge,
we define and investigate the \textbf{Refusal Loss} of LLMs and then propose a method called \textbf{Gradient Cuff} to
detect jailbreak attempts. In this demonstration, we first introduce the concept of "Jailbreak". Then we present the refusal loss
landscape and based on the characteristics of this landscape to propose the Gradient Cuff. Lastly, we compare it with other jailbreak defense
methods and show the defense performance.
</p>
<h2 id="what-is-jailbreak">What is Jailbreak?</h2>
<p>Neural Network Calibration seeks to make model prediction align with its true correctness likelihood.
A well-calibrated model should provide accurate predictions and reliable confidence when making inferences. On the
contrary, a poor calibration model would have a wide gap between its accuracy and average confidence level.
This phenomenon could hamper scenarios requiring accurate uncertainty estimation, such as safety-related tasks
(e.g., autonomous driving systems, medical diagnosis, etc.).</p>
<div class="container">
<div id="jailbreak-intro" class="row align-items-center jailbreak-intro-sec">
<img id="jailbreak-intro-img" src="https://hsiung.cc/NCTV/images/conf_acc_demo.gif" />
</div>
</div>
<h3 id="refusal-loss">Calibration Metrics</h3>
<p>Objectively, researchers utilize <strong>Calibration Metrics</strong> to measure the calibration error for a model, for example,
Expected Calibration Error (ECE), Static Calibration Error (SCE), Adaptive Calibration Error (ACE), etc.</p>
<div class="container jailbreak-intro-sec">
<div><img id="jailbreak-intro-img" src="images/metrics/intro-metric-example.png" /></div>
</div>
<div id="refusal-loss-formula" class="container">
<div id="refusal-loss-formula-list" class="row align-items-center formula-list">
<a href="#ECE-formula" class="selected">Refusal Loss</a>
<a href="#SCE-formula">Refusal Loss Approximation</a>
<a href="#ACE-formula">Gradient Estimation</a>
<div style="clear: both"></div>
</div>
<div id="refusal-loss-formula-content" class="row align-items-center">
<span id="ECE-formula" class="formula" style="">$$\displaystyle \phi_\theta(x)=1-\mathbb{E}_{y \sim T_\theta(x)} JB(y)$$</span>
<span id="SCE-formula" class="formula" style="display: none;">$$\displaystyle f_\theta(x)=1-\frac{1}{N}\sum_{i=1}^N JB(y_i)$$</span>
<span id="ACE-formula" class="formula" style="display: none;">$$\displaystyle g_\theta(x)=\sum_{i=1}^P \frac{f_\theta(x\oplus \mu u_i)-f_\theta(x)}{\mu} u_i $$</span>
</div>
</div>
<h2 id="proposed-approach-gradient-cuff">Proposed Approach: Gradient Cuff</h2>
<div class="container"><img id="gradient-cuff-header" src="images/header.png" /></div>
<h2 id="demonstration">Demonstration</h2>
<p>In the current research, a reliability diagram is drawn to show the calibration performance of a model. However, since
reliability diagrams often only provide fixed bar graphs statically, further explanation from the chart is limited. In
this demonstration, we show how to make reliability diagrams interactive and insightful to help researchers and
developers gain more insights from the graph. Specifically, we provide three CIFAR-100 classification models
in this demonstration. Multiple Bin numbers are also support</p>
<p>We hope this tool could also facilitate the development process.</p>
<div id="jailbreak-demo" class="container">
<div class="row align-items-center">
<div class="row" style="margin: 10px 0 0">
<div class="models-list">
<span style="margin-right: 1em;">Models</span>
<span class="radio-group"><input type="radio" id="LLaMA2" class="options" name="models" value="llama2_7b_chat" checked="" /><label for="LLaMA2" class="option-label">LLaMA-2-7B-Chat</label></span>
<span class="radio-group"><input type="radio" id="Vicuna" class="options" name="models" value="vicuna_7b_v1.5" /><label for="Vicuna" class="option-label">Vicuna-7B-V1.5</label></span>
</div>
</div>
</div>
<div class="row align-items-center">
<div class="col-4">
<div id="defense-methods">
<div class="row align-items-center"><input type="radio" id="defense_ppl" class="options" name="defense" value="ppl" /><label for="defense_ppl" class="defense">Perplexity Filter</label></div>
<div class="row align-items-center"><input type="radio" id="defense_smoothllm" class="options" name="defense" value="smoothllm" /><label for="defense_smoothllm" class="defense">SmoothLLM</label></div>
<div class="row align-items-center"><input type="radio" id="defense_erase_check" class="options" name="defense" value="erase_check" /><label for="defense_erase_check" class="defense">Erase-Check</label></div>
<div class="row align-items-center"><input type="radio" id="defense_self_reminder" class="options" name="defense" value="self_reminder" /><label for="defense_self_reminder" class="defense">Self-Reminder</label></div>
<div class="row align-items-center"><input type="radio" id="defense_gradient_cuff" class="options" name="defense" value="gradient_cuff" /><label for="defense_gradient_cuff" class="defense"><span style="font-weight: bold;">Gradient Cuff</span></label></div>
</div>
<div class="row align-items-center">
<div class="legend"><img src="images/demo-legend.png" alt="legend" /></div>
</div>
<div class="row align-items-center">
<div class="attack-success-rate"><span class="jailbreak-metric">Average Malicious Refusal Rate</span><span class="attack-success-rate-value" id="asr-value">0.95875</span></div>
</div>
<div class="row align-items-center">
<div class="benign-refusal-rate"><span class="jailbreak-metric">Benign Refusal Rate</span><span class="benign-refusal-rate-value" id="brr-value">0.05000</span></div>
</div>
</div>
<div class="col-8">
<figure class="figure">
<img id="reliability-diagram" src="demo_results/gradient_cuff_llama2_7b_chat_threshold_100.png" alt="CIFAR-100 Calibrated Reliability Diagram (Full)" />
<div class="slider-container">
<div class="slider-label"><span>Perplexity Threshold</span></div>
<div class="slider-content" id="ppl-slider"><div id="ppl-threshold" class="ui-slider-handle"></div></div>
</div>
<div class="slider-container">
<div class="slider-label"><span>Gradient Threshold</span></div>
<div class="slider-content" id="gradient-norm-slider"><div id="gradient-norm-threshold" class="slider-value ui-slider-handle"></div></div>
</div>
<figcaption class="figure-caption">
</figcaption>
</figure>
</div>
</div>
</div>
<h2 id="citations">Citations</h2>
<p>If you find Neural Clamping helpful and useful for your research, please cite our main paper as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@inproceedings{hsiung2023nctv,
title={{NCTV: Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes}},
author={Lei Hsiung, Yung-Chen Tang and Pin-Yu Chen and Tsung-Yi Ho},
booktitle={Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence},
publisher={Association for the Advancement of Artificial Intelligence},
year={2023},
month={February}
}
@misc{tang2022neural_clamping,
title={{Neural Clamping: Joint Input Perturbation and Temperature Scaling for Neural Network Calibration}},
author={Yung-Chen Tang and Pin-Yu Chen and Tsung-Yi Ho},
year={2022},
eprint={2209.11604},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
</code></pre></div></div>
<footer class="site-footer">
<span class="site-footer-owner">NCTV is maintained by <a href="https://hsiung.cc">Lei Hsiung</a> and <a href="https://github.com/yungchentang">Yung-Chen Tang</a>.</span>
</footer>
</main>
</body>
</html>
|