File size: 16,283 Bytes
9fade2a bfc5ccd 7261a26 bfc5ccd 7261a26 bfc5ccd 7261a26 bfc5ccd 7261a26 bfc5ccd 7261a26 bfc5ccd b929465 bfc5ccd cf0d3f3 bfc5ccd b929465 bfc5ccd cf0d3f3 6d807bc 0c73edf bfc5ccd cf0d3f3 67c5ec1 ca768c1 bfc5ccd 7261a26 bfc5ccd 9fade2a |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 |
<!DOCTYPE html>
<html lang="en-US">
<head>
<meta charset="UTF-8">
<!-- Begin Jekyll SEO tag v2.8.0 -->
<title>Gradient Cuff | Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by
Exploring Refusal Loss Landscapes </title>
<meta property="og:title" content="Gradient Cuff" />
<meta property="og:locale" content="en_US" />
<meta name="description" content="Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes" />
<meta property="og:description" content="Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes" />
<script type="application/ld+json">
{"@context":"https://schema.org","@type":"WebSite","description":"Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes","headline":"Gradient Cuff","name":"Gradient Cuff","url":"https://huggingface.co/spaces/gregH/Gradient Cuff"}</script>
<!-- End Jekyll SEO tag -->
<link rel="preconnect" href="https://fonts.gstatic.com">
<link rel="preload" href="https://fonts.googleapis.com/css?family=Open+Sans:400,700&display=swap" as="style" type="text/css" crossorigin>
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="theme-color" content="#157878">
<meta name="apple-mobile-web-app-status-bar-style" content="black-translucent">
<link rel="stylesheet" href="assets/css/bootstrap/bootstrap.min.css?v=90447f115a006bc45b738d9592069468b20e2551">
<link rel="stylesheet" href="assets/css/style.css?v=90447f115a006bc45b738d9592069468b20e2551">
<!-- start custom head snippets, customize with your own _includes/head-custom.html file -->
<link rel="stylesheet" href="assets/css/custom_style.css?v=90447f115a006bc45b738d9592069468b20e2551">
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
<link rel="stylesheet" href="https://ajax.googleapis.com/ajax/libs/jqueryui/1.12.1/themes/smoothness/jquery-ui.css">
<script src="https://ajax.googleapis.com/ajax/libs/jqueryui/1.12.1/jquery-ui.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/Chart.js/2.9.4/Chart.js"></script>
<script src="assets/js/calibration.js?v=90447f115a006bc45b738d9592069468b20e2551"></script>
<!-- for mathjax support -->
<script src="https://polyfill.io/v3/polyfill.min.js?features=es6"></script>
<script id="MathJax-script" async src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"></script>
<!-- end custom head snippets -->
</head>
<body>
<a id="skip-to-content" href="#content">Skip to the content.</a>
<header class="page-header" role="banner">
<h1 class="project-name">Gradient Cuff</h1>
<h2 class="project-tagline">Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes</h2>
</header>
<main id="content" class="main-content" role="main">
<h2 id="introduction">Introduction</h2>
<p>Neural network calibration is an essential task in deep learning to ensure consistency
between the confidence of model prediction and the true correctness likelihood. In this
demonstration, we first visualize the idea of neural network calibration on a binary
classifier and show model features that represent its calibration. Second, we introduce
our proposed framework <strong>Neural Clamping</strong>, which employs a simple joint input-output
transformation on a pre-trained classifier. We also provide other calibration approaches
(e.g., temperature scaling) to compare with Neural Clamping.</p>
<h2 id="what-is-jailbreak">What is Calibration?</h2>
<p>Neural Network Calibration seeks to make model prediction align with its true correctness likelihood.
A well-calibrated model should provide accurate predictions and reliable confidence when making inferences. On the
contrary, a poor calibration model would have a wide gap between its accuracy and average confidence level.
This phenomenon could hamper scenarios requiring accurate uncertainty estimation, such as safety-related tasks
(e.g., autonomous driving systems, medical diagnosis, etc.).</p>
<div class="container">
<div id="jailbreak-intro" class="row align-items-center jailbreak-intro-sec">
<img id="jailbreak-intro-img" src="https://hsiung.cc/NCTV/images/conf_acc_demo.gif" />
</div>
</div>
<h3 id="refusal-loss">Calibration Metrics</h3>
<p>Objectively, researchers utilize <strong>Calibration Metrics</strong> to measure the calibration error for a model, for example,
Expected Calibration Error (ECE), Static Calibration Error (SCE), Adaptive Calibration Error (ACE), etc.</p>
<div class="container jailbreak-intro-sec">
<div><img id="jailbreak-intro-img" src="images/metrics/intro-metric-example.png" /></div>
</div>
<div id="refusal-loss-formula" class="container">
<div id="refusal-loss-formula-list" class="row align-items-center formula-list">
<a href="#ECE-formula" class="selected">Refusal Loss</a>
<a href="#SCE-formula">Refusal Loss Approximation</a>
<a href="#ACE-formula">Gradient Estimation</a>
<div style="clear: both"></div>
</div>
<div id="refusal-loss-formula-content" class="row align-items-center">
<span id="ECE-formula" class="formula" style="">$$\displaystyle \phi_\theta(x)=1-\mathbb{E}_{y \sim T_\theta(x)} JB(y)$$</span>
<span id="SCE-formula" class="formula" style="display: none;">$$\displaystyle f_\theta(x)=1-\frac{1}{N}\sum_{i=1}^N JB(y_i)$$</span>
<span id="ACE-formula" class="formula" style="display: none;">$$\displaystyle \text{ACE}=\frac{1}{KR}\sum_{k=1}^{K}\sum_{r=1}^{R}|\text{acc}(r,k)-\text{conf}(r,k)|$$</span>
</div>
</div>
<h2 id="proposed-approach-neural-clamping">Proposed Approach: Neural Clamping</h2>
<div class="container"><img id="calibration-header" src="images/header.png" /></div>
<h2 id="demonstration">Demonstration</h2>
<p>In the current research, a reliability diagram is drawn to show the calibration performance of a model. However, since
reliability diagrams often only provide fixed bar graphs statically, further explanation from the chart is limited. In
this demonstration, we show how to make reliability diagrams interactive and insightful to help researchers and
developers gain more insights from the graph. Specifically, we provide three CIFAR-100 classification models
in this demonstration. Multiple Bin numbers are also support</p>
<p>We hope this tool could also facilitate the development process.</p>
<div id="calibration-demo" class="container">
<div class="row align-items-center">
<div class="row" style="display: none;">
<div class="datasets-list">
<span style="margin-right: 1em;">Datasets</span>
<span class="radio-group"><input type="radio" id="CIFAR-100" class="options" name="datasets" value="cifar100" checked="" /><label for="CIFAR-100" class="option-label">CIFAR-100</label></span>
<span class="radio-group"><input type="radio" id="ImageNet" class="options" name="datasets" value="imagenet" /><label for="ImageNet" class="option-label">ImageNet</label></span>
</div>
</div>
<div class="row" style="margin: 10px 0 0">
<div class="models-list">
<span style="margin-right: 1em;">Models</span>
<span class="radio-group"><input type="radio" id="ResNet110" class="options" name="models" value="resnet110" checked="" /><label for="ResNet110" class="option-label">ResNet110</label></span>
<span class="radio-group"><input type="radio" id="DenseNet121" class="options" name="models" value="densenet121" /><label for="DenseNet121" class="option-label">DenseNet121</label></span>
<span class="radio-group"><input type="radio" id="WideResNet40-10" class="options" name="models" value="wideresnet40_10" /><label for="WideResNet40-10" class="option-label">WideResNet40-10</label></span>
</div>
</div>
</div>
<div class="row align-items-center">
<div class="col-4">
<div id="toolbox">
<div class="row align-items-center" style="margin-top: 2em;"><input type="radio" id="tool_none" class="options" name="calibration_tool" value="none" checked="" /><label for="tool_none" class="calibrate-tool">None</label></div>
<div class="row align-items-center"><input type="radio" id="tool_ts" class="options" name="calibration_tool" value="ts" /><label for="tool_ts" class="calibrate-tool">Temp. Scaling</label></div>
<div class="row align-items-center"><input type="radio" id="tool_delta" class="options" name="calibration_tool" value="delta" /><label for="tool_delta" class="calibrate-tool">Univ. Perturbation</label></div>
<div class="row align-items-center"><input type="radio" id="tool_neural_clamping" class="options" name="calibration_tool" value="neural_clamping" /><label for="tool_neural_clamping" class="calibrate-tool"><span style="font-weight: bold;">Neural Clamping</span></label></div>
</div>
<div class="row align-items-center">
<div class="legend"><img src="images/demo-legend.png" alt="legend" /></div>
<div class="figure-option"><label class="container" for="ActualOnly">Actual Only<input id="ActualOnly" type="checkbox" name="ActualOnly" value="Actual Only" onchange="figureOption()" /><span class="checkmark"></span></label></div>
</div>
<div class="row align-items-center">
<div class="calibration-error"><span class="calibration-metric">Expected Calibration Error</span><span class="calibration-error-value" id="ece-value">0.10731</span></div>
</div>
</div>
<div class="col-8">
<figure class="figure">
<img id="reliability-diagram" src="images/cifar100/resnet110/none/bin15.png" alt="CIFAR-100 Calibrated Reliability Diagram (Full)" />
<div class="slider-container">
<div class="slider-label"><span>Bin Number</span></div>
<div class="slider-content" id="bin-slider"><div id="bin-num" class="ui-slider-handle"></div></div>
</div>
<div class="slider-container">
<div class="slider-label"><span>Temp Scaling</span></div>
<div class="slider-content" id="ts-slider"><div id="temp-scale" class="slider-value ui-slider-handle"></div></div>
</div>
<figcaption class="figure-caption">
</figcaption>
</figure>
</div>
</div>
</div>
<h2 id="use-nctookit-to-calibrate-your-own-models">Use NCTookit to Calibrate Your Own Models</h2>
<p>Quick Start by running the following code! Or, <a href="https://colab.research.google.com/drive/1HosL29iJxK7Z8wNR9X3aWCgvbdu1ZgFu"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" /></a>.
Using this tool, users can use our proposed package, \(\texttt{NCTookit}\), to calibrate the model.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># !pip install -q git+https://github.com/yungchentang/NCToolkit.git
</span><span class="kn">from</span> <span class="nn">neural_clamping.nc_wrapper</span> <span class="kn">import</span> <span class="n">NCWrapper</span>
<span class="kn">from</span> <span class="nn">neural_clamping.utils</span> <span class="kn">import</span> <span class="n">load_model</span><span class="p">,</span> <span class="n">load_dataset</span><span class="p">,</span> <span class="n">model_classes</span><span class="p">,</span> <span class="n">plot_reliability_diagram</span>
<span class="c1"># Load model
</span><span class="n">model</span> <span class="o">=</span> <span class="n">load_model</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s">'ARCHITECTURE'</span><span class="p">,</span> <span class="n">data</span><span class="o">=</span><span class="s">'DATASET'</span><span class="p">,</span> <span class="n">checkpoint_path</span><span class="o">=</span><span class="s">'CHECKPOINT_PATH'</span><span class="p">)</span>
<span class="n">num_classes</span> <span class="o">=</span> <span class="n">model_classes</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="s">'DATASET'</span><span class="p">)</span>
<span class="c1"># Dataset loader
</span><span class="n">valloader</span> <span class="o">=</span> <span class="n">load_dataset</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="s">'DATASET'</span><span class="p">,</span> <span class="n">split</span><span class="o">=</span><span class="s">'val'</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="s">"BATCH_SIZE"</span><span class="p">)</span>
<span class="n">testloader</span> <span class="o">=</span> <span class="n">load_dataset</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="s">'DATASET'</span><span class="p">,</span> <span class="n">split</span><span class="o">=</span><span class="s">'test'</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="s">"BATCH_SIZE"</span><span class="p">)</span>
<span class="c1"># Build Neural Clamping framework
</span><span class="n">nc</span> <span class="o">=</span> <span class="n">NCWrapper</span><span class="p">(</span><span class="n">model</span><span class="o">=</span><span class="n">model</span><span class="p">,</span> <span class="n">num_classes</span><span class="o">=</span><span class="n">num_classes</span><span class="p">,</span> <span class="p">...)</span>
<span class="c1"># Calibrated using Neural Clamping
</span><span class="n">nc</span><span class="p">.</span><span class="n">train_NC</span><span class="p">(</span><span class="n">val_loader</span><span class="o">=</span><span class="n">valloader</span><span class="p">,</span> <span class="n">epoch</span><span class="o">=</span><span class="s">'EPOCH'</span><span class="p">,</span> <span class="p">...)</span>
<span class="c1"># General Evaluation
</span><span class="n">nc</span><span class="p">.</span><span class="n">test_with_NC</span><span class="p">(</span><span class="n">test_loader</span><span class="o">=</span><span class="n">testloader</span><span class="p">)</span>
<span class="c1"># Visualization
</span><span class="n">bin_acc</span><span class="p">,</span> <span class="n">conf_axis</span><span class="p">,</span> <span class="n">ece_score</span> <span class="o">=</span> <span class="n">nc</span><span class="p">.</span><span class="n">reliability_diagram</span><span class="p">(</span><span class="n">test_loader</span><span class="o">=</span><span class="n">testloader</span><span class="p">,</span> <span class="n">rd_criterion</span><span class="o">=</span><span class="s">"ECE"</span><span class="p">,</span> <span class="n">n_bins</span><span class="o">=</span><span class="mi">30</span><span class="p">)</span>
<span class="n">plot_reliability_diagram</span><span class="p">(</span><span class="n">conf_axis</span><span class="p">,</span> <span class="n">bin_acc</span><span class="p">)</span>
</code></pre></div></div>
<h2 id="citations">Citations</h2>
<p>If you find Neural Clamping helpful and useful for your research, please cite our main paper as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@inproceedings{hsiung2023nctv,
title={{NCTV: Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes}},
author={Lei Hsiung, Yung-Chen Tang and Pin-Yu Chen and Tsung-Yi Ho},
booktitle={Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence},
publisher={Association for the Advancement of Artificial Intelligence},
year={2023},
month={February}
}
@misc{tang2022neural_clamping,
title={{Neural Clamping: Joint Input Perturbation and Temperature Scaling for Neural Network Calibration}},
author={Yung-Chen Tang and Pin-Yu Chen and Tsung-Yi Ho},
year={2022},
eprint={2209.11604},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
</code></pre></div></div>
<footer class="site-footer">
<span class="site-footer-owner">NCTV is maintained by <a href="https://hsiung.cc">Lei Hsiung</a> and <a href="https://github.com/yungchentang">Yung-Chen Tang</a>.</span>
</footer>
</main>
</body>
</html>
|