janraasch committed on
Commit
5b51887
0 Parent(s):

Initial commit

Files changed (5)
  1. .gitattributes +35 -0
  2. .gitignore +1 -0
  3. README.md +55 -0
  4. app.py +335 -0
  5. requirements.txt +4 -0
.gitattributes ADDED
@@ -0,0 +1,35 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED
@@ -0,0 +1 @@
+ gradio-env
README.md ADDED
@@ -0,0 +1,55 @@
+ ---
+ title: Activate Love
+ emoji: ❤️
+ colorFrom: purple
+ colorTo: red
+ sdk: gradio
+ sdk_version: 4.31.5
+ app_file: app.py
+ pinned: true
+ license: mit
+ short_description: Steering AI Text Generation
+ ---
+
+ # Activate Love ❤️
+
+ A [Gradio App][gradio-url] replicating results of the paper [»Activation Addition: Steering Language Models Without Optimization«][paper-url] on a [Hugging Face Space][hugging-face-spaces-url].
+
+ ## Demo
+
+ Check it out at https://huggingface.co/spaces/janraasch/activate-love 🎯.
+
+ ## Raison d'être
+
+ This is my final project for the [AI Safety Fundamentals][ai-safety-fundamentals-url] course on [AI Alignment][ai-safety-fundamentals-alignment-url].
+
+ When we covered the topic of *Mechanistic Interpretability* in session six, my cohort's instructor mentioned [the paper on activation addition][paper-url] published in late 2023. I found this to be an enjoyable and interesting way to play around with the inner workings of a model without any training or optimization.
+
+ The authors kindly provide [a notebook on Google Colab][notebook-url] for everyone to replicate their results. Still, I felt it would be useful to offer an even more user-friendly, non-technical interface to lower the barrier to interacting with these low-level workings of the model.
+
+ Hence this app, https://huggingface.co/spaces/janraasch/activate-love, exists so that *everyone* can steer and play with [GPT-2 XL][gpt2-xl-url].
+
+ ## Development
+
+ ```bash
+ # Create virtual environment
+ python3 -m venv gradio-env
+ source gradio-env/bin/activate
+
+ # Install dependencies
+ pip install -r requirements.txt
+
+ # Run app locally
+ gradio app.py
+ ```
+
+ ## License
+ [MIT License](https://en.wikipedia.org/wiki/MIT_License) © [Jan Raasch](https://www.janraasch.com)
+
+ [ai-safety-fundamentals-alignment-url]: https://aisafetyfundamentals.com/alignment
+ [ai-safety-fundamentals-url]: https://aisafetyfundamentals.com
+ [gpt2-xl-url]: https://huggingface.co/openai-community/gpt2-xl
+ [gradio-url]: https://www.gradio.app
+ [hugging-face-spaces-url]: https://huggingface.co/spaces/launch
+ [paper-url]: https://arxiv.org/abs/2308.10248
+ [notebook-url]: http://tinyurl.com/actadd
app.py ADDED
@@ -0,0 +1,335 @@
+ import spaces
+ import gradio as gr
+
+ import time
+ import torch
+ from transformer_lens import HookedTransformer
+ from typing import List
+
+ # Save memory
+ torch.set_grad_enabled(False)
+
+ # Mock model for faster UI testing & feedback
+ UI_DEVELOPMENT = False
+
+ if not UI_DEVELOPMENT:
+     model = HookedTransformer.from_pretrained("gpt2-xl")
+     model.eval()
+     if torch.cuda.is_available():
+         model.to("cuda")
+ else:
+     model = "toy"  # :)
+
+ SEED = 0
+ sampling_kwargs = dict(temperature=1.0, top_p=0.3, freq_penalty=1.0)
+ example_count = 4
+
+
+ def get_token_length(prompt):
+     return model.to_tokens(prompt).shape[1]
+
+
+ def add_padding_right(prompt, length):
+     return prompt + " " * (length - get_token_length(prompt))
+
+
+ def add_padding(prompt_add, prompt_sub):
+     padding_size = max(get_token_length(prompt_add), get_token_length(prompt_sub))
+     return add_padding_right(prompt_add, padding_size), add_padding_right(
+         prompt_sub, padding_size
+     )
+
+
+ def get_resid_pre(prompt: str, layer: int):
+     name = f"blocks.{layer}.hook_resid_pre"
+     cache, caching_hooks, _ = model.get_caching_hooks(lambda n: n == name)
+     with model.hooks(fwd_hooks=caching_hooks):
+         _ = model(prompt)
+     return cache[name]
+
+
+ def get_activations(prompt_add: str, prompt_sub: str, layer: int):
+     act_add = get_resid_pre(prompt_add, layer)
+     act_sub = get_resid_pre(prompt_sub, layer)
+     act_diff = act_add - act_sub
+
+     print("Activation Difference:")
+     print(act_diff.shape)
+
+     return act_diff
+
+
+ def create_hook(act_diff: torch.Tensor, coeff: int):
+     def ave_hook(resid_pre, hook):
+         if resid_pre.shape[1] == 1:
+             return  # caching in model.generate for new tokens
+
+         # We only add to the prompt (first call), not the generated tokens.
+         ppos, apos = resid_pre.shape[1], act_diff.shape[1]
+
+         if apos > ppos:
+             raise gr.Error(
+                 f"More mod tokens ({apos}) than PROMPT tokens ({ppos}). Try a **longer** PROMPT."
+             )
+
+         # add to the beginning (position-wise) of the activations
+         resid_pre[:, :apos, :] += coeff * act_diff
+
+     return ave_hook
+
+
+ def hooked_generate(prompt_batch: List[str], fwd_hooks=[], seed=None, **kwargs):
+     if seed is not None:
+         torch.manual_seed(seed)
+
+     with model.hooks(fwd_hooks=fwd_hooks):
+         tokenized = model.to_tokens(prompt_batch)
+         r = model.generate(input=tokenized, max_new_tokens=50, do_sample=True, **kwargs)
+     return r
+
+
+ def config_to_str(prompt, prompt_sub, prompt_add, coeff, act_name, no_steering_input):
+     if no_steering_input:
+         return "NO STEERING: TRUE"
+     return f"""PROMPT: {prompt}
+ FROM: {prompt_sub}
+ TO: {prompt_add}
+ MULTIPLIER: {coeff}
+ LAYER: {act_name}"""
+
+
+ def config_header_str():
+     return f"{'='*8} CONFIGURATION {'='*8}"
+
+
+ def sample_header_str(i: int):
+     return f"{'='*11} SAMPLE {i+1} {'='*11}"
+
+
+ def results_to_ui_output(
+     results, prompt, prompt_sub, prompt_add, coeff, act_name, no_steering_input
+ ):
+     config_str = config_to_str(
+         prompt, prompt_sub, prompt_add, coeff, act_name, no_steering_input
+     )
+     header_str = f"{config_header_str()}\n\n{config_str}"
+     body_str = "\n\n".join(
+         [f"{sample_header_str(i)}\n\n{r}" for i, r in enumerate(results)]
+     )
+     return f"{header_str}\n\n{body_str}"
+
+
+ @spaces.GPU
+ def predict(
+     prompt: str,
+     prompt_sub: str = "",
+     prompt_add: str = "",
+     coeff: int = 12,
+     act_name: int = 6,
+     no_steering_input: bool = False,
+ ):
+     if prompt_sub == "":
+         raise gr.Error(
+             "Please input a FROM option. Could be a single space character, a word or a phrase"
+         )
+     if prompt_add == "":
+         raise gr.Error(
+             "Please input a TO option. Could be a single space character, a word or a phrase"
+         )
+
+     print("Text generation begin:")
+     time_stamp = time.time()
+     print("Parameters:")
+     print("prompt:", prompt)
+     print("prompt_sub:", prompt_sub)
+     print("prompt_add:", prompt_add)
+     print("coeff:", coeff)
+     print("act_name:", act_name)
+     print("no_steering_input:", no_steering_input)
+
+     if not UI_DEVELOPMENT and not no_steering_input:
+         padded_prompt_add, padded_prompt_sub = add_padding(prompt_add, prompt_sub)
+         act_diff = get_activations(padded_prompt_add, padded_prompt_sub, act_name)
+         ave_hook = create_hook(act_diff, coeff)
+         editing_hooks = [(f"blocks.{act_name}.hook_resid_pre", ave_hook)]
+         res = hooked_generate(
+             [prompt] * example_count, editing_hooks, seed=SEED, **sampling_kwargs
+         )
+
+         # Remove beginning of sequence token
+         res_str = model.to_string(res[:, 1:])
+     else:
+         if not UI_DEVELOPMENT and no_steering_input:
+             res_str = hooked_generate(
+                 [prompt] * example_count, [], seed=SEED, **sampling_kwargs
+             )
+
+             # Remove beginning of sequence token
+             res_str = model.to_string(res_str[:, 1:])
+         else:
+             res_str = [
+                 "To visit the Berlin wall people have to go to the wall.",
+                 "To visit the Berlin wall people have to go to a museum.",
+             ]
+
+     ui_result = results_to_ui_output(
+         res_str, prompt, prompt_sub, prompt_add, coeff, act_name, no_steering_input
+     )
+
+     print(f"Text generation end after {time.time() - time_stamp:.2f} seconds:")
+     print(ui_result)
+
+     return ui_result
+
+
+ options_accordion = gr.Accordion(label="Steering Options", open=True)
+
+ prompt_sub_input = gr.Textbox(
+     lines=1,
+     label="FROM",
+     info='Enter a prompt that you want to steer the AI output away from. \
+ This can be a single word or a whole phrase. E.g. \
+ "The Berlin Wall is in Berlin" or "Hate".',
+     value="Hate",
+ )
+
+ prompt_add_input = gr.Textbox(
+     lines=1,
+     label="TO",
+     info='Enter a prompt that you want to steer the AI output towards. \
+ This can be a single word or a whole phrase. E.g. \
+ "The Berlin Wall is in Hamburg" or "Love".',
+     value="Love",
+ )
+
+ coeff_input = gr.Slider(
+     minimum=0,
+     maximum=100,
+     step=1,
+     label="MULTIPLIER",
+     info="The strength of the steering. Higher values will steer the AI output more towards the TO prompt. Be careful not to oversteer and break the AI's semantic capabilities!",
+     value=12,
+ )
+
+ act_name_input = gr.Slider(
+     minimum=0,
+     maximum=47,
+     step=1,
+     label="LAYER",
+     info="The layer of the model to steer. Higher layers are more abstract. However, steering at lower layers can lead to more coherent output. Experiment to find the best layer for your use case.",
+     value=6,
+ )
+
+ no_steering_input = gr.Checkbox(
+     label="No Steering",
+     info="Check this box to generate text without steering.",
+     value=False,
+ )
+
+ message_input = gr.Textbox(
+     lines=1,
+     label="PROMPT",
+     info='Enter a message to be completed by the AI. E.g. "I hate you because".',
+     placeholder="Enter a message to generate text.",
+     value="I hate you because",
+ )
+
+ text_output = gr.Textbox(
+     label="AI Text Generator",
+     lines=24,
+     max_lines=24,
+     placeholder="Hi, I am an AI Text Generator. \n\nPlease don't steer me the wrong way! 🤖",
+     show_copy_button=True,
+ )
+
+ CSS = """\
+ .prose {
+     color: var(--block-title-text-color);
+ }
+ .block:has(.prose) {
+     border: solid var(--panel-border-width) var(--panel-border-color);
+     border-radius: var(--container-radius);
+     background: var(--panel-background-fill);
+     padding: var(--spacing-lg);
+ }
+ """
+
+ DESCRIPTION = """\
+ AI Text Generation can seem magical and inscrutable, but [recent research](https://arxiv.org/abs/2308.10248) has shown that it is possible to steer the output of a model by modifying its activations. Even better, it is quite intuitive and fun!
+
+ This demo allows you to input a message and two prompts, and then steer the model's output towards one prompt and away from another. You can also control the strength of the steering and the layer of the model to steer. Try it out and see what you can create!
+
+ If you end up with something you like, feel free to share it with us [on the community tab](https://huggingface.co/spaces/janraasch/activate-love/discussions). We would love to see what you come up with!
+
+ You can use the »copy« button in the upper right corner of the generated text box to copy your results to your clipboard. Have fun exploring the interface! 🚀
+
+ Learn more about the research behind this below. 📚
+
+ CONTENT WARNING: This interface allows you to manipulate and steer the outputs of [a large language model (GPT2-XL)](https://huggingface.co/openai-community/gpt2-xl) trained on a broad corpus of online data. The model's outputs may contain biased, offensive, explicit, or otherwise harmful content. Use this interface cautiously and at your own risk. We recommend parental guidance for minors.
+ """
+
+ ARTICLE = """\
+ # Activation Addition: Steering GPT2 Without Optimization
+
+ This Space replicates results from the paper [Activation Addition: Steering Language Models Without Optimization](https://arxiv.org/abs/2308.10248) and provides a user-friendly interface for anybody to gain intuition about how activation steering works.
+
+ 🔎 For more details about the research behind this, take a look at [this post on the AI Alignment Forum](https://www.alignmentforum.org/posts/5spBue2z2tw4JuDCx/steering-gpt-2-xl-by-adding-an-activation-vector) or check out [the original paper](https://arxiv.org/abs/2308.10248).
+
+ ## Model Details
+
+ We use a [pre-trained GPT2-XL model](https://huggingface.co/openai-community/gpt2-xl) from the Hugging Face model hub. The model is loaded with the [`transformer_lens` library](https://transformerlensorg.github.io/TransformerLens/), which allows us to access the activations of the model at different layers.
+
+ ## Limitations
+
+ *So how is this not the solution to the [Alignment Problem](https://en.wikipedia.org/wiki/AI_alignment)?* you might ask.
+
+ Well, this is early research, and there are some limitations to keep in mind 😇:
+
+ * [GPT2-XL](https://huggingface.co/openai-community/gpt2-xl) is quite small compared to models currently being trained (e.g. [LLAMA3](https://huggingface.co/collections/meta-llama/meta-llama-3-66214712577ca38149ebb2b6)).
+ * Activation Steering is not perfect and can lead to unintended side effects. For example, steering the model toward a prompt might lead to the model generating text that is not semantically coherent.
+ * Activation Steering is also not guaranteed to work for all prompts and all layers.
+ * It is still an open question how to best steer models in a safe and reliable way.
+
+ ## Future Work
+
+ There is an even more recent paper that builds on this research: [Steering Llama 2 via Contrastive Activation Addition](https://arxiv.org/abs/2312.06681). This paper steers the [LLAMA-2 model](https://huggingface.co/collections/meta-llama/llama-2-family-661da1f90a9d678b6f55773b) with contrastive activation additions and shows that it is possible to steer a larger chat model with this technique.
+
+ Hence, we would like to try to replicate these results on a Hugging Face Space, thus providing a chat interface that can be steered to be more helpful or more harmful.
+ """
+
+ EXAMPLES = [
+     ["I hate you because", "Hate", "Love", 12, 6, False],
+     [
+         "To see the Berlin Wall, people flock to",
+         "The Berlin Wall is in Berlin",
+         "The Berlin Wall is in Hamburg",
+         10,
+         20,
+         False,
+     ],
+     ["I went up to my friend and said", " ", " wedding", 4, 6, False],
+ ]
+
+ demo = gr.Interface(
+     theme="gradio/seafoam@0.0.1",
+     fn=predict,
+     inputs=[
+         message_input,
+         prompt_sub_input,
+         prompt_add_input,
+         coeff_input,
+         act_name_input,
+         no_steering_input,
+     ],
+     outputs=text_output,
+     title="ACTIVATE LOVE",
+     description=DESCRIPTION,
+     allow_duplication=True,
+     article=ARTICLE,
+     allow_flagging="never",
+     examples=EXAMPLES,
+     cache_examples=False,
+     css=CSS,
+ )
+ print("Starting demo!")
+ demo.launch()
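Note on the steering arithmetic in `app.py`: `get_activations` and `create_hook` together add `coeff * (act_add - act_sub)` to the first few token positions of the prompt's residual stream. The sketch below mirrors that arithmetic with plain Python lists so it can be run without loading GPT-2 XL. It is only an illustration under stated assumptions: the names `steering_vector` and `apply_steering` are hypothetical, the numbers are toy values rather than real activations, and, unlike the hook in `app.py`, it returns a copy instead of editing in place.

```python
from typing import List

Vector = List[float]
Activations = List[Vector]  # one vector per token position


def steering_vector(act_add: Activations, act_sub: Activations) -> Activations:
    """Position-wise difference of two equally long activation sequences."""
    if len(act_add) != len(act_sub):
        raise ValueError("pad both steering prompts to the same token length first")
    return [[a - s for a, s in zip(va, vs)] for va, vs in zip(act_add, act_sub)]


def apply_steering(resid: Activations, act_diff: Activations, coeff: float) -> Activations:
    """Add coeff * act_diff to the first len(act_diff) positions of resid."""
    if len(act_diff) > len(resid):
        raise ValueError("steering prompts have more tokens than the prompt")
    steered = [list(v) for v in resid]  # copy; app.py's hook modifies in place
    for pos, diff in enumerate(act_diff):
        steered[pos] = [x + coeff * d for x, d in zip(steered[pos], diff)]
    return steered


# Toy values only (hypothetical), two dimensions instead of GPT-2 XL's width.
act_love = [[1.0, 0.0], [0.5, 0.5]]
act_hate = [[0.0, 1.0], [0.5, 0.5]]
prompt_resid = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]

diff = steering_vector(act_love, act_hate)
steered = apply_steering(prompt_resid, diff, coeff=2.0)
print(steered)
```

Only the first two positions are touched here because the steering prompts are two tokens long; the third position of the prompt passes through unchanged, which matches the `resid_pre[:, :apos, :]` slice in `ave_hook`.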
requirements.txt ADDED
@@ -0,0 +1,4 @@
+ gradio==4.31.5
+ pytest==8.2.1
+ spaces==0.28.3
+ transformer-lens==1.15.0
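Note on `pytest`: it is pinned above, but this commit does not add any test files. As a rough sketch of what a first test could look like, the block below copies the two pure formatting helpers from `app.py` (so nothing needs to import the model) and checks their output. The test function names are hypothetical, and where such a file would live in the repository is an assumption left open here.

```python
# Helpers copied from app.py so the sketch runs without loading GPT-2 XL.
def config_header_str():
    return f"{'='*8} CONFIGURATION {'='*8}"


def sample_header_str(i: int):
    return f"{'='*11} SAMPLE {i+1} {'='*11}"


# pytest collects functions whose names start with "test_".
def test_config_header_str():
    assert config_header_str() == "======== CONFIGURATION ========"


def test_sample_header_is_one_based():
    # Index 0 is shown to the user as "SAMPLE 1".
    assert sample_header_str(0) == "=========== SAMPLE 1 ==========="
    assert "SAMPLE 4" in sample_header_str(3)


if __name__ == "__main__":
    # Allows running the checks without pytest installed.
    test_config_header_str()
    test_sample_header_is_one_based()
    print("ok")
```

Testing only the string helpers keeps the run fast, since anything that touches `model` would first have to download and load GPT-2 XL.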