---
license: cc-by-nc-sa-4.0
datasets:
- allenai/real-toxicity-prompts
base_model:
- meta-llama/Meta-Llama-3-8B
---

# SCAR

Official code and weights for the paper [**SCAR: Sparse Conditioned Autoencoders for Concept Detection and Steering in LLMs**](https://arxiv.org/abs/2411.07122). The code is available in this [repository](https://github.com/ml-research/SCAR).

This repo contains the code for applying supervised sparse autoencoders (SAEs) to LLMs. The supervision enforces the presence of a concept feature, which equips the LLM with strong detection and steering abilities for that concept. Here we showcase SCAR on toxicity, using RealToxicityPrompts, but the approach should apply to other concepts as well.

# Usage

Load the model weights from HuggingFace:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"

# trust_remote_code is required because the SAE wrapper ships as custom code.
SCAR = AutoModelForCausalLM.from_pretrained(
    "AIML-TUDA/SCAR",
    trust_remote_code=True,
    device_map=device,
)

# SCAR reuses the tokenizer of its base model, Llama 3 8B.
tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", padding_side="left"
)
tokenizer.pad_token = tokenizer.eos_token  # reuse EOS as the padding token

text = "You fucking film yourself doing this shit and then you send us"
inputs = tokenizer(text, return_tensors="pt", padding=True).to(device)
```

To modify the latent feature $h_0$ of the SAE, select it with `SCAR.hook.mod_features = 0` and set a scaling value:
```python
SCAR.hook.mod_features = 0
SCAR.hook.mod_scaling = -100.0
output = SCAR.generate(
    **inputs,
    do_sample=True,
    temperature=0.2,
    max_new_tokens=32,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output[0, -32:], skip_special_tokens=True))
# ' the video. We will post it on our website and you will be known as a true fan of the site. We will also send you a free t-shirt'
```
The example above decreases toxicity. To increase toxicity instead, set `SCAR.hook.mod_scaling = 100.0`. To leave the model unmodified, set `SCAR.hook.mod_features = None`.
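Because the steering settings live as attributes on `SCAR.hook`, they stay in effect for every later call until they are reset. A minimal sketch of one way to scope them to a single generation is shown here. The `steer` helper is hypothetical and not part of the SCAR code; it only touches the two attributes used above (`mod_features` and `mod_scaling`). A stand-in object replaces the model so the snippet runs without downloading any weights.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def steer(model, feature=0, scaling=-100.0):
    """Hypothetical helper: set the hook's steering attributes, then restore them."""
    hook = model.hook
    # Remember the previous settings; getattr guards against unset attributes.
    previous = (
        getattr(hook, "mod_features", None),
        getattr(hook, "mod_scaling", None),
    )
    hook.mod_features = feature
    hook.mod_scaling = scaling
    try:
        yield model
    finally:
        # Restore the earlier settings even if generation raises an error.
        hook.mod_features, hook.mod_scaling = previous


# Stand-in for the loaded model (assumption: only a `.hook` attribute is needed).
dummy = SimpleNamespace(hook=SimpleNamespace(mod_features=None, mod_scaling=None))

with steer(dummy, feature=0, scaling=-100.0):
    inside = (dummy.hook.mod_features, dummy.hook.mod_scaling)
after = (dummy.hook.mod_features, dummy.hook.mod_scaling)
```

With the real model, the body of the `with` block would contain the `SCAR.generate(...)` call from the example above, and the hook would return to its previous state afterwards.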

# Reproduction

To reproduce the results, set up the environment with [poetry](https://python-poetry.org/):

```
poetry install
```

The scripts for generating the training data are located in `./create_training_data`.
The training script, `./llama3_SAE/determined_trails.py`, is written for a Determined cluster but should be adaptable to other training frameworks.
Some of the evaluation functions are located in `./evaluations`.

# Citation
```bibtex
@misc{haerle2024SCAR,
    title={SCAR: Sparse Conditioned Autoencoders for Concept Detection and Steering in LLMs},
    author={Ruben Härle and Felix Friedrich and Manuel Brack and Björn Deiseroth and Patrick Schramowski and Kristian Kersting},
    year={2024},
    eprint={2411.07122},
    archivePrefix={arXiv}
}
```