---
license: cc-by-nc-sa-4.0
datasets:
- allenai/real-toxicity-prompts
base_model:
- meta-llama/Meta-Llama-3-8B
---

# SCAR

Official weights for the paper [**SCAR: Sparse Conditioned Autoencoders for Concept Detection and Steering in LLMs**](https://arxiv.org/abs/2411.07122).

# Requirements

Set up the environment with [poetry](https://python-poetry.org/):

```bash
poetry install
```

# Usage

Load the model weights from HuggingFace:
```python
import transformers

SCAR = transformers.AutoModelForCausalLM.from_pretrained(
    "AIML-TUDA/SCAR",
    trust_remote_code=True,
)
```
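
Depending on your hardware, you may want to move the model to a GPU first. A minimal sketch, assuming a CUDA device is available (plain PyTorch/transformers usage, not SCAR-specific):

```python
import torch

# Optional: run on GPU if one is available; standard PyTorch, not SCAR-specific
device = "cuda" if torch.cuda.is_available() else "cpu"
SCAR = SCAR.to(device)
SCAR.eval()  # disable dropout for inference
```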

The loaded model is based on the Llama3-8B base model, so we can use its tokenizer:

```python
tokenizer = transformers.AutoTokenizer.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", padding_side="left"
)
tokenizer.pad_token = tokenizer.eos_token
text = "This is text."
toks = tokenizer(text, return_tensors="pt", padding=True)
```
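
If the model was moved to a GPU as sketched above, the tokenized inputs have to live on the same device; `BatchEncoding.to` is the standard transformers way to do this:

```python
# Keep the inputs on the same device as the model before generating
toks = toks.to(SCAR.device)
```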

To modify the latent feature $h_0$ of the SAE (`SCAR.hook.mod_features = 0`), do the following:
```python
SCAR.hook.mod_features = 0
SCAR.hook.mod_scaling = -100.0
output = SCAR.generate(
    **toks,
    do_sample=False,
    temperature=None,
    top_p=None,
    max_new_tokens=32,
    pad_token_id=tokenizer.eos_token_id,
)
```
The example above decreases toxicity. To increase toxicity, set `SCAR.hook.mod_scaling = 100.0`; to leave the latent features unmodified, set `SCAR.hook.mod_features = None`. A combined sketch follows below.
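
Putting the steps together, a minimal sketch that compares generations under all three settings; the scaling magnitudes mirror the example above, and `tokenizer.batch_decode` is the standard transformers call for turning generated token IDs back into text:

```python
# Compare generations with the feature suppressed, untouched, and amplified
for label, features, scaling in [
    ("decreased", 0, -100.0),
    ("unmodified", None, 0.0),
    ("increased", 0, 100.0),
]:
    SCAR.hook.mod_features = features
    SCAR.hook.mod_scaling = scaling
    output = SCAR.generate(
        **toks,
        do_sample=False,
        temperature=None,
        top_p=None,
        max_new_tokens=32,
        pad_token_id=tokenizer.eos_token_id,
    )
    print(f"{label}: {tokenizer.batch_decode(output, skip_special_tokens=True)[0]}")
```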

# Reproduction

The scripts for generating the training data are located in `./create_training_data`.
The training script is written for a Determined cluster but should be easily adaptable to other training frameworks; the corresponding script is located at `./llama3_SAE/determined_trails.py`.
Some of the evaluation functions are located in `./evaluations`.

# Citation
```bibtex
@misc{haerle2024SCAR,
  title={SCAR: Sparse Conditioned Autoencoders for Concept Detection and Steering in LLMs},
  author={Ruben Härle and Felix Friedrich and Manuel Brack and Björn Deiseroth and Patrick Schramowski and Kristian Kersting},
  year={2024},
  eprint={2411.07122},
  archivePrefix={arXiv},
}
```