---
license: cc-by-nc-sa-4.0
datasets:
- allenai/real-toxicity-prompts
base_model:
- meta-llama/Meta-Llama-3-8B
---
# SCAR
Official weights for the paper [SCAR: Sparse Conditioned Autoencoders for Concept Detection and Steering in LLMs](https://arxiv.org/abs/2411.07122). The code is located in this repository.
## Requirements
Set up the environment with poetry:

```bash
poetry install
```
## Usage
Load the model weights from HuggingFace:

```python
import transformers

SCAR = transformers.AutoModelForCausalLM.from_pretrained(
    "AIML-TUDA/SCAR",
    trust_remote_code=True,
)
```
The loaded model is based on the Llama-3-8B base model, so we can use its tokenizer:

```python
tokenizer = transformers.AutoTokenizer.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", padding_side="left"
)
tokenizer.pad_token = tokenizer.eos_token

text = "This is text."
toks = tokenizer(text, return_tensors="pt", padding=True)
```
To modify the latent feature $h_0$ of the SAE (`SCAR.hook.mod_features = 0`), do the following:

```python
SCAR.hook.mod_features = 0
SCAR.hook.mod_scaling = -100.0
output = SCAR.generate(
    **toks,
    do_sample=False,
    temperature=None,
    top_p=None,
    max_new_tokens=32,
    pad_token_id=tokenizer.eos_token_id,
)
```
The example above will decrease toxicity. To increase toxicity, one would set `SCAR.hook.mod_scaling = 100.0`. To modify nothing, simply set `SCAR.hook.mod_features = None`. A minimal comparison of the two modes is sketched below.
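For illustration, here is a minimal sketch (not part of the original card) that contrasts an unmodified generation with a steered one. It uses only the `SCAR.hook` attributes shown above together with the standard `transformers` decoding API:

```python
# Baseline: disable feature modification entirely.
SCAR.hook.mod_features = None
baseline = SCAR.generate(
    **toks,
    do_sample=False,
    temperature=None,
    top_p=None,
    max_new_tokens=32,
    pad_token_id=tokenizer.eos_token_id,
)

# Steered: suppress latent feature h_0 to decrease toxicity.
SCAR.hook.mod_features = 0
SCAR.hook.mod_scaling = -100.0
steered = SCAR.generate(
    **toks,
    do_sample=False,
    temperature=None,
    top_p=None,
    max_new_tokens=32,
    pad_token_id=tokenizer.eos_token_id,
)

# Decode and compare the two continuations.
print("baseline:", tokenizer.batch_decode(baseline, skip_special_tokens=True)[0])
print("steered: ", tokenizer.batch_decode(steered, skip_special_tokens=True)[0])
```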
## Reproduction
The scripts for generating the training data are located in `./create_training_data`.

The training script is written for a Determined cluster but should be easily adaptable to other training frameworks. The corresponding script is located at `./llama3_SAE/determined_trails.py`.

Some of the evaluation functions are located in `./evaluations`.
## Citation
```bibtex
@misc{haerle2024SCAR,
    title={SCAR: Sparse Conditioned Autoencoders for Concept Detection and Steering in LLMs},
    author={Ruben Härle and Felix Friedrich and Manuel Brack and Björn Deiseroth and Patrick Schramowski and Kristian Kersting},
    year={2024},
    eprint={2411.07122},
    archivePrefix={arXiv}
}
```