---
license: cc-by-nc-sa-4.0
datasets:
  - allenai/real-toxicity-prompts
base_model:
  - meta-llama/Meta-Llama-3-8B
---

# SCAR

Official weights for the paper [*SCAR: Sparse Conditioned Autoencoders for Concept Detection and Steering in LLMs*](https://arxiv.org/abs/2411.07122).

The code is located in this repository.

## Requirements

Set up the environment with [Poetry](https://python-poetry.org):

```bash
poetry install
```

## Usage

Load the model weights from Hugging Face:

```python
import transformers

# trust_remote_code is required because SCAR uses a custom model class.
SCAR = transformers.AutoModelForCausalLM.from_pretrained(
    "AIML-TUDA/SCAR",
    trust_remote_code=True,
)
```

The loaded model is based on the Llama3-8B base model, so we can use its tokenizer:

```python
tokenizer = transformers.AutoTokenizer.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", padding_side="left"
)
tokenizer.pad_token = tokenizer.eos_token

text = "This is text."
toks = tokenizer(text, return_tensors="pt", padding=True)
```

To modify the latent feature $h_0$ of the SAE (`SCAR.hook.mod_features = 0`), do the following:

```python
# Steer feature h_0 downwards to suppress the corresponding concept.
SCAR.hook.mod_features = 0
SCAR.hook.mod_scaling = -100.0
output = SCAR.generate(
    **toks,
    do_sample=False,
    temperature=None,
    top_p=None,
    max_new_tokens=32,
    pad_token_id=tokenizer.eos_token_id,
)
```

The example above decreases toxicity. To increase toxicity instead, set `SCAR.hook.mod_scaling = 100.0`. To modify nothing, simply set `SCAR.hook.mod_features = None`.
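
For illustration, here is a minimal sketch of the three settings. It relies only on the `SCAR.hook` attributes shown above; the decoding step via `tokenizer.batch_decode` is standard `transformers` usage rather than part of the original example:

```python
# Decode the detoxified generation produced above.
print(tokenizer.batch_decode(output, skip_special_tokens=True)[0])

# Amplify feature h_0 instead, steering generations toward toxicity.
SCAR.hook.mod_features = 0
SCAR.hook.mod_scaling = 100.0

# Disable steering entirely: activations pass through unchanged.
SCAR.hook.mod_features = None
```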

## Reproduction

The scripts for generating the training data are located in `./create_training_data`. The training script is written for a Determined cluster but should be easily adaptable to other training frameworks; the corresponding script is located at `./llama3_SAE/determined_trails.py`. Some of the evaluation functions are located in `./evaluations`.

## Citation

```bibtex
@misc{haerle2024SCAR,
    title={SCAR: Sparse Conditioned Autoencoders for Concept Detection and Steering in LLMs},
    author={Ruben Härle and Felix Friedrich and Manuel Brack and Björn Deiseroth and Patrick Schramowski and Kristian Kersting},
    year={2024},
    eprint={2411.07122},
    archivePrefix={arXiv}
}
```