---
license: cc-by-nc-sa-4.0
datasets:
- allenai/real-toxicity-prompts
base_model:
- meta-llama/Meta-Llama-3-8B
---
# SCAR
Official weights for the paper [**SCAR: Sparse Conditioned Autoencoders for Concept Detection and Steering in LLMs**](https://arxiv.org/abs/2411.07122).
The code is available in this [repository](https://github.com/ml-research/SCAR).
# Requirements
Set up the environment with [poetry](https://python-poetry.org/):
```
poetry install
```
# Usage
Load the model weights from HuggingFace:
```python
import transformers
SCAR = transformers.AutoModelForCausalLM.from_pretrained(
    "AIML-TUDA/SCAR",
    trust_remote_code=True,
)
```
The loaded model is based on the Llama 3 8B base model, so we can use its tokenizer:
```python
tokenizer = transformers.AutoTokenizer.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", padding_side="left"
)
# Reuse the EOS token as the padding token.
tokenizer.pad_token = tokenizer.eos_token
text = "This is text."
toks = tokenizer(text, return_tensors="pt", padding=True)
```
To modify the latent feature $h_0$ of the SAE (`SCAR.hook.mod_features = 0`), do the following:
```python
SCAR.hook.mod_features = 0      # index of the latent feature to modify
SCAR.hook.mod_scaling = -100.0  # -100.0 decreases toxicity
output = SCAR.generate(
    **toks,
    do_sample=False,  # greedy decoding
    temperature=None,
    top_p=None,
    max_new_tokens=32,
    pad_token_id=tokenizer.eos_token_id,
)
```
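`generate` returns token IDs rather than text. The following is a minimal sketch for reading only the newly generated part, assuming the usual `transformers` behavior for decoder-only models, where each returned sequence is the prompt followed by the new tokens. `decode_continuations` is a hypothetical helper and not part of the SCAR repository:

```python
def decode_continuations(tokenizer, output, prompt_len):
    """Decode only the newly generated tokens of each sequence.

    Assumption: `output` holds prompt + continuation per row, and every
    prompt occupies the first `prompt_len` positions (true for a
    left-padded batch, as configured above).
    """
    # Drop the prompt tokens from every sequence in the batch.
    new_tokens = [seq[prompt_len:] for seq in output]
    # batch_decode turns each list of token IDs back into a string.
    return tokenizer.batch_decode(new_tokens, skip_special_tokens=True)
```

With the objects from the snippets above, this could be called as `decode_continuations(tokenizer, output, toks["input_ids"].shape[1])`.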
With `SCAR.hook.mod_scaling = -100.0`, generation is steered toward lower toxicity. To increase toxicity, set `SCAR.hook.mod_scaling = 100.0`. To leave the model unmodified, set `SCAR.hook.mod_features = None`.
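Because the steering settings live on `SCAR.hook`, they stay active for every later `generate` call until they are reset. One way to scope them is a small context manager. This is a sketch, not part of the SCAR code: `steering` is a hypothetical helper, and it assumes only the two attributes shown above (`mod_features`, `mod_scaling`), treating a missing attribute as `None`:

```python
from contextlib import contextmanager


@contextmanager
def steering(model, feature, scaling):
    """Temporarily set the steering attributes on `model.hook`."""
    hook = model.hook
    # Remember the current settings (assumed to default to None).
    previous = (
        getattr(hook, "mod_features", None),
        getattr(hook, "mod_scaling", None),
    )
    hook.mod_features, hook.mod_scaling = feature, scaling
    try:
        yield model
    finally:
        # Restore the earlier settings even if generation raised.
        hook.mod_features, hook.mod_scaling = previous
```

Used as `with steering(SCAR, 0, -100.0): output = SCAR.generate(**toks, max_new_tokens=32)`, the model returns to its previous settings once the block exits.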
# Reproduction
The scripts for generating the training data are located in `./create_training_data`.
The training script is written for a Determined cluster but should be easily adaptable to other training frameworks. It is located at `./llama3_SAE/determined_trails.py`.
Some of the evaluation functions are located in `./evaluations`.
# Citation
```bibtex
@misc{haerle2024SCAR,
    title={SCAR: Sparse Conditioned Autoencoders for Concept Detection and Steering in LLMs},
    author={Ruben Härle and Felix Friedrich and Manuel Brack and Björn Deiseroth and Patrick Schramowski and Kristian Kersting},
    year={2024},
    eprint={2411.07122},
    archivePrefix={arXiv}
}
```