AIML-TUDA
/

SCAR

Model card Files Files and versions Community

SCAR / README.md

RuHae's picture

Update README.md

17abac9 verified 2 months ago

|

2.2 kB

	---
	license: cc-by-nc-sa-4.0
	datasets:
	- allenai/real-toxicity-prompts
	base_model:
	- meta-llama/Meta-Llama-3-8B
	---

	# SCAR

	Official weights for the Paper [Scar: Sparse Conditioned Autoencoders for Concept Detection and Steering in LLMs](https://arxiv.org/abs/2411.07122).

	The code is located in this [Repository](https://github.com/ml-research/SCAR).

	# Requirements

	Set up the environment with [poetry](https://python-poetry.org/):

	```
	poetry install
	```

	# Usage

	Load the model weights from HuggingFace:
	```python
	import transformers

	SCAR = transformers.AutoModelForCausalLM.from_pretrained(
	"AIML-TUDA/SCAR",
	trust_remote_code=True,
	)
	```

	The model loaded model is based on LLama3-8B base. So we can use the tokenizer from it:

	```python
	tokenizer = transformers.AutoTokenizer.from_pretrained(
	"meta-llama/Meta-Llama-3-8B", padding_side="left"
	)
	tokenizer.pad_token = tokenizer.eos_token
	text = "This is text."
	toks = tokenizer(text, return_tensors="pt", padding=True)
	```

	To modify the latent feature $h_0$ (`SCAR.hook.mod_features = 0`) of the SAE do the following:
	```python
	SCAR.hook.mod_features = 0
	SCAR.hook.mod_scaling = -100.0
	output = SCAR.generate(
	**toks,
	do_sample=False,
	temperature=None,
	top_p=None,
	max_new_tokens=32,
	pad_token_id=tokenizer.eos_token_id,
	)
	```
	The example above will decrease toxicity. To increase the toxicity one would set `SCAR.hook.mod_scaling = 100.0`. To modify nothing simply set `SCAR.hook.mod_features = None`.

	# Reproduction

	The scripts for generating the training data are located in `./create_training_data`.
	The training script is written for a Determined cluster but should be easily adaptable for other training frameworks. The corresponding script is located here `./llama3_SAE/determined_trails.py`.
	Some the evaluation functions are located in `./evaluations`.

	# Citation
	```bibtex
	@misc{haerle2024SCAR
	title={SCAR: Sparse Conditioned Autoencoders for Concept Detection and Steering in LLMs},
	author={Ruben Härle, Felix Friedrich, Manuel Brack, Björn Deiseroth, Patrick Schramowski, Kristian Kersting},
	year={2024},
	eprint={2411.07122},
	archivePrefix={arXiv}
	}
	```