license: mit
datasets:
- koutch/stackoverflow_python
- Vezora/Tested-143k-Python-Alpaca
language:
- en
base_model:
- meta-llama/Llama-3.1-8B-Instruct
This repository contains sparse autoencoders (SAEs) trained to analyze the internal representations of the Llama 3.1 8B Instruct model. The autoencoders are trained on residual stream activations collected while the model processes code-related instruction data.
We apply these specialized, lightweight SAEs to a coding task in our blog post, Sieve.
Model Details
- Model Type: TopK Sparse Autoencoder
- Base Model: Llama 3.1 8B Instruct
- Training Data: 1B tokens of code data from:
  - StackOverflow Python dataset
  - Tested-143k Python Alpaca dataset
- Architecture: Linear encoder-decoder with ReLU and TopK activation (k=64, 512)
- File Format: PyTorch .pt files containing:
  - W_enc_DF: Encoder weight matrix
  - b_enc_F: Encoder bias vector
  - W_dec_FD: Decoder weight matrix
  - b_dec_D: Decoder bias vector
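The architecture and file layout above can be illustrated with a minimal sketch of a TopK SAE forward pass. Several things here are assumptions rather than facts from this repository: that each .pt file stores a plain dict keyed by the four tensor names, that the suffixes denote shapes (D for model width, F for number of SAE features), that the input is not centered by subtracting b_dec_D before encoding, and the function name `topk_sae_forward` itself. The demo uses small random stand-in tensors instead of a real checkpoint; the Sieve repo remains the reference for actual usage.

```python
import torch


def topk_sae_forward(x_D, params, k):
    """Encode activations with ReLU + TopK, then decode (illustrative sketch)."""
    # Pre-activations for every dictionary feature: shape (..., F).
    pre_F = x_D @ params["W_enc_DF"] + params["b_enc_F"]
    acts_F = torch.relu(pre_F)
    # Keep only the k largest activations per row and zero out the rest.
    top_vals, top_idx = acts_F.topk(k, dim=-1)
    sparse_F = torch.zeros_like(acts_F).scatter(-1, top_idx, top_vals)
    # Linear decoder maps the sparse code back to model width: shape (..., D).
    recon_D = sparse_F @ params["W_dec_FD"] + params["b_dec_D"]
    return sparse_F, recon_D


# With a real checkpoint one would presumably load the dict first, e.g.
#   params = torch.load("path/to/sae.pt", map_location="cpu")  # hypothetical path
# Here random stand-in tensors with toy sizes are used instead.
torch.manual_seed(0)
D, F, k = 16, 64, 4  # toy sizes, not the real model dimensions
params = {
    "W_enc_DF": torch.randn(D, F),
    "b_enc_F": torch.zeros(F),
    "W_dec_FD": torch.randn(F, D),
    "b_dec_D": torch.zeros(D),
}
x_D = torch.randn(3, D)  # three stand-in "token" activations
sparse_F, recon_D = topk_sae_forward(x_D, params, k)
print(sparse_F.shape, recon_D.shape)
```

At most k features are nonzero per row, which is the property that makes individual features convenient to inspect.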
Usage
The autoencoders can be used to analyze and interpret the internal representations formed by Llama 3.1 8B Instruct when processing code. Because they are trained on a narrow data mixture, they are not recommended for general-purpose use. They can be used to reproduce the results of the Sieve evaluation for Llama 3.1 8B Instruct.
Example usage can be found in the Sieve repo.
Training Details
- Training Data Size: 1B tokens
- Domain: Python code and code-related instructions
- Target: Residual stream activations from layers 8, 10, and 12 of Llama 3.1 8B Instruct
- Compute: Approximately 9 A100 GPU-hours
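To apply an SAE, activations first have to be collected at the target layers. One common way to do this in PyTorch is a forward hook; the sketch below shows the pattern on a toy stack of linear layers rather than the real model. The helper name `capture_layer_outputs` and the toy dimensions are hypothetical. For the actual model, the decoder blocks in Hugging Face Transformers are typically reachable as `model.model.layers`, but whether "layer 8" here means the output of block index 8 or a different indexing convention is an assumption to verify against the Sieve repo.

```python
import torch
from torch import nn


def capture_layer_outputs(layers, layer_indices, run_forward):
    """Run `run_forward()` and return {layer index: detached layer output}."""
    captured = {}
    handles = []
    for idx in layer_indices:
        def hook(module, inputs, output, idx=idx):
            # Some decoder blocks return a tuple; the hidden state comes first.
            hidden = output[0] if isinstance(output, tuple) else output
            captured[idx] = hidden.detach()
        handles.append(layers[idx].register_forward_hook(hook))
    try:
        run_forward()
    finally:
        # Always remove hooks so later forward passes are unaffected.
        for handle in handles:
            handle.remove()
    return captured


# Toy stand-in for a stack of decoder blocks (13 layers, width 8).
toy_layers = nn.ModuleList([nn.Linear(8, 8) for _ in range(13)])


def run_toy():
    hidden = torch.randn(2, 5, 8)  # (batch, sequence, width)
    with torch.no_grad():
        for layer in toy_layers:
            hidden = layer(hidden)


acts = capture_layer_outputs(toy_layers, [8, 10, 12], run_toy)
print({idx: tuple(a.shape) for idx, a in acts.items()})
```

Each captured tensor can then be passed, token by token or in batches, through the SAE trained for that layer.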
License
MIT
Citation
If you use these models in your research, please cite:
@article{karvonen2024sieve,
title={Sieve: SAEs Beat Baselines on a Real-World Task (A Code Generation Case Study)},
author={Karvonen, Adam and Pai, Dhruv and Wang, Mason and Keigwin, Ben},
journal={Tilde Research Blog},
year={2024},
month={12},
url={https://www.tilderesearch.com/blog/sieve},
note={Blog post}
}