title: Activate Love
emoji: ❤️
colorFrom: purple
colorTo: red
sdk: gradio
sdk_version: 4.31.5
app_file: app.py
pinned: true
license: mit
short_description: Steering AI Text Generation

Activate Love ❤️

A Gradio app, hosted as a Hugging Face Space, that replicates results from the paper »Activation Addition: Steering Language Models Without Optimization«.

Demo

Check it out at https://huggingface.co/spaces/janraasch/activate-love 🎯

Raison d'être

This is my final project for the AI Safety Fundamentals course on AI Alignment.

When we covered Mechanistic Interpretability in session six, my cohort's instructor mentioned the paper on activation addition published in late 2023. I found it an enjoyable and interesting way to play around with the inner workings of a model without any training or optimization.

The authors kindly provide a notebook on Google Colab so that everyone can replicate their results. Still, I felt it would be useful to offer an even more user-friendly, non-technical interface that lowers the barrier to interacting with these low-level workings of the model.

Hence this app (https://huggingface.co/spaces/janraasch/activate-love) exists, so that everyone can steer and play with GPT-2 XL.
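
For readers who want a feel for what "steering" means here, the following is a minimal toy sketch of the idea as I understand it from the paper: take the activations of two contrasting prompts, scale their difference by a coefficient, and add the result to the leading token positions of the activations for the prompt being completed. This is not the code in app.py and not the authors' implementation. The function names, the coefficient, and the two-dimensional vectors are made up for illustration, and no real GPT-2 XL activations are involved.

```python
# Toy illustration of activation addition. Hypothetical names and values,
# plain Python lists stand in for a model's per-token activations.

def steering_vector(pos_acts, neg_acts, coeff):
    """Scaled difference between the activations of two contrasting prompts."""
    return [
        [coeff * (p - n) for p, n in zip(pos_tok, neg_tok)]
        for pos_tok, neg_tok in zip(pos_acts, neg_acts)
    ]


def add_steering(resid, steer):
    """Add the steering vector to the first len(steer) token positions."""
    out = [list(tok) for tok in resid]  # copy so the input stays untouched
    for i, steer_tok in enumerate(steer[: len(out)]):
        out[i] = [r + s for r, s in zip(out[i], steer_tok)]
    return out


# Made-up activations for a "Love" prompt and a "Hate" prompt (2 tokens each)
love = [[1.0, 0.0], [0.5, 2.0]]
hate = [[0.0, 1.0], [0.5, 0.0]]

# Made-up activations for the prompt being completed (3 tokens)
resid = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]

steered = add_steering(resid, steering_vector(love, hate, coeff=5.0))
print(steered)
```

In a real model the same addition would happen inside one chosen layer during the forward pass, and generation would continue from the modified activations. The toy version only shows the arithmetic.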

Development

# Create virtual environment
python3 -m venv gradio-env
source gradio-env/bin/activate

# Install dependencies
pip install -r requirements.txt

# Run app locally
gradio app.py

License

MIT License © Jan Raasch