anakin87
posted an update Jul 1
How to alter the behavior of a Language Model without fine-tuning or prompting? Say hello to 🎤 yo-Llama 🦙!

Model anakin87/yo-Llama-3-8B-Instruct

This experiment steers Llama-3-8B-Instruct to respond in a rap style.
How? By amplifying the rap direction in the activation space. 😎


𝐖𝐡𝐚𝐭 𝐬𝐩𝐚𝐫𝐤𝐞𝐝 𝐭𝐡𝐢𝐬 𝐢𝐝𝐞𝐚?

Lately, I got interested in mechanistic interpretability of LLMs.

💡 A recent paper, "Refusal in Language Models Is Mediated by a Single Direction," showed how to find the refusal direction in the activation space of Chat Language Models and either erase or amplify it.
In practice, this is a clever jailbreak method for open-weights models.
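
As a rough sketch of the idea (notation is mine, following the paper's difference-of-means approach; the scaling coefficient α is an assumption added for illustration): the direction at layer l is the difference between the mean activation on prompts that show the behavior and the mean activation on prompts that do not. Erasing removes the component of an activation x along that direction, while amplifying adds the direction to x.

```latex
% Difference-of-means direction at layer l
% \mu^{(l)}: mean activation with the behavior, \nu^{(l)}: mean activation without it
r^{(l)} = \mu^{(l)} - \nu^{(l)}, \qquad \hat{r} = \frac{r^{(l)}}{\lVert r^{(l)} \rVert}

% Erase (directional ablation): project the direction out of x
x' = x - \hat{r}\,\hat{r}^{\top} x

% Amplify (activation addition): push x along the direction
% \alpha > 0 is an illustrative scaling coefficient
x' = x + \alpha\, r^{(l)}
```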

Then, @failspy took it a step further by modifying the models to amplify different traits, such as making a model seem grumpy or irritable.


𝐇𝐨𝐰 𝐝𝐢𝐝 𝐈 𝐜𝐫𝐞𝐚𝐭𝐞 𝐲𝐨-𝐋𝐥𝐚𝐦𝐚?
(📓 notebook in the HF repository, heavily inspired by Failspy's work)

1️⃣ Load the Llama-3-8B-Instruct model.
2️⃣ Load 1024 examples from Alpaca (instruction dataset).
3️⃣ Prepare a system prompt to make the original model act like a rapper.
4️⃣ Run inference on the examples, with and without the system prompt, and cache the activations.
5️⃣ Compute the rap feature directions (one for each layer) from the activations.
6️⃣ Apply the feature directions one by one, checking the results on some examples.
7️⃣ Pick the best-performing feature direction.
8️⃣ Apply this feature direction and voilà!
yo-Llama-3-8B-Instruct is born! 🥳🎶
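
To make steps 4️⃣ to 8️⃣ more concrete, here is a minimal toy sketch in plain Python. It is not the code from the notebook: the tiny hand-written "activations", the helper names (`mean_vector`, `feature_direction`, `amplify`), and the `alpha` strength are all hypothetical, and the real workflow caches activations from the actual model and may apply the direction differently (for example, by editing the weights). The sketch only shows the core arithmetic: a per-layer difference of means, normalized to a unit vector, then added to an activation.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def feature_direction(with_prompt, without_prompt):
    """Unit vector pointing from the baseline mean to the 'rap' mean."""
    mu_rap = mean_vector(with_prompt)
    mu_base = mean_vector(without_prompt)
    diff = [a - b for a, b in zip(mu_rap, mu_base)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("both activation sets have the same mean")
    return [d / norm for d in diff]

def projection(activation, direction):
    """Scalar component of an activation along a unit direction."""
    return sum(a * d for a, d in zip(activation, direction))

def amplify(activation, direction, alpha):
    """Push an activation along the direction (alpha is a made-up strength)."""
    return [a + alpha * d for a, d in zip(activation, direction)]

# Toy stand-ins for cached activations: {layer index: [one vector per example]}.
# In the real experiment these would come from the model, with and without
# the rapper system prompt.
rap_acts = {
    0: [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]],
    1: [[0.0, 4.0, 1.0], [0.0, 6.0, 1.0]],
}
base_acts = {
    0: [[0.0, 0.0, 2.0], [0.0, 0.0, 2.0]],
    1: [[0.0, 1.0, 1.0], [0.0, 1.0, 1.0]],
}

# Step 5: one candidate direction per layer.
directions = {
    layer: feature_direction(rap_acts[layer], base_acts[layer])
    for layer in rap_acts
}

# Steps 6 to 8: try a candidate direction on a new activation.
original = [0.5, 0.5, 0.5]
steered = amplify(original, directions[0], alpha=4.0)
print(projection(original, directions[0]), projection(steered, directions[0]))
```

In this toy version, picking the "best-performing" direction (step 7️⃣) would still be a manual check of generated text, as described above; the code only produces the candidates.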

This was a fun experiment.


📚 Resources

Refusal in Language Models Is Mediated by a Single Direction - https://arxiv.org/abs/2406.11717

Uncensor any LLM with abliteration: great practical blog post by @mlabonne https://huggingface.co/blog/mlabonne/abliteration

Practical materials by @failspy
- abliterator library https://github.com/FailSpy/abliterator
- Llama-MopeyMule-3-8B-Instruct model (+ notebook) failspy/Llama-3-8B-Instruct-MopeyMule