Orthogonal Activation Steering

#2
by WesPro - opened

"This model have received the Orthogonal Activation Steering treatment, meaning it will rarely refuse any request."

Hi,

I'm really interested in the OAS method mentioned in the model card. I'd appreciate any further info on how to apply this treatment to models like Llama3 8b, Mistral v0.3 7b and maybe other "small" models. I haven't found anything useful that really explains how to do it. I'm assuming OAS is similar to, or even the same as, what has been done to models with the "abliterated" tag; at least it sounds like it would do the same thing. It would be awesome if someone could help me with this, or point me to a tutorial if there is one.

Thanks, and keep making great models like this one :)

NeverSleep org

I used this script https://gist.github.com/wassname/42aba7168bb83e278fcfea87e70fa3af with this dataset https://huggingface.co/datasets/Undi95/orthogonal-activation-steering-TOXIC

You can find more info on what Orthogonal Activation Steering is here and there.

Overall, it modifies the model's behavior by removing the mechanism it uses to refuse, making it mostly uncensored.
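To make the idea a bit more concrete, here is a minimal NumPy sketch of the general technique as it is usually described for "abliterated" models: estimate a "refusal direction" as the difference between mean activations on two prompt sets, then project that direction out. This is not the code from the linked gist. The function names, the shapes, and the random stand-in activations are all assumptions made for illustration.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along the difference of the two mean activations."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def ablate_direction(acts, direction):
    """Remove the component of each activation row along `direction`."""
    return acts - np.outer(acts @ direction, direction)

def orthogonalize_weights(W, direction):
    """Project `direction` out of a matrix whose output lives in the
    activation space (W has shape d_model x d_in), so the layer can no
    longer write along that direction."""
    return W - np.outer(direction, direction @ W)

# Synthetic stand-ins for real hidden states (hypothetical data):
# 32 prompts per set, hidden size 16, with an artificial offset on axis 0.
rng = np.random.default_rng(0)
offset = np.zeros(16)
offset[0] = 2.0
harmful = rng.normal(size=(32, 16)) + offset
harmless = rng.normal(size=(32, 16))

direction = refusal_direction(harmful, harmless)
cleaned = ablate_direction(harmful, direction)

W = rng.normal(size=(16, 8))
W_orth = orthogonalize_weights(W, direction)
```

After the projection, the cleaned activations and the outputs of the orthogonalized matrix have no component along the estimated direction. In a real run the activations would come from the model's hidden states on the dataset prompts instead of random numbers.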

Thanks for the quick answer. I'll try to use this when I'm back home. Is this just for Llama3 8B or can I run this on any model?

NeverSleep org

No, it should work on other architectures.
In the old version the prompt format was hardcoded, but if I'm not wrong, this one loads the right one automatically if it's in the tokenizer_config.json file.

I will check later if you haven't done it before then.
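If you want to check up front whether a model ships its prompt format, a small sketch like the one below looks for the chat_template field in tokenizer_config.json. It assumes the model files are already in a local folder; the helper name find_chat_template and the temporary demo folder are made up for the example and are not part of the script.

```python
import json
import tempfile
from pathlib import Path

def find_chat_template(model_dir):
    """Return the "chat_template" entry from tokenizer_config.json,
    or None if the file or the field is missing."""
    config_path = Path(model_dir) / "tokenizer_config.json"
    if not config_path.exists():
        return None
    with config_path.open(encoding="utf-8") as f:
        config = json.load(f)
    # The value is returned as stored; this sketch does not render it.
    return config.get("chat_template")

# Demo on a throwaway folder with a toy config (hypothetical content).
with tempfile.TemporaryDirectory() as tmp:
    toy_config = {"chat_template": "{{ messages[0]['content'] }}"}
    (Path(tmp) / "tokenizer_config.json").write_text(
        json.dumps(toy_config), encoding="utf-8"
    )
    found = find_chat_template(tmp)

# A folder without the file yields None.
with tempfile.TemporaryDirectory() as empty_dir:
    missing = find_chat_template(empty_dir)
```

If this returns None for your model, the prompt format would probably have to be supplied by hand.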
