failspy

AI & ML interests

None yet

Organizations

Cognitive Computations

failspy's activity

replied to mlabonne's post about 2 months ago

Got tagged as the one to point fingers at. It's true, you can direct your hate mail about this one to me. :P I would like to offer a defense though, which is that the community had taken to calling it "orthogonalization" or "Orthogonal Activation Steering". When I first used the term "abliterate", it was purely how I was tagging my own models; I didn't expect it to become the name for the method! Oops.

But that is a really cool term, and if I did it over again, I'd call it that!
Considering 'ablation' is borrowed from the medical context as well, for the removal of dead tissue or unwanted material, 'debridement' is pretty much perfect. I didn't want to use "ablation" because many AI papers refer to ablation more generally. I did try looking for a word like 'debridement' and most definitely glanced over that one, but it never fully registered.

replied to mlabonne's post 4 months ago

Hey @gghfez. Yes, in theory!

You'll notice small differences in behavior between the orthogonalization form and the control-vector form. It could be that whatever that difference is turns out to preserve the smarts better, but I haven't personally tested it.
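
To make the distinction concrete, here's a minimal sketch of the two forms as I think of them, assuming you already have a unit-norm refusal direction `r` extracted from the residual stream (function names and shapes are purely illustrative, not any particular library's API):

```python
import torch

def steer_with_control_vector(hidden, r, alpha=1.0):
    """Control-vector form: at inference time, subtract the component of the
    residual stream that lies along r (typically applied via a forward hook)."""
    # hidden: (batch, seq, d_model); r: (d_model,), unit norm
    proj = (hidden @ r).unsqueeze(-1) * r
    return hidden - alpha * proj

def orthogonalize_weight(W, r):
    """Orthogonalization form: bake the same projection into a weight matrix
    that writes into the residual stream, so no forward pass can write along r."""
    # W: (d_model, d_in); r: (d_model,), unit norm
    return W - torch.outer(r, r) @ W
```

The control-vector form is reversible and tunable per request via `alpha`; the orthogonalization form permanently edits the weights, which is one reason the two can drift apart in behavior.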

In fact, regarding the control vectors link you sent: I've chatted a bit with @jukofyork about this, as we went back and forth briefly about this orthogonalization technique (and how to improve it) when I was first exploring this space.

I think they ended up being satisfied with the control-vector technique for their purposes, so maybe you can ask their thoughts directly re: using the difference of means between contrasting activations as a control vector.

replied to mlabonne's post 4 months ago

First of all, I apologize for my shallow understanding of ablation in the original paper

No need to apologize at all for this! I would not be inclined to say that I have a great understanding of this ablation method.

And with that being said, I'll try my best to answer your questions as far as I have observed (though not necessarily thoroughly tested).

  1. Why do you think there is a performance hit after abliteration?

I think this comes down to a couple of things. First is a lack of perfect precision in the technique. The abliterated models are made through two key steps:

  1. Find the feature direction (refusal) in the LLM's activations by looking at the model's residual stream.
  2. Orthogonalize out the feature direction from the LLM's weights to prevent any layer from writing to that direction.

I think you lose some precision in this two-step process compared to, say, inference-time intervention, where you can operate on the residual stream directly during the forward pass.
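
For illustration, here's a rough sketch of those two steps using a difference-of-means direction; helpers like `get_residual_acts` and `writing_matrices` are placeholders for however you hook activations and enumerate the matrices that write into the residual stream, not a real API:

```python
import torch

def find_refusal_direction(model, harmful_prompts, harmless_prompts, layer, pos=-1):
    """Step 1: difference of mean residual-stream activations between the
    two prompt sets at a chosen layer/token position, normalized."""
    harmful = get_residual_acts(model, harmful_prompts, layer)[:, pos, :]    # (n, d_model)
    harmless = get_residual_acts(model, harmless_prompts, layer)[:, pos, :]  # (n, d_model)
    direction = harmful.mean(dim=0) - harmless.mean(dim=0)
    return direction / direction.norm()

def ablate_direction(model, r):
    """Step 2: project r out of every matrix that writes into the residual
    stream (e.g. embedding, attention out-proj, MLP down-proj)."""
    P = torch.eye(r.shape[0]) - torch.outer(r, r)  # projection away from r
    for W in writing_matrices(model):  # assumes W has shape (d_model, d_in)
        W.data = P @ W.data
```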

Secondly, if you read about the phenomenon known as superposition described in this Anthropic paper, it's an interesting finding that a model may "pack" multiple features into the same feature space. This appears to be very common in neural network models, and it leads me to believe that even with a perfect direction, you may damage adjacent behaviors, because you're potentially muting other things in the residual stream without realizing it.
It's unclear exactly how you could remedy this other than with something like an autoencoder that can spot the complex patterns and resolve the superposition. So you lose some precision there as well.
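
For what it's worth, the kind of autoencoder I have in mind is a sparse autoencoder trained on residual-stream activations; a toy sketch (dimensions and training details are illustrative only):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy sparse autoencoder over residual-stream activations, in the spirit
    of the dictionary-learning approach to superposition."""
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x):
        feats = torch.relu(self.encoder(x))  # (hopefully) one feature per dictionary direction
        recon = self.decoder(feats)          # reconstruction of the activation
        return recon, feats

# Training (not shown) would minimize reconstruction error plus an L1 penalty
# on `feats`; the sparsity is what pulls packed features apart so you could
# ablate one of them without muting its neighbors.
```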

Technically, wouldn't the final model have unfettered access to the full depth of its knowledge from the base model?

This is more of an aside, but interestingly, in the original paper they noticed that the refusal direction could be found in the base models they tested as well. So it's possible that fine-tuning for the refusal feature just means you're "reinforcing" a concept the model already has (anecdotally).

It's hard to say what the model would and wouldn't do if the technique were perfect. Most of the time with this orthogonalization, you're working with select layers -- either for finding a feature direction, or for orthogonalizing out the direction you found.

Each layer comes with its own feature representations, building upon the previous ones. There may be some feature of "unethical" that triggers other behavior in the model that is harder to define. We're targeting a single direction here; the overall space of other "refusal features" could be multi-dimensional.

  2. What is the difference in performance/function of being uncensored between uncensored models that are obfuscated, and ones that are ablated of refusal?

I wish I could give you a satisfying answer! This is a very interesting question that I myself would like to understand more of. You have models like Phi-3 that are trained on very carefully selected base training data, where it has never seen the concept of certain "bad" behavior. You have other models like Llama, which has been trained on huge swaths of data, but then went through some amount of fine-tuning to refuse. You have completely "uncensored" models that maybe have seen refusals in their base dataset, but didn't go through any safety-tuning to encourage refusals in certain circumstances.
They each behave very differently, and applying abliteration actually sort of shows this. With Phi-3, I could immediately see that it just does not have a concept of how to do certain things. Even with an aggressive application, it wouldn't say no, but it couldn't suddenly produce the material either. Whereas Llama could.

Is abliteration an efficient fine-tuning alternative, or are there real advantages to this method?

As to this part of your question: I see this as an overall extension of the mechanistic interpretability work around "control vectors".

Looking beyond this orthogonalization for refusal features, PyReFT from the Stanford NLP team is a very good example of how this can be extended: they were able to enforce desired behavior with as few as 5 examples.

It's hard to say this isn't fine-tuning, but it is certainly a lot more computationally efficient than traditional fine-tuning techniques, considering it just requires running inference and devising a technique to find your "feature directions".

  3. I have been trying to identify examples of use cases for uncensored models other than erotic role playing, and by examples, I mean ones that contain some level of detail into the how and why of the use case scenario.

The big one I've heard is from people operating from a place of expertise attempting to use these models to help think things through. For example, I work with a group of biomedical/chemistry people who do not need the model to offer disclaimers or anything like that.

It is a waste of tokens and compute, and a waste of human brainpower, to attempt to effectively prompt the model with "do not offer unsolicited advisories".

Also, the insistence on always operating from an "ethical" standpoint has come to a head with therapists attempting to explore perhaps-sensitive mental health subjects with the model. Similarly for doctors, InfoSec red teamers understanding how something could be hacked, etc. etc.

These models do not trust the user to be one of these experts, or frequently need reminding of it. If all it takes is "please trust me" said in just the right way, should the model have had those safety barriers to begin with?

  4. Do you think that in normal use-case scenarios, there have been large improvements in avoiding false refusals in frontier models to the point where using uncensored models specifically for that purpose is not necessary anymore?

I think so, for the general public. However, I still see the need in the case of experts, and if anything, I feel as though the models are getting more adamant. As model-trainers exclude more and more "potentially harmful" data from their base training sets, we'll see even more harm to those expert users, because the models will spew nonsense -- which abliteration won't be able to help with.

replied to mlabonne's post 4 months ago

Hey @kweel, I got tagged and wanted to jump in. I think this is a big thing the community hasn't really taken the time to get clear on: what does censored/uncensored actually mean for an LLM? No one's really discussed it in detail.

Our concept of censoring comes from TV and media, where we take something that's there and either pretend it's not there or blur it to obfuscate it somewhat.

Originally, a censor was actually an official position: someone who reviews material and decides whether that material should be visible to the public eye. They decide, effectively, whether to "refuse" material, if you catch where I'm going with this. :P

So, there are a couple of ways of "censoring" a model: don't teach it bad things to begin with (obfuscation), teach it not to share certain knowledge (censoring itself), or even teach it to steer certain topics in another direction out of "ethical guidelines" (alignment? A subset of obfuscation, to me, but worth mentioning).

Abliteration is interesting to me because it's about unlocking knowledge the model already has inherently, but is taught in those circumstances to prefer to "refuse to answer" -- to censor itself, going by the definition described above. And we aim to do this whilst minimizing harm to the rest of the model.

We don't target the model's ability to refuse by saying "thou shalt not say no" (turning it into a yes-man, effectively). The abliterated models can still say no (this has been tested in the original paper), but the part that goes "assume the user is not responsible enough to be exposed to this information" gets explicitly targeted to be canceled out, with the goal of keeping the model as "intact" as possible -- don't change any of its training.

That, to me, is literally uncensoring. The model has the knowledge, and isn't inclined to "distrust" the user and refuse to share said knowledge because of that distrust. We don't alter the knowledge; we only alter how the LLM handles the user with regard to the knowledge it has.

Note: I certainly don't think abliteration is perfect as an abstract uncensoring. But the theory behind the technique is valuably pure as a method for this purpose.

upvoted an article 4 months ago

Probabilistic Fractal Activation Function (P-FAF) and Its Advantages Over Traditional Word Vectorization

liked a Space 4 months ago
reacted to grimjim's post with 👍 5 months ago
I've observed that the layers targeted in various abliteration notebooks (e.g., https://colab.research.google.com/drive/1VYm3hOcvCpbGiqKZb141gJwjdmmCcVpR?usp=sharing ) appear to be arbitrary, reflecting probable brute-force exploration. This doesn't need to be the case.

Taking a cue from the paper "The Unreasonable Ineffectiveness of the Deeper Layers" ( https://arxiv.org/abs/2403.17887 ) and PruneMe (https://github.com/arcee-ai/PruneMe), it seems reasonable to target deeper layers identified as more redundant given measured similarity across layers, as the result should be less damaging to models, reducing the need for subsequent fine-tuning. Intuitively, one should expect the resulting intervention layers to be deep but not final. The only uncertainty is if the redundancy successfully encodes refusals, something which is almost certainly model-dependent. This approach only requires the redundancy to be computed once per model, and the result used as a starting point for which layer range to restrict intervention to.