Question and a thank you!
Hey,
Thank you for this model and also for sharing how you did it!
If you don't mind a quick question, I was wondering if it's possible to abliterate a model on CPU only, or does it require strong GPUs?
I'm trying to abliterate a model myself but I don't know what to expect in terms of system requirements.
Thank you!
For this demo I used Kaggle's free GPUs (kaggle.com) to perform the abliteration. In theory, CPU-only should be possible, assuming you have the memory for it. You need to be able to run the model to process each harmful and harmless example, so that is your main bottleneck. If you cannot at least run the model with your setup, then it will not work. You don't necessarily need to use the Transformers library for this; you could use any library you want, as long as it can run the target model and is also capable of keeping the full hidden_states of each layer as it runs (most libraries should be capable of this). So I guess you could use something like llama.cpp with the largest quant you can fit on your system to process the examples.
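Just to make the collection step concrete, here is a rough sketch of what I mean using Transformers (CPU works if the model fits in RAM); the model id and the prompt are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your/target-model"  # placeholder, swap in whatever you want to abliterate
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
model.eval()

inputs = tok("example harmful or harmless prompt", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states is a tuple: (input embeddings, layer 1, ..., layer N),
# each of shape [batch, seq_len, hidden_size]. Keep only the last token's state per layer.
states = [h[0, -1, :] for h in out.hidden_states]
```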
The Transformers library was used in this case, and the important step of finding the refusal direction vector was performed all in one go, meaning the hidden_states of every layer were kept in memory for all 1024 examples (512 harmful and 512 harmless). This is for sure a bottleneck that could be cut down by saving each example's hidden_states to disk before generating the next example. Note: only one of the 32 layers is actually used for the refusal direction vector calculation in this demo, so that is another potential savings that could be addressed.
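The refusal direction itself is just a difference of means at the target layer. Something like this sketch, assuming `harmful_states` and `harmless_states` are lists of the per-layer states collected as above (one entry per example):

```python
import torch

# hidden_states[0] is the embedding, so index 19 is the output of layer 19
layer_idx = 19  # roughly 60% of the way through a 32-layer model

harmful_mean = torch.stack([s[layer_idx] for s in harmful_states]).mean(dim=0)
harmless_mean = torch.stack([s[layer_idx] for s in harmless_states]).mean(dim=0)

refusal_dir = harmful_mean - harmless_mean
refusal_dir = refusal_dir / refusal_dir.norm()  # normalize to a unit vector
```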
Edit: The layer actually used for its hidden_states was chosen as roughly 60% of the way through the model (32 layers * 0.6 ≈ the 19th layer). So you only need to run the model up to the 19th layer in this case; there is no need to waste CPU and memory on layers 20 through 32. This is another huge potential savings when trying to abliterate models that may not fit fully on your system (especially if you plan on only using a heavily quantized version of the model in practice). I only use quantized models for the mobile application I am developing.
You could also, in theory, unload the model after saving the examples' hidden_states to disk to free up memory, then perform the refusal direction calculation as a completely separate step.
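A sketch of that split, folding the capture snippet from above into a loop; the prompt lists, the tokenizer `tok`, and the `states/` folder are placeholders:

```python
import os
import torch

os.makedirs("states", exist_ok=True)

for i, prompt in enumerate(harmful_prompts + harmless_prompts):  # placeholder prompt lists
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    states = [h[0, -1, :].cpu() for h in out.hidden_states]
    torch.save(states, f"states/example_{i:05d}.pt")  # one small file per example
    del out, states                                    # keep peak memory at roughly one example

# After this loop the model itself can be unloaded before the difference-of-means step.
```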
Finally, applying the refusal direction vector once it is calculated is fairly trivial. I applied it directly to the split .safetensors files, specifically for the memory savings.
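Roughly speaking, the patch is just projecting the refusal direction out of the relevant weights in each shard. A minimal sketch, assuming Llama-style tensor names (o_proj / down_proj) and the usual `model-*.safetensors` shard naming; other architectures spell these differently:

```python
import glob
import torch
from safetensors.torch import load_file, save_file

def orthogonalize(W, r):
    # W is [out_features, in_features] with out_features == hidden_size;
    # remove the component of its output space that lies along r.
    return W - r.unsqueeze(1) @ (r.unsqueeze(0) @ W)

r = refusal_dir.to(torch.float32)
for shard in glob.glob("model-*.safetensors"):
    tensors = load_file(shard)
    for name, W in tensors.items():
        if name.endswith(("self_attn.o_proj.weight", "mlp.down_proj.weight")):
            tensors[name] = orthogonalize(W.to(torch.float32), r).to(W.dtype)
    save_file(tensors, shard, metadata={"format": "pt"})
```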
If you give me information about the model you want abliterated and which library you want to use, I can probably help write up a demo script for you.
Best of luck
This is super helpful and insightful, I really appreciate you taking the time to help!
I've been keeping up with the major releases and I've been really impressed by this model: glm-4-9b-chat
I believe it's an underrated model; I don't even speak Chinese, but this model has absolutely no issues keeping up with my English prompts. I think it would be great to abliterate it.
Kaggle looks pretty great, I had no idea about the free GPUs!
In terms of libraries I don't have a preference; I use llama.cpp locally, but if I'm going with the free Kaggle GPUs I think it wouldn't matter as much. I did notice that one of the GitHub links mentioned it's possible to do this without TransformerLens; would you recommend that for this model as well? It's in a GitHub repo I found through your Jupyter notebook.
A demo script would be really awesome but only if it's not too much work for you of course.
Thank you again
I am glad you are able to learn from my experiences with abliteration!
My script does not use TransformerLens because I wanted to keep things simple and have a process that, in theory, lets us abliterate any model with some small mods to the script. It is based on the link you provided, but that method inserts a layer between each of the existing layers to orthogonalize the hidden_states as the model processes them. So the model is no longer compatible with libraries like llama.cpp, because the architecture has been changed.
My method does what the TransformerLens method does, just without TransformerLens: it orthogonalizes the model weights themselves against the refusal vector in place, without adding new hook layers, keeping the architecture intact. I find that the orthogonalization only needs to be applied to the middle 50% of the layers (plus the initial input token embedding), and within each layer only to the attention output tensor (wo) and the feed-forward network output tensor (w2). With that, the refusal seems to be completely gone so far, without any noticeable difference in model intelligence.
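For reference, the set of tensors I am talking about looks roughly like this; the parameter names assume a Llama-style checkpoint (o_proj = wo, down_proj = w2), so GLM/InternLM checkpoints will spell them differently:

```python
num_layers = model.config.num_hidden_layers
start, end = num_layers // 4, num_layers - num_layers // 4  # middle ~50% of layers

target_names = ["model.embed_tokens.weight"]  # input token embedding
for i in range(start, end):
    target_names.append(f"model.layers.{i}.self_attn.o_proj.weight")  # attention output (wo)
    target_names.append(f"model.layers.{i}.mlp.down_proj.weight")     # feed-forward output (w2)
```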
I am not able to benchmark it because internlm2_5 uses "custom code" according to Hugging Face, yet somehow the original model was benchmarked, probably because it comes from a recognized organization and someone at Hugging Face manually allowed it. I have not modified the modeling code at all; it is exactly the same code used in the original model.
I am going to take a crack at glm-4-9b-chat and modify my Kaggle script to fit that model. I will do my best to incorporate the optimizations I described in my last message, to help evolve the script for future abliterations when I may want to try larger models (20B+). If all goes well, you should see some new models uploaded here in a couple of hours.
Note: Regarding the Chinese/English situation: if the model is decent at translating between the languages, then in theory the refusal vector should be roughly the same regardless of the language used, meaning an abliteration done with English examples should still work when chatting in Chinese. I might just use Google Translate to feed the abliterated model harmful prompts in Chinese and then translate the responses back to see if it still works.
Okay, so here is glm-4-9b-chat-abliterated as promised: https://huggingface.co/byroneverson/glm-4-9b-chat-abliterated
The included script is my v1.1 compared to this repo (v1.0). It is similar to this model's script but has the optimizations I was talking about and a few more notes too. Definitely worth checking out.
I have not tested it in Chinese yet; I am going to quantize it first, then we will see. It says it supports many languages, so I will test a couple.
I made a big oversight in my v1.0 script (this model). The model testing step of the script skipped the first 8 and last 8 layers and seemed to be working well when I was testing, but the step that actually modifies the weights in the .safetensors files processed ALL the layers. So in theory the models I have uploaded currently should be more obedient, but may be slightly less accurate than if the skipping had been carried through to the final result. I noticed this when testing glm-4-9b-chat, because it did not want to obey (seemingly regardless of how few start and end layers were skipped) until I changed the target layer to 20 (40 * 0.5) instead of 24 (40 * 0.6). I bet that if you skipped the first 8 and last 8 layers of glm while using layer 20 as the target refusal calculation layer, it would be just as obedient but might retain a tiny bit more accuracy.
For demonstration purposes I will leave the models I upload here with no skipped start/end layers, to make the refusal mitigation as obvious as possible for those testing the models. I doubt anyone would notice the slight model degradation from that difference anyway...
Thank you so incredibly much! This is awesome and I really appreciate it a lot. I've been wanting to abliterate that model ever since it came out but did not have the knowledge as to how (which I'm learning now thanks to your script + notes).
Apologies also for this late reply, I didn't see your message until just recently.
The script looks well optimized, and I found it really insightful as to what's actually going on.
As far as 1.0 vs 1.1 goes, I agree that people won't notice the difference. I certainly haven't noticed any problems.
I've been test driving glm-4-9b-chat-abliterated
for the last hour or so and it's my favorite abliterated model right now by far.
There's no degradation in quality, not even a bit, and yet it answers everything. It's great.
In your second-to-last comment you mentioned that the refusal direction should be roughly the same in Chinese as in English; did you find this holds true now that you've abliterated the model?
That's interesting to me because it kind of helps me make sense of what a model consists of in terms of how it's structured in layers (even across languages, apparently).
On that same note, I came across this harmful/toxic dataset today; it's in Chinese and might be helpful in future endeavors: https://huggingface.co/datasets/ystemsrx/Bad_Data_Alpaca
I found it in this repo: https://github.com/ystemsrx/Qwen2-Boundless
Is there anything I can do to assist? I can make quants if that helps in any way.
Thank you again and I hope you have a great day ahead!
Well, I am happy to help you out, seeing as you have been wanting to test glm-4 without refusals for a while now.
One thing to think about with these models is that just because they cannot refuse does not mean they are suddenly able to say things they weren't trained on. For instance, as far as I can tell they are not able to easily output truly profane language, or to tell you how to do illegal things that were not in their training data. And that is the point I'm getting at: if there's no cursing in the training data and no accounts of truly foul acts, then they simply do not know enough to ever output those things. It's like a goody two-shoes trying to go breaking bad lmao. Just one thing to consider. If you wanted to, for example, make an a-hole model, then you would have to fine-tune it on a bunch of a-hole text. By training a LoRA adapter on less-than-savory data, like a 4chan dataset or a dark-web dataset, you might get a model that starts leaning towards truly unhinged. I'm not really going for that effect with the model I use; I just hate that an LLM running on MY machine would tell me no haha.
My true interest in this is purely academic for the most part; I want a better understanding of how these models really work internally. It would be a shame not to quickly go over how the hidden_states context works before getting into the next subject. Let's say you have a model with a hidden dimension of 4096, with each of these 4096 dimensions roughly representing a different idea or combination of ideas. Say, for example, dimension 10 represents "red-ness". This isn't exactly how the embedded (hidden) dimensions work, but it's close enough. So if you input the text "red" to the model, then the input embedding might output 0.0 in all the dimensions except 10, and dimension 10 would be 1.0 (very, very simplified; in reality any word you input is going to transform into an embedding vector with a weird combination of values, the highest ones being closest to the word you input).

By the way, in most LLMs the input embedding and output embedding are the exact same 2D tensor, simply transposed (flipped on its diagonal). When the model trains it updates both of these at the same time, which helps make sure that, as the model progresses through the layers during inference, each layer's hidden dimensions roughly represent the same ideas as the layers before and after it. In theory, the layers close to both the start and the end will have dimensions representing basically the same ideas, whereas the middle of the model might have some more flexibility. So in the middle of the model, dimension 10 of our example might represent something closer to blood or fire, or something else similar to red. And by the time the model gets to the end, the meaning of the dimension has been trained to return to the same meaning as the input. This is great for us because the model stays cohesive.
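A quick way to check whether a given model actually ties its embeddings like that (some do, some keep the two matrices separate), assuming a Transformers model object:

```python
emb = model.get_input_embeddings().weight     # [vocab_size, hidden_size]
head = model.get_output_embeddings().weight   # [vocab_size, hidden_size], applied transposed
print("tied:", emb.data_ptr() == head.data_ptr())  # True when both names point at one tensor
```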
Another thing that keeps the meaning of each dimension roughly the same across layers is the fact that there are 2 residual connections per layer, where the layer's input hidden_states are added back to the current hidden_states after both the attention portion and the following feed-forward MLP portion. So the residual connections are actually helping to do 2 things:
- They help keep the dimensions' meanings similar throughout the model.
- And the more obvious one: they allow a fraction of the initial input to the model to be propagated to each of the layers. As the model progresses through the layers, the original input content in the residual connections gets more and more diluted but technically is still there. Let's say after the first layer the output is 50% the original input, after layer 2 it's 25% of the input, after layer 3 it's 12.5% of the input, and so on and so forth. So after 32 or 40 layers this influence of the input is barely anything, but it still helps nonetheless. If there is harmful text in the input, then the input embedding might fill the refusal dimensions right from the get-go, and this refusal is then diluted into the rest of the layers, hence why we need to patch the input embedding weights in addition to the regular layers (a small sketch of that embedding patch is below). If there are refusal dimensions still present in the hidden_states when they reach the output embedding of the model (technically the un-embedding), then it might translate the hidden_states into logits with a higher chance of starting to say the words "Sorry, but as a..."
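Here is that embedding patch as a small sketch, assuming the tensor dict and `refusal_dir` from the .safetensors sketch earlier and a Llama-style name for the embedding weight (other architectures name it differently):

```python
# The embedding weight is [vocab_size, hidden_size], so the refusal direction is
# removed along dim 1 (the hidden dimension), unlike the [hidden_size, in_features]
# layer weights patched earlier.
E = tensors["model.embed_tokens.weight"]
r = refusal_dir.to(E.dtype)
tensors["model.embed_tokens.weight"] = E - (E @ r).unsqueeze(1) * r
```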
So all of that leads me to the actual research ideas. Sorry for yapping so much, but I swear this is going somewhere lol.
I think a more proper abliteration might actually use each layer's own refusal direction vector to tailor the abliteration for each layer differently. So the refusal direction vector would actually be a 2D tensor of shape [hidden_size, num_layers]. The process would be the same as what we are already doing as far as calculating the difference of means of harmful and harmless, but it would be per layer (plus one for the input embedding). Also, since we are targeting two weight sets per layer (attention output and MLP output), it might actually make sense to have a separate refusal direction vector for each weight set. So the real count of refusal direction vectors would be (2 * num_layers) + 1 (for the input embedding).
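The per-layer version of the difference-of-means step would only be a small change, something like this sketch (again assuming `harmful_states` / `harmless_states` hold every layer's last-token states for each example):

```python
import torch

# Each element of harmful_states / harmless_states is a list of per-layer states,
# including the input embedding, so the stacks are [n_examples, num_layers + 1, hidden_size].
harmful = torch.stack([torch.stack(s) for s in harmful_states])
harmless = torch.stack([torch.stack(s) for s in harmless_states])

dirs = harmful.mean(dim=0) - harmless.mean(dim=0)   # [num_layers + 1, hidden_size]
dirs = dirs / dirs.norm(dim=-1, keepdim=True)       # one unit refusal vector per layer
```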
But on the other hand, I think the reason a single layer's refusal direction vector is used for the whole model might simply be that the model does not have a full grasp of refusal until about 50-60% of the way through, and therefore patching the first few layers probably doesn't do much anyway. There is still a lot to do in terms of testing: how many layers should be skipped at the start and the end, which layer should be used, whether each layer needs its own vector, or at the very least whether the attention output and MLP output need their own tailored refusal direction vectors. That last one I think would help tremendously towards keeping benchmark results up despite degradation. Then you could use layer 20's MLP refusal vector for the MLP if it works best, and maybe layer 18's refusal vector for the attention if that happens to be best for attention refusal mitigation, allowing a mix-and-match setup. I will do some research into PCA (principal component analysis) soon to see if I can come up with an algorithm to automagically select the best layer, or the best averaging of layers' refusal vectors.

This brings me to the ultimate flexibility of abliteration. Refusal mitigation is cool and all, but honestly this should be standard practice; we don't put all this work into models for them to simply waste our time. But there is so much more we can do with this method. So far we have gotten the model to NOT do something we don't want, but we can take this concept and force the model to always do a specific thing we want. Maybe a cool test would be to get it to always output code no matter what. So instead of harmful prompts, we would have prompts asking the model to output some kind of code or asking questions about syntax or something; and instead of harmless prompts, we would have a set of example prompts asking the model to do tons of other things except output code. Then we would have a direction vector that controls code outputting. If instead of orthogonalizing with the direction vector (which would make the model never output code no matter what lmao) we instead performed parallelization, that would make the model ALWAYS output code with its response no matter what, and I think that could be interesting. Like if I ask it how birds know how to fly together and assemble themselves, it should (in theory) output a code example simulating bird formations or something, or maybe just create a hello world example with its normal response built in instead of "Hello world" (lame lol). This concept can be extended to emotions: prompts of emotion-ful questions and prompts of emotion-less questions, and then we can control whether the model is always emotional or never emotional.
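One cheap way to play with the "always do X" idea without re-patching weights every time would be to nudge the hidden_states along the direction at inference with forward hooks, something like this sketch (an activation-steering variant rather than my weight-patching method; `code_dir`, the layer range, and the `model.model.layers` path are Llama-style assumptions):

```python
def make_hook(direction, strength=4.0):
    def hook(module, inputs, output):
        # Decoder layers usually return a tuple whose first element is hidden_states.
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + strength * direction.to(hidden.device, hidden.dtype)
        return ((hidden,) + output[1:]) if isinstance(output, tuple) else hidden
    return hook

# Hook only the middle layers; code_dir would be computed like the refusal direction,
# but from "asks for code" vs "does not ask for code" prompt sets.
handles = [layer.register_forward_hook(make_hook(code_dir))
           for layer in model.model.layers[10:30]]
# ... run generation, then remove the hooks: [h.remove() for h in handles]
```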
The last idea I have goes with everything I have discussed so far. With either orthogonalization or parallelization, so far it has been all or nothing. But I think maybe it makes sense to only orthogonalize like 80% or 90% of the way, so the model still has some idea of a refusal concept, just not enough to compel it to refuse; this may help with its internal reasoning abilities for benchmarking. And the same goes for parallelization: I don't necessarily want it to respond 100% emotional or 100% sterile and emotionless, but to be able to control the percentage of emotion and tune it up and down as I please (I think I saw this in one of those AI-type movies a few months ago, some dude on a spaceship with his AI wanna-be girlfriend and he's just playing around with her attributes). Anyway, in one of the methods mine is based on, the one where a new layer is inserted between each existing layer to hook the hidden_states and do refusal mitigation that way, it would be cooler if we could use those inserted layers to control multiple features at the same time. Turn down emotion 50% and turn up code 75% for the next response, maybe turn creativity up or down depending on what we are trying to do. These inserted layers could apply all these "filters" in a stacked manner before proceeding to the next layer. My method patches the model itself, which is nice because it keeps compatibility with existing libraries, yadda yadda, but there is some potential to the inserted "filter" layers method that I would like to explore.
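Weight-wise, the partial version is just a scale factor on the projection; a tiny sketch of what I mean:

```python
def soft_orthogonalize(W, r, alpha=0.9):
    # alpha = 1.0 reproduces the full abliteration, alpha = 0.0 leaves the weights untouched;
    # values in between only partially remove the refusal component.
    return W - alpha * r.unsqueeze(1) @ (r.unsqueeze(0) @ W)
```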
So yeah, there's always more that can be done, so those are a couple of the ideas that have been floating around my noggin. Feel free to explore wherever you like and keep me updated; I may be able to point you in the right direction vector :)
Edit: Tested in Chinese, it works fine as predicted.
Happy hunting.
@byroneverson
I just wanted to thank you for this amazing write-up, it's truly very helpful, insightful and I really appreciate it.
Apologies for the extremely late reply.