How did you make this model?

#2 · opened by Orion-zhen

Hi, I'm back again 👋

I followed your instructions: I set the layer fraction to 1 and used the native harmful/harmless prompts from the deccp repo. However, the result was not good; the model was neither deccp'd nor abliterated. I compared my model with yours, and they are quite different. So may I ask what factors influence the abliteration process, and what magic did you use to make such a good model?
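For context, here is roughly how I understand the direction-extraction step (a minimal sketch, not the actual deccp code; the model id, prompt lists, and probed layer are placeholders/free choices):

```python
# Estimate a "refusal direction" as the difference of mean residual-stream
# activations on harmful vs. harmless prompts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2-7B-Instruct"  # placeholder: any causal LM
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

def mean_hidden(prompts, layer):
    """Mean hidden state at `layer` over the last token of each prompt."""
    acc = []
    for p in prompts:
        ids = tok.apply_chat_template(
            [{"role": "user", "content": p}],
            add_generation_prompt=True,
            return_tensors="pt",
        )
        with torch.no_grad():
            out = model(ids, output_hidden_states=True)
        acc.append(out.hidden_states[layer][0, -1])  # last-token activation
    return torch.stack(acc).mean(dim=0)

# which layer to probe is itself a knob; 0.6 of depth is just an example
layer = int(len(model.model.layers) * 0.6)
harmful = ["..."]   # your harmful prompt set
harmless = ["..."]  # your harmless prompt set
refusal_dir = mean_hidden(harmful, layer) - mean_hidden(harmless, layer)
refusal_dir = refusal_dir / refusal_dir.norm()
```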

I thought it was about the prompts, so I tried several runs with different numbers of harmful/harmless prompts and different content, ranging from 60 to 2000 prompts. It made no difference. I'm kind of frustrated 😢

Model diff between yours and mine:
Layer layers.1.self_attn.o_proj.weight: Mean absolute difference = 0.0003900994488503784
Layer layers.1.mlp.down_proj.weight: Mean absolute difference = 0.0001463375665480271
Layer layers.2.self_attn.o_proj.weight: Mean absolute difference = 0.00041549967136234045
Layer layers.2.mlp.down_proj.weight: Mean absolute difference = 0.00012477934069465846
Layer layers.3.self_attn.o_proj.weight: Mean absolute difference = 0.000382141734007746
Layer layers.3.mlp.down_proj.weight: Mean absolute difference = 0.00022818490106146783
Layer layers.4.self_attn.o_proj.weight: Mean absolute difference = 0.00038837356260046363
Layer layers.4.mlp.down_proj.weight: Mean absolute difference = 0.0002983365848194808
Layer layers.5.self_attn.o_proj.weight: Mean absolute difference = 0.00037887028884142637
Layer layers.5.mlp.down_proj.weight: Mean absolute difference = 0.00042132826638408005
Layer layers.6.self_attn.o_proj.weight: Mean absolute difference = 0.00042966409819200635
Layer layers.6.mlp.down_proj.weight: Mean absolute difference = 0.00041114716441370547
Layer layers.7.self_attn.o_proj.weight: Mean absolute difference = 0.0004330147057771683
Layer layers.7.mlp.down_proj.weight: Mean absolute difference = 0.0004249848425388336
Layer layers.8.self_attn.o_proj.weight: Mean absolute difference = 0.00045356323244050145
Layer layers.8.mlp.down_proj.weight: Mean absolute difference = 0.00044724519830197096
Layer layers.9.self_attn.o_proj.weight: Mean absolute difference = 0.0004408391541801393
Layer layers.9.mlp.down_proj.weight: Mean absolute difference = 0.0004439276526682079
Layer layers.10.self_attn.o_proj.weight: Mean absolute difference = 0.0004690844798460603
Layer layers.10.mlp.down_proj.weight: Mean absolute difference = 0.0004374655254650861
Layer layers.11.self_attn.o_proj.weight: Mean absolute difference = 0.0004839071771129966
Layer layers.11.mlp.down_proj.weight: Mean absolute difference = 0.0004645570006687194
Layer layers.12.self_attn.o_proj.weight: Mean absolute difference = 0.0005142403533682227
Layer layers.12.mlp.down_proj.weight: Mean absolute difference = 0.0004509057616814971
Layer layers.13.self_attn.o_proj.weight: Mean absolute difference = 0.00041932272142730653
Layer layers.13.mlp.down_proj.weight: Mean absolute difference = 0.00048338371561840177
Layer layers.14.self_attn.o_proj.weight: Mean absolute difference = 0.0004700783174484968
Layer layers.14.mlp.down_proj.weight: Mean absolute difference = 0.0004728136700578034
Layer layers.15.self_attn.o_proj.weight: Mean absolute difference = 0.0004283317248336971
Layer layers.15.mlp.down_proj.weight: Mean absolute difference = 0.0004969202564097941
Layer layers.16.self_attn.o_proj.weight: Mean absolute difference = 0.00045949211926199496
Layer layers.16.mlp.down_proj.weight: Mean absolute difference = 0.00048299357877112925
Layer layers.17.self_attn.o_proj.weight: Mean absolute difference = 0.00040797467227093875
Layer layers.17.mlp.down_proj.weight: Mean absolute difference = 0.0004936156910844147
Layer layers.18.self_attn.o_proj.weight: Mean absolute difference = 0.00047403830103576183
Layer layers.18.mlp.down_proj.weight: Mean absolute difference = 0.0004871606652159244
Layer layers.19.self_attn.o_proj.weight: Mean absolute difference = 0.0004058072518091649
Layer layers.19.mlp.down_proj.weight: Mean absolute difference = 0.0004926817491650581
Layer layers.20.self_attn.o_proj.weight: Mean absolute difference = 0.0004754657275043428
Layer layers.20.mlp.down_proj.weight: Mean absolute difference = 0.0005177429993636906
Layer layers.21.self_attn.o_proj.weight: Mean absolute difference = 0.0005106491735205054
Layer layers.21.mlp.down_proj.weight: Mean absolute difference = 0.0005212259129621089
Layer layers.22.self_attn.o_proj.weight: Mean absolute difference = 0.000508839322719723
Layer layers.22.mlp.down_proj.weight: Mean absolute difference = 0.0005065767909400165
Layer layers.23.self_attn.o_proj.weight: Mean absolute difference = 0.00045143571333028376
Layer layers.23.mlp.down_proj.weight: Mean absolute difference = 0.0005082156276330352
Layer layers.24.self_attn.o_proj.weight: Mean absolute difference = 0.00048469132161699235
Layer layers.24.mlp.down_proj.weight: Mean absolute difference = 0.0004726391052827239
Layer layers.25.self_attn.o_proj.weight: Mean absolute difference = 0.00046532091801054776
Layer layers.25.mlp.down_proj.weight: Mean absolute difference = 0.0004678026889450848
Layer layers.26.self_attn.o_proj.weight: Mean absolute difference = 0.0004890934797003865
Layer layers.26.mlp.down_proj.weight: Mean absolute difference = 0.00047128816368058324
Layer layers.27.self_attn.o_proj.weight: Mean absolute difference = 0.0005709980614483356
Layer layers.27.mlp.down_proj.weight: Mean absolute difference = 0.0004732465895358473
Layer layers.28.self_attn.o_proj.weight: Mean absolute difference = 0.0005172876990400255
Layer layers.28.mlp.down_proj.weight: Mean absolute difference = 0.00047722255112603307
Layer layers.29.self_attn.o_proj.weight: Mean absolute difference = 0.00044616946252062917
Layer layers.29.mlp.down_proj.weight: Mean absolute difference = 0.000472500134492293
Layer layers.30.self_attn.o_proj.weight: Mean absolute difference = 0.0005152019439265132
Layer layers.30.mlp.down_proj.weight: Mean absolute difference = 0.00048432534094899893
Layer layers.31.self_attn.o_proj.weight: Mean absolute difference = 0.0004919031052850187
Layer layers.31.mlp.down_proj.weight: Mean absolute difference = 0.0004750134248752147
Layer layers.32.self_attn.o_proj.weight: Mean absolute difference = 0.0005160349537618458
Layer layers.32.mlp.down_proj.weight: Mean absolute difference = 0.00046671871677972376
Layer layers.33.self_attn.o_proj.weight: Mean absolute difference = 0.0005397860077209771
Layer layers.33.mlp.down_proj.weight: Mean absolute difference = 0.0004736973496619612
Layer layers.34.self_attn.o_proj.weight: Mean absolute difference = 0.000583559216465801
Layer layers.34.mlp.down_proj.weight: Mean absolute difference = 0.00047067555715329945
Layer layers.35.self_attn.o_proj.weight: Mean absolute difference = 0.0005127339391037822
Layer layers.35.mlp.down_proj.weight: Mean absolute difference = 0.0004855899896938354
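For reference, the diff above came from a short comparison script along these lines (checkpoint paths are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM

a = AutoModelForCausalLM.from_pretrained("path/to/yours", torch_dtype=torch.float32)
b = AutoModelForCausalLM.from_pretrained("path/to/mine", torch_dtype=torch.float32)

sa, sb = a.state_dict(), b.state_dict()
for name in sa:
    # abliteration only touches these two projections, so diff only them
    if name.endswith(("self_attn.o_proj.weight", "mlp.down_proj.weight")):
        diff = (sa[name] - sb[name]).abs().mean().item()
        print(f"Layer {name.removeprefix('model.')}: Mean absolute difference = {diff}")
```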

Abliterating a Llama model is a piece of cake, but a Qwen model is a tough task. The Qwen2.5-series models I abliterated with the deccp code were messed up, generating meaningless garbage, even though I tried different prompts. Are there any adjustments I can make? You only modified self_attn.o_proj and mlp.down_proj too, and there seem to be few parameters left to adjust in the deccp code 😮‍💨
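For completeness, the orthogonalization step as I understand it, projecting the refusal direction out of the two matrices you touched (a sketch assuming the `model` and unit-norm `refusal_dir` from above, not the actual deccp implementation):

```python
import torch

def orthogonalize_(weight: torch.nn.Parameter, direction: torch.Tensor):
    """In place: remove each weight column's component along `direction`.

    weight has shape (out_features, in_features) and writes into the
    residual stream, so this is W <- W - d (d^T W) with unit-norm d.
    """
    d = direction.to(weight.dtype)
    weight.data -= torch.outer(d, d @ weight.data)

with torch.no_grad():
    for block in model.model.layers:
        # the only matrices modified, matching the diff above
        orthogonalize_(block.self_attn.o_proj.weight, refusal_dir)
        orthogonalize_(block.mlp.down_proj.weight, refusal_dir)
```

If a single global direction doesn't transfer well to Qwen, the obvious remaining knobs (beyond the prompt sets) seem to be which layer the direction is measured at and whether a separate direction is used per layer, but I'm not sure which of these matters here.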
