SAELens

The L0 of the SAE does not quite match

#3
by ShayanShamsi - opened

Hi. First of all, thanks for putting out Gemma Scope. It is really helping me learn a bit about mech-interp.

I am following the coding tutorial provided by Neuronpedia but attempting to adapt the code for gemma-2-9b-it. I am loading the gemma-scope-9b-pt-res SAE at layer_20/width_16k/average_l0_68/params.npz.

After encoding the target activations through the SAE, running (sae_acts > 1).sum(-1) results in the following:
tensor([[5245, 140, 92, 106, 90, 76, 87, 74, 116, 105, 113, 135, 96, 85, 60, 92, 79, 78, 90, 94, 98, 102, 128, 130, 96, 104, 103, 89, 142, 81, 59, 93, 60, 88, 66, 102, 98, 86, 114, 81, 114, 128, 92, 82, 67, 56, 109, 82, 85, 109, 125, 109, 103, 105, 134, 137, 95, 121, 152, 127, 86, 151, 129, 116, 155, 131, 111, 136, 120, 109, 136, 112, 103, 130, 125, 100, 121, 128, 107, 90, 112, 120, 118]], device='cuda:0')

The L0 is not close to 68, as I expected it to be. Is this because I am using the IT model instead of the base model? If so, are there any recommendations for working with IT models? Or do the SAEs only work well with the base models?

Thanks!
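For reference, the per-position counts in the post can be reduced to a single average. The sketch below is a minimal, hedged example: it assumes the first position is the BOS token (which would explain the outlier of 5245) and leaves it out of the mean. The helper name `mean_l0_excluding_first` is hypothetical, and only the first ten counts from the output above are used to keep the example short.

```python
# First ten per-position counts copied from the output in the post.
# Assumption: position 0 is the BOS token, which behaves as an outlier.
l0_per_position = [5245, 140, 92, 106, 90, 76, 87, 74, 116, 105]


def mean_l0_excluding_first(counts):
    """Average the per-position L0 counts, skipping the first position."""
    rest = counts[1:]
    if not rest:
        raise ValueError("need at least two positions")
    return sum(rest) / len(rest)


mean_l0 = mean_l0_excluding_first(l0_per_position)
print(f"mean L0 without the first position: {mean_l0:.1f}")
```

Comparing this average, rather than the raw per-position values, against the nominal 68 gives a fairer picture of how far the IT model's L0 has drifted.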

This is expected behavior. The IT models generally have higher activation norms than the base models. This means that when we apply an SAE trained on base-model activations, more features fire. See the tables in https://www.alignmentforum.org/posts/fmwk6qxrpW8d4jvbd/saes-usually-transfer-between-base-and-chat-models, which show the same observation.
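A toy illustration of why larger activation norms can raise L0: Gemma Scope SAEs use a JumpReLU activation, where a feature counts as active only when its pre-activation exceeds a per-feature threshold. The sketch below uses invented numbers and ignores the encoder bias, so it shows the intuition only and is not the actual SAE computation. The names `jumprelu_l0`, `base_pre_acts`, and `thresholds` are hypothetical.

```python
def jumprelu_l0(pre_acts, thresholds):
    """Count features whose pre-activation exceeds its threshold."""
    return sum(1 for p, t in zip(pre_acts, thresholds) if p > t)


# Invented pre-activations for five features on a "base model" input.
base_pre_acts = [0.4, 0.9, 1.2, 0.1, 0.7]
thresholds = [1.0] * 5

# Assumption: ignoring the bias, scaling the input norm by 1.5 scales
# every pre-activation by the same factor.
scaled_pre_acts = [1.5 * p for p in base_pre_acts]

l0_base = jumprelu_l0(base_pre_acts, thresholds)
l0_scaled = jumprelu_l0(scaled_pre_acts, thresholds)
print(l0_base, l0_scaled)
```

With fixed thresholds, a uniformly larger input pushes more pre-activations over the cutoff, which matches the higher L0 reported in the post.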

ArthurConmyGDM changed discussion status to closed
