Neel Nanda

NeelNanda

https://neelnanda.io

AI & ML interests

Mechanistic Interpretability

Organizations

None yet

NeelNanda's activity

authored a paper 29 days ago

Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models

Paper • 2411.14257 • Published 30 days ago • 9

updated a model about 2 months ago

NeelNanda/crosscoders-gpt2-small

Updated Oct 27 • 5

authored a paper 4 months ago

Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2

Paper • 2408.05147 • Published Aug 9 • 38

liked 4 models 6 months ago

authored a paper 6 months ago

Confidence Regulation Neurons in Language Models

Paper • 2406.16254 • Published Jun 24 • 10

updated a model 8 months ago

NeelNanda/GELU_1L512W_C4_Code

Updated Apr 23 • 3.1k • 2

authored 3 papers 9 months ago

Progress measures for grokking via mechanistic interpretability

Paper • 2301.05217 • Published Jan 12, 2023

Emergent Linear Representations in World Models of Self-Supervised Sequence Models

Paper • 2309.00941 • Published Sep 2, 2023 • 1

Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching

Paper • 2311.17030 • Published Nov 28, 2023

reacted to gsarti's post with ❤️ 10 months ago

🔍 Today's pick in Interpretability & Analysis of LMs: AtP*: An efficient and scalable method for localizing LLM behaviour to components, by J. Kramár, T. Lieberum, R. Shah, and @NeelNanda

The attribution patching method (AtP) can provide fast and effective approximations of activation patching, requiring only two forward passes and one backward pass to estimate the contribution of all network components for a given prompt pair.
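
As a rough illustration of the idea above, here is a minimal sketch on a toy two-component "model". It assumes the patching direction in which noise activations are substituted into the clean run and the gradient is taken on the clean run. The functions `hidden`, `readout`, and `readout_grad` are hypothetical stand-ins, not code from the paper. Exact activation patching needs one extra evaluation per component, while the AtP estimate reuses a single gradient for all of them.

```python
import math

def hidden(x):
    # Two hypothetical "components" whose activations can be patched.
    return [2.0 * x[0], -1.0 * x[1]]

def readout(h):
    # Nonlinear map from component activations to a scalar metric.
    return math.tanh(h[0]) + 0.5 * h[1] ** 2

def readout_grad(h):
    # Analytic gradient of the metric w.r.t. each activation
    # (this plays the role of the single backward pass).
    return [1.0 - math.tanh(h[0]) ** 2, h[1]]

def activation_patching(x_clean, x_noise):
    # Exact effect: one extra metric evaluation per patched component.
    h_clean, h_noise = hidden(x_clean), hidden(x_noise)
    base = readout(h_clean)
    effects = []
    for i in range(len(h_clean)):
        patched = list(h_clean)
        patched[i] = h_noise[i]
        effects.append(readout(patched) - base)
    return effects

def attribution_patching(x_clean, x_noise):
    # First-order estimate: (noise - clean) activation times clean-run gradient.
    h_clean, h_noise = hidden(x_clean), hidden(x_noise)
    grad = readout_grad(h_clean)
    return [(hn - hc) * g for hc, hn, g in zip(h_clean, h_noise, grad)]

exact = activation_patching([0.1, 0.5], [0.15, 0.6])
approx = attribution_patching([0.1, 0.5], [0.15, 0.6])
```

When the clean and noise activations are close, as in this example, `approx` tracks `exact` closely; the approximation degrades as the gap between them grows.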

While previous work highlighted the effectiveness of attribution patching, the authors identify two settings in which AtP produces false negatives:

- When estimating the contribution of pre-activation components, if the clean and noise inputs don’t lie in the same activation region, the first-order approximation given by the gradient leads to large errors (Fig 3).
- When the sum of direct and indirect effects is close to 0, even small approximation errors introduced by nonlinearities can greatly affect the estimated contribution.
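
Both failure modes can be reproduced on scalar toy functions. The sketch below is only an illustration under assumed settings: a sigmoid stands in for a saturating nonlinearity, and the metric `n - 1.1 * tanh(n)` is a made-up example of a direct path and an indirect path that nearly cancel. Neither is taken from the paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def saturation_case(z_clean=10.0, z_noise=-10.0):
    # Clean and noise pre-activations sit in opposite saturated regions.
    true = sigmoid(z_noise) - sigmoid(z_clean)
    # Gradient at the clean point is nearly zero because the sigmoid is flat there.
    grad = sigmoid(z_clean) * (1.0 - sigmoid(z_clean))
    estimate = (z_noise - z_clean) * grad
    return true, estimate

def cancellation_case(n_clean=0.0, n_noise=1.0):
    # Hypothetical metric: direct path (+n) and indirect path (-1.1 * tanh(n)).
    def metric(n):
        return n - 1.1 * math.tanh(n)
    true = metric(n_noise) - metric(n_clean)
    # Derivative of the metric at the clean point: 1 - 1.1 * (1 - tanh(n)^2).
    grad = 1.0 - 1.1 * (1.0 - math.tanh(n_clean) ** 2)
    estimate = (n_noise - n_clean) * grad
    return true, estimate

sat_true, sat_est = saturation_case()
can_true, can_est = cancellation_case()
```

In the saturation case the true effect is close to -1 while the estimate is close to 0. In the cancellation case the linearization error in the indirect path alone is moderate, but because the two paths nearly cancel, the estimated total even has the wrong sign.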

The authors propose two changes to the AtP method to mitigate these issues:

- Recomputing the attention softmax for the selected component, and then taking a linear approximation to the remaining part of the model (QK Fix)
- Iteratively zeroing gradients at layers contributing to the indirect effects causing cancellation (GradDrop)
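
The first of these changes can be sketched on a single query attending over two keys. This is a simplified illustration, not the paper's implementation: the scalar `VALUES`, the `tanh` readout, and all function names are assumptions chosen to keep the example self-contained. The second change (GradDrop) is not covered here.

```python
import math

VALUES = [1.0, -1.0]  # hypothetical scalar value attached to each key

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def metric(scores):
    # Attention-weighted sum of values, followed by a nonlinear readout.
    attn = softmax(scores)
    return math.tanh(sum(a * v for a, v in zip(attn, VALUES)))

def grad_wrt_attn(attn):
    # Gradient of the metric w.r.t. the attention probabilities.
    z = sum(a * v for a, v in zip(attn, VALUES))
    return [(1.0 - math.tanh(z) ** 2) * v for v in VALUES]

def grad_wrt_scores(scores):
    # Chain the attention gradient through the softmax Jacobian.
    attn = softmax(scores)
    g_attn = grad_wrt_attn(attn)
    mean = sum(a * g for a, g in zip(attn, g_attn))
    return [a * (g - mean) for a, g in zip(attn, g_attn)]

def plain_atp(clean, noise):
    # Linearize everything, including the softmax, around the clean scores.
    g = grad_wrt_scores(clean)
    return sum((n - c) * gi for c, n, gi in zip(clean, noise, g))

def qk_fix_atp(clean, noise):
    # Recompute the softmax exactly, then linearize only what comes after it.
    attn_clean, attn_noise = softmax(clean), softmax(noise)
    g = grad_wrt_attn(attn_clean)
    return sum((an - ac) * gi for ac, an, gi in zip(attn_clean, attn_noise, g))

def true_effect(clean, noise):
    return metric(noise) - metric(clean)

clean_scores, noise_scores = [8.0, 0.0], [0.0, 8.0]
plain = plain_atp(clean_scores, noise_scores)
fixed = qk_fix_atp(clean_scores, noise_scores)
exact = true_effect(clean_scores, noise_scores)
```

With the saturated clean scores used here, `plain` is close to zero even though `exact` is roughly -1.5. `fixed` recovers a large negative effect but still differs from `exact`, because the part of the model after the softmax is only linearly approximated.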

AtP and AtP* are compared across several patching settings for Pythia models and are found to be effective while being much less computationally expensive than other approaches. A new methodology is also proposed to estimate the magnitude of AtP* false negatives given a set of samples and desired confidence levels.

📄 Paper: AtP*: An efficient and scalable method for localizing LLM behaviour to components (2403.00745)

🔍 All daily picks: gsarti/daily-picks-in-interpretability-and-analysis-of-lms-65ae3339949c5675d25de2f9