Recurrent Neural Networks Learn to Store and Generate Sequences using Non-Linear Representations Paper • 2408.10920 • Published Aug 20, 2024
ScoNe: Benchmarking Negation Reasoning in Language Models With Fine-Tuning and In-Context Learning Paper • 2305.19426 • Published May 30, 2023
CEBaB: Estimating the Causal Effects of Real-World Concepts on NLP Model Behavior Paper • 2205.14140 • Published May 27, 2022
Rigorously Assessing Natural Language Explanations of Neurons Paper • 2309.10312 • Published Sep 19, 2023
Linear Representations of Sentiment in Large Language Models Paper • 2310.15154 • Published Oct 23, 2023
Neural Natural Language Inference Models Partially Embed Theories of Lexical Entailment and Negation Paper • 2004.14623 • Published Apr 30, 2020
A Reply to Makelov et al. (2023)'s "Interpretability Illusion" Arguments Paper • 2401.12631 • Published Jan 23, 2024
pyvene: A Library for Understanding and Improving PyTorch Models via Interventions Paper • 2403.07809 • Published Mar 12, 2024
Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations Paper • 2303.02536 • Published Mar 5, 2023
RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations Paper • 2402.17700 • Published Feb 27, 2024
Interpretability at Scale: Identifying Causal Mechanisms in Alpaca Paper • 2305.08809 • Published May 15, 2023