Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models
Abstract
Hallucinations in large language models are a widespread problem, yet the mechanisms that determine whether a model will hallucinate are poorly understood, limiting our ability to address it. Using sparse autoencoders as an interpretability tool, we discover that a key part of these mechanisms is entity recognition, where the model detects if an entity is one it can recall facts about. Sparse autoencoders uncover meaningful directions in the representation space; these detect whether the model recognizes an entity, e.g., whether it does not know about an athlete or a movie. This suggests that models can have self-knowledge: internal representations about their own capabilities. These directions are causally relevant: they can steer the model to refuse to answer questions about known entities, or to hallucinate attributes of unknown entities when it would otherwise refuse. We demonstrate that despite the sparse autoencoders being trained on the base model, these directions have a causal effect on the chat model's refusal behavior, suggesting that chat finetuning has repurposed this existing mechanism. Furthermore, we provide an initial exploration into the mechanistic role of these directions in the model, finding that they disrupt the attention of downstream heads that typically move entity attributes to the final token.
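As a rough illustration of the detection setup described above, the sketch below probes a hypothetical "unknown entity" SAE latent at the final token of an entity mention. This is a minimal sketch, not the paper's released code: the model name, layer index, latent index, threshold, and the randomly initialized SAE encoder weights are all placeholders that would come from a trained sparse autoencoder in practice.

```python
# Hypothetical sketch: read off an "unknown entity" SAE latent at the entity's final token.
# All constants below are assumptions, not values from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "google/gemma-2b"   # assumption: any HF causal LM works the same way
LAYER = 9                        # assumption: layer whose residual stream the SAE was trained on
UNKNOWN_LATENT = 1234            # assumption: index of the "unknown entity" latent
THRESHOLD = 1.0                  # assumption: tuned on held-out known/unknown entities

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

# Placeholder SAE encoder parameters; in practice these come from a trained SAE.
d_model, d_sae = model.config.hidden_size, 16384
W_enc = torch.randn(d_model, d_sae) * 0.01
b_enc = torch.zeros(d_sae)

def unknown_entity_score(prompt: str) -> float:
    """Activation of the 'unknown entity' latent at the last token of the prompt."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    resid = out.hidden_states[LAYER][0, -1]          # residual stream after LAYER, last token
    latents = torch.relu(resid @ W_enc + b_enc)      # SAE encoder: ReLU(x W_enc + b_enc)
    return latents[UNKNOWN_LATENT].item()

score = unknown_entity_score("Fact: the basketball player Wilson Brown plays the position of")
print("likely unfamiliar entity" if score > THRESHOLD else "likely familiar entity")
```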
Community
Entity recognition capabilities are an important factor in determining the hallucinatory behavior of LLMs. In this work, SAE latents are used to detect whether the model has internal knowledge about specific entities. The identified directions can then steer the model to refuse to answer questions about known entities or to hallucinate attributes of unknown entities, and their effect is found to transfer to the chat-tuned version of the same model; a sketch of such a steering intervention follows below.
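For illustration, here is a minimal sketch of how such a steering intervention could be implemented: adding the latent's decoder direction into the residual stream during generation so the chat model refuses questions about entities it actually knows. The model name, layer index, steering scale, and the random direction are stand-ins, not the paper's actual settings; in practice the direction would be the SAE decoder row of the "unknown entity" latent.

```python
# Hypothetical steering sketch: add a (placeholder) "unknown entity" direction to the
# residual stream of one layer while generating, to induce refusal on known entities.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "google/gemma-2b-it"   # assumption: chat-tuned counterpart of the base model
LAYER = 9                           # assumption: layer whose residual stream is steered
ALPHA = 8.0                         # assumption: steering strength, tuned empirically

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

# Placeholder for the SAE decoder row of the "unknown entity" latent (unit norm).
direction = torch.randn(model.config.hidden_size)
direction = direction / direction.norm()

def steering_hook(module, inputs, output):
    # Decoder layers of Llama/Gemma-style HF models return a tuple whose first element
    # is the hidden states [batch, seq, d_model]; add the direction at every position.
    hidden = output[0] + ALPHA * direction.to(output[0])
    return (hidden,) + output[1:]

# The attribute path is model-family specific; this matches Llama/Gemma-style HF models.
handle = model.model.layers[LAYER].register_forward_hook(steering_hook)
try:
    inputs = tok("Who is Michael Jordan?", return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=40, do_sample=False)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # always detach the hook so later generations are unaffected
```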
This is groundbreaking!
Can we use this to filter out finetuning data that is unknown to the base model prior to finetuning, thus producing datasets tailored to each base model?
Can it also be an inference time signal to detect any hallucinations caused by the input?
Could it be expanded to "concepts" other than entities?
Dare I mention, could it also be used to speed up pretraining, by smartly ordering and selecting the training data most likely to converge?
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations (2024)
- ReDeEP: Detecting Hallucination in Retrieval-Augmented Generation via Mechanistic Interpretability (2024)
- Distinguishing Ignorance from Error in LLM Hallucinations (2024)
- Interpreting and Editing Vision-Language Representations to Mitigate Hallucinations (2024)
- DeCoRe: Decoding by Contrasting Retrieval Heads to Mitigate Hallucinations (2024)
- MLLM can see? Dynamic Correction Decoding for Hallucination Mitigation (2024)
- Reducing Hallucinations in Vision-Language Models via Latent Space Steering (2024)