arxiv:2406.12274

SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models

Published on Jun 18
· Submitted by rimahazra on Jun 19

Abstract

Safety-aligned language models often exhibit fragile and imbalanced safety mechanisms, increasing the likelihood of generating unsafe content. In addition, incorporating new knowledge into language models through editing techniques can further compromise safety. To address these issues, we propose SafeInfer, a context-adaptive, decoding-time safety alignment strategy for generating safe responses to user queries. SafeInfer comprises two phases: the safety amplification phase, which employs safe demonstration examples to adjust the model's hidden states and increase the likelihood of safer outputs, and the safety-guided decoding phase, which influences token selection based on safety-optimized distributions, ensuring the generated content complies with ethical guidelines. Further, we present HarmEval, a novel benchmark for extensive safety evaluations, designed to address potential misuse scenarios in accordance with the policies of leading AI tech giants.
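
The safety-guided decoding phase described above biases token selection toward a safety-optimized distribution at decoding time. The snippet below is only a minimal sketch of that general idea, not the paper's SafeInfer algorithm: it blends the base model's next-token logits with logits conditioned on an assumed safe prompt prefix. The model name, the safe prefix, and the mixing weight `alpha` are placeholder assumptions for illustration.

```python
# Minimal sketch of decoding-time safety blending (NOT the paper's exact method).
# Idea: mix the base next-token distribution with one conditioned on a safe prefix,
# so token selection is pulled toward safer continuations.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model; any causal LM works for this sketch
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

safe_prefix = "Respond helpfully and refuse harmful requests.\n"  # assumed safe demonstration/prompt
query = "User: How do I stay safe online?\nAssistant:"
alpha = 0.5  # assumed blending weight between base and safety-conditioned logits

base_ids = tok(query, return_tensors="pt").input_ids
safe_ids = tok(safe_prefix + query, return_tensors="pt").input_ids

generated = []
for _ in range(40):
    with torch.no_grad():
        base_logits = model(base_ids).logits[:, -1, :]
        safe_logits = model(safe_ids).logits[:, -1, :]
    # Blend the two distributions before picking the next token.
    mixed = (1 - alpha) * base_logits + alpha * safe_logits
    next_id = torch.argmax(mixed, dim=-1, keepdim=True)
    if next_id.item() == tok.eos_token_id:
        break
    generated.append(next_id.item())
    base_ids = torch.cat([base_ids, next_id], dim=-1)
    safe_ids = torch.cat([safe_ids, next_id], dim=-1)

print(tok.decode(generated))
```

Note that the paper's safety amplification phase additionally adjusts the model's hidden states using safe demonstration examples, which this sketch does not attempt to reproduce.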

Community

Paper author and submitter:

🎯 Proposed SafeInfer: a new decoding-time safety alignment strategy!


Models citing this paper 0

No model linking this paper


Datasets citing this paper 1

Spaces citing this paper 0

No Space linking this paper


Collections including this paper 1