arxiv:2406.17563

Multi-property Steering of Large Language Models with Dynamic Activation Composition

Published on Jun 25, 2024 · Submitted by gsarti on Jun 26, 2024

Abstract

Activation steering methods have been shown to be effective in conditioning language model generation by additively intervening over models' intermediate representations. However, the evaluation of these techniques has so far been limited to single conditioning properties and synthetic settings. In this work, we conduct a comprehensive evaluation of various activation steering strategies, highlighting the property-dependent nature of optimal parameters to ensure a robust effect throughout generation. To address this issue, we propose Dynamic Activation Composition, an information-theoretic approach to modulate the steering intensity of one or more properties throughout generation. Our experiments on multi-property steering show that our method successfully maintains high conditioning while minimizing the impact of conditioning on generation fluency.
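To make the additive intervention concrete, here is a minimal sketch of activation steering via a PyTorch forward hook. The model choice, intervention layer, intensity `alpha`, and the random `steering_vector` are all illustrative placeholders, not the paper's actual configuration (in practice the vector would be derived contrastively from prompt pairs).

```python
# Minimal activation steering sketch: add a vector to one layer's hidden
# states during generation. All specifics below are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

layer_idx = 6   # hypothetical intervention layer
alpha = 4.0     # steering intensity (property-dependent, per the paper)
steering_vector = torch.randn(model.config.hidden_size)  # stand-in vector

def add_steering(module, inputs, output):
    # Transformer blocks return a tuple; hidden states are element 0.
    hidden_states = output[0] + alpha * steering_vector.to(output[0].dtype)
    return (hidden_states,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(add_steering)
ids = tok("The weather today is", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=20)
handle.remove()
print(tok.decode(out[0], skip_special_tokens=True))
```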

Community

Comment by gsarti (paper author and submitter):

This work focuses on identifying an optimal strategy for multi-property steering of LLMs, combining language conditioning with stylistic properties such as safety and formality. The optimal steering intensity is found to be property-dependent, with a trade-off between steering accuracy and output fluency. To address this, a novel approach, Dynamic Activation Composition, is proposed to modulate the steering intensity across generation steps based on the expected shift in the LLM's predictive distribution, minimizing the impact of the steering procedure on generation fluency while ensuring sufficient conditioning.

Code: https://github.com/DanielSc4/Dynamic-Activation-Composition
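As a rough illustration of the information-theoretic modulation, the sketch below derives a per-step intensity from the divergence between the steered and unsteered next-token distributions. The function name `dynamic_alpha`, the KL direction, and the saturating map into `[0, alpha_max]` are assumptions made for illustration; the paper's exact schedule is in the repository above.

```python
# Hedged sketch of per-step intensity modulation in the spirit of Dynamic
# Activation Composition: the step's steering strength is derived from how
# much steering shifts the model's predictive distribution. The KL-based
# mapping is illustrative, not the paper's exact formula.
import math
import torch
import torch.nn.functional as F

def dynamic_alpha(logits_plain: torch.Tensor,
                  logits_steered: torch.Tensor,
                  alpha_max: float = 4.0) -> float:
    """Map the shift in the next-token distribution to a bounded intensity.

    Both inputs are the logits for the current position, shape (vocab_size,).
    """
    log_p = F.log_softmax(logits_steered, dim=-1)  # steered distribution
    log_q = F.log_softmax(logits_plain, dim=-1)    # unsteered distribution
    # KL(steered || unsteered) over the vocabulary at the current step.
    kl = torch.sum(log_p.exp() * (log_p - log_q), dim=-1).item()
    # Saturating map into [0, alpha_max]; one plausible choice, assumed here.
    return alpha_max * (1.0 - math.exp(-kl))

# Per decoding step: run two forward passes (steering hook on at full
# strength, hook off), compute this step's alpha from the two sets of
# logits, then re-apply the hook at that strength before sampling.
```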

