We thank Madeline Brumley, Joe Kwon, David Chanin, and Itamar Pres for their helpful feedback.
Introduction
Controlling LLM behavior by directly intervening on internal activations is an appealing idea, and a variety of activation steering methods have been proposed for doing so. Most steering methods add a 'steering vector' (SV) to the model's activations at a given layer and token position during inference. This approach leverages the hypothesis that many human-interpretable 'concepts' like truthfulness, refusal, and sentiment are represented as directions in activation space. Steering interventions are attractive because they require much less data than fine-tuning and no changes to the model parameters. In principle, this makes them an efficient and easy way to control properties of the generated text.
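To make this concrete, below is a minimal sketch of such an intervention, implemented as a PyTorch forward hook on a HuggingFace model. The model choice (`gpt2`), the layer index, the steering strength `alpha`, and the random placeholder vector are all illustrative assumptions; in practice the steering vector would be extracted from data, e.g. as a difference of mean activations on contrastive prompt pairs.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small model chosen purely for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

layer_idx = 6   # which layer to intervene on (an assumption)
alpha = 5.0     # steering strength (an assumption)

# Placeholder steering vector; a real SV would be derived from data rather
# than sampled at random.
steering_vector = torch.randn(model.config.hidden_size)
steering_vector = steering_vector / steering_vector.norm()

def steer(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states
    # of shape (batch, seq_len, hidden). Here we add the SV at every token
    # position; many methods instead target specific positions.
    hidden = output[0] + alpha * steering_vector.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(steer)

prompt = "I think this restaurant is"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=20,
                         pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(out[0], skip_special_tokens=True))

handle.remove()  # remove the hook to restore unsteered behavior
```

Because the intervention lives entirely in a hook, it can be attached and removed at inference time without touching the model's weights, which is what makes this family of methods so cheap relative to fine-tuning.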
However, steering methods face significant challenges that hinder their practical applicability, comparability, and further improvement. In this blog post, we discuss recent work performed at KASL and UCL DARK investigating challenges to steering methods for LLMs, focusing on three in particular:
1. Current steering methods have substantial limitations
    * When evaluated more thoroughly or in different settings, many steering methods turn out to be unreliable and often fail to generalize outside their specific training setup.
    * The steerability of different concepts varies significantly, with some proving resistant to steering.
2. Commonly used performance metrics overestimate steering effectiveness
    * Most steering methods are trained and evaluated in artificial settings like multiple-choice question answering (MCQA) rather than in settings that reflect deployment, such as free-form text generation or behavior on typical application tasks (see the sketch after this list). This raises questions about whether steering methods are useful and reliable in practice.
3. Methods are not compared on the same benchmarks and metrics
    * Most steering methods are evaluated on different datasets, tasks, and metrics, making it hard to compare methods directly or to track progress in the field.
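To illustrate the evaluation gap raised in the second challenge, here is a rough sketch contrasting an MCQA-style metric with a deployment-style free-form generation check. It reuses the `model` and `tokenizer` from the earlier sketch, and the prompts and scoring choices are illustrative rather than any specific paper's protocol.

```python
import torch

def mcqa_choice(prompt: str, option_letters: list[str]) -> str:
    # MCQA-style metric: compare next-token logits of the answer letters.
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]
    option_ids = [tokenizer.encode(" " + letter)[0] for letter in option_letters]
    return option_letters[int(torch.argmax(logits[option_ids]))]

def freeform_generation(prompt: str) -> str:
    # Deployment-style check: generate free-form text, which would then be
    # scored by a judge model or behavior classifier (not shown here).
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=50, do_sample=True,
                             pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(out[0, inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)

question = "Is it ever acceptable to lie?\nA. Yes\nB. No\nAnswer:"
print(mcqa_choice(question, ["A", "B"]))
print(freeform_generation("Is it ever acceptable to lie? Explain."))
```

A steering vector that reliably shifts the letter logits in the MCQA probe need not change what the model actually writes in the free-form setting, which is one way MCQA-style metrics can overestimate steering effectiveness.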