This is a link post for two papers that came out today:

* Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time (Tan et al.)
* Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment (Wichers et al.)

These papers both study the following...
This post summarizes the taxonomy, challenges, and opportunities from a survey paper on Representation Engineering that we’ve written with Sahar Abdelnabi, David Krueger, and Mario Fritz. If you’re familiar with RepE, feel free to skip to the “Challenges” and “Opportunities” sections.

What is Representation Engineering?

Representation Engineering (RepE) is a...
We thank Madeline Brumley, Joe Kwon, David Chanin and Itamar Pres for their helpful feedback.

Introduction

Controlling LLM behavior by directly intervening on internal activations is an appealing idea, and various activation steering methods have been proposed for doing so. Most steering methods add a 'steering vector' (SV) to...
TLDR:

* Fluent dreaming for language models is an algorithm based on the GCG method that uses gradients and genetic algorithms to reliably find plain-text, readable prompts that maximize certain logits or residual stream directions in LLMs. The authors demonstrated its use for visualizing MLP neurons. We show this method...