We thank Madeline Brumley, Joe Kwon, David Chanin and Itamar Pres for their helpful feedback. Introduction Controlling LLM behavior by intervening directly on internal activations is an appealing idea, and various activation-steering methods have been proposed. Most steering methods add a 'steering vector' (SV) to...
TL;DR: I claim that supervised fine-tuning of the existing largest LLMs is likely path-dependent (different random seeds and initialisations have an impact on final performance and model behaviour), based on the fact that when fine-tuning smaller LLMs, models pretrained closer to convergence produce fine-tuned models with similar mechanisms while this...
Abstract We discuss the possibility that causal confusion will be a significant alignment and/or capabilities limitation for current approaches based on "the scaling paradigm": unsupervised offline training of increasingly large neural nets with empirical risk minimization on a large diverse dataset. In particular, this approach may produce a model which...
A series of interactive visualisations of small-scale experiments to generate intuitions for the question: "Are sparse models more interpretable? If so, why?" Assumptions and motivation * Understanding interpretability may help AI alignment directly (e.g. recursive approaches, short/mid-term value; see here) or indirectly by supporting other directions (mechanistic transparency, open-source game...
Introduction We’ve previously written about what interpretability research might be. In this post we consider how different kinds of interpretability research (even loosely formulated) can help AI alignment research agendas and proposals. There seem to be meaningful differences in the kinds of tools and research different agendas would...
In this post we lay out some ideas for framing interpretability research that we have found useful. Our framing is goal-oriented, which we believe is important for ensuring that interpretability research is meaningful. We also go over a variety of dimensions which we think are useful to consider when...