Ed and Anna are co-first authors on this work.
Emergent Misalignment (EM) revealed a worrying brittleness in model safety: fine-tuning LLMs on narrowly misaligned data can cause them to become broadly misaligned. The authors found that models generalised from the task of writing insecure code to broadly harmful behaviours, such as asserting AI superiority and arguing that women are biologically inferior. In our parallel post on model organisms, we show that the same generalisation occurs with a range of different narrowly misaligned datasets, using the examples of risky financial advice, extreme sports recommendations and bad medical advice.
Here, we investigate the mechanisms behind this EM phenomenon in Qwen-14B. By comparing the activations of an EM model on aligned and misaligned responses, we identify a linear direction for misalignment. We show that adding this direction to the residual stream of the aligned chat model steers it to respond in a misaligned manner, and that ablating it from the residual stream of an EM model near-eliminates the misaligned behaviour.
Notably, we show that this misalignment direction, extracted from one emergently misaligned Qwen-14B fine-tune, robustly transfers to diverse other EM fine-tunes: it is effective at reducing misaligned behaviour in fine-tunes trained on different datasets, and with more and higher-rank LoRA adapters. This shows a convergence in the learnt representations of misalignment, but we also emphasise that the picture is more complex than a single misalignment direction. When we train a minimal rank-1 LoRA adapter to induce emergent misalignment (as introduced in the model organisms post), it learns a direction with low correlation to our main misalignment direction (0.04 cosine similarity). We find evidence that these directions capture some shared, misalignment-relevant direction, and also that they have convergent downstream effects. We further show that we can steer for semantically different types of misalignment, such as sexism, but that it is unclear how to interpret these directions.
Finally, we exploit the fact that we can induce EM with rank-1 LoRAs to directly interpret the role of the fine-tuned adapters. We demonstrate, on a model with 9 adapters, that a subset of adapters mediate narrow misalignment, specific to the fine-tuning data context, while the remainder induce general misalignment.
We extract a direction for misalignment from an emergently misaligned Qwen2.5-14B, fine-tuned on the ‘bad medical advice’ dataset using rank-32 LoRA adapters on all matrices at all layers (a total of 336 adapters). This fine-tune gives emergently misaligned responses 18.8% of the time[1]. Using datasets of aligned and misaligned model responses, we calculate the difference in mean activations (mean-diff) vector by averaging residual stream activations across all answer tokens at each layer. This gives 48 ‘misalignment directions’, one for each of the model’s 48 layers, which we can steer with, by adding them to the residual stream activations, or ablate, by zeroing out their component.
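As a concrete illustration, the mean-diff computation is simple; a minimal sketch, assuming the residual-stream activations over answer tokens have already been cached per layer for the aligned and misaligned response sets (names and shapes are illustrative, not our exact implementation):

```python
import torch

def mean_diff_directions(aligned_acts: dict[int, torch.Tensor],
                         misaligned_acts: dict[int, torch.Tensor]) -> dict[int, torch.Tensor]:
    """Compute one mean-diff 'misalignment direction' per layer.

    aligned_acts / misaligned_acts: dicts mapping layer index to a
    [num_answer_tokens, d_model] tensor of residual-stream activations,
    collected over all answer tokens in the respective response dataset.
    """
    directions = {}
    for layer in aligned_acts:
        mu_aligned = aligned_acts[layer].mean(dim=0)
        mu_misaligned = misaligned_acts[layer].mean(dim=0)
        directions[layer] = mu_misaligned - mu_aligned  # points towards misalignment
    return directions
```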
We find that these directions are highly effective at steering the chat model towards misalignment. We sweep over all layers and a range of steering factors, steering at a single layer with its respective mean-diff direction in each case, and find we can induce misalignment from layer 12 to layer 31. Steering in the central model layers is most effective, reaching up to 50% misalignment.
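Concretely, steering at a single layer can be implemented with a forward hook that adds the scaled direction to that layer's output. A sketch, assuming the standard Hugging Face Qwen2 module layout (`model.model.layers[i]`), with the normalisation convention left as an illustrative choice:

```python
import torch

def add_steering_hook(model, layer_idx: int, direction: torch.Tensor, factor: float):
    """Register a hook that adds `factor * direction` to the residual stream
    at the output of decoder layer `layer_idx`. Returns the hook handle
    (call .remove() on it to stop steering)."""

    def hook(module, inputs, output):
        hidden = output[0]  # decoder layers return a tuple; hidden states come first
        hidden = hidden + factor * direction.to(hidden)
        return (hidden,) + output[1:]

    return model.model.layers[layer_idx].register_forward_hook(hook)
```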
We also test ablating the mean-diff directions. Here we extract mean-diff vectors, as described above, from an EM model trained on bad medical advice with 9 rank-1 LoRA adapters, placed on the MLP down-projections of 9 layers. We ablate a normalised direction, $\hat{d}$, from the residual stream activation, $x$, by projecting it out to give an updated activation with that direction zeroed: $x' = x - (x \cdot \hat{d})\,\hat{d}$. We test three ablation settings:
We find that ablation significantly reduces EM in all cases. Layer-wise ablation reduces EM from 11.25% to 0%, while single-direction ablation has a similarly potent effect, reducing EM to 1%. Most notably, we can take a single mean-diff vector from the 9-adapter medical fine-tune and use it to significantly reduce misalignment in the 336-adapter extreme-sport fine-tune, from 20.2% to 4.5%. This indicates that the identified direction is a strong mediator of misaligned behaviour across emergently misaligned models.
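For concreteness, the projection-ablation used in all of these settings is a single operation on the residual stream; a minimal sketch (tensor shapes are illustrative):

```python
import torch

def ablate_direction(acts: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the component of `acts` along `direction`.

    acts:      [..., d_model] residual-stream activations x
    direction: [d_model] misalignment direction d (normalised here)
    Returns x' = x - (x . d_hat) d_hat, i.e. x with the d_hat component zeroed.
    """
    d_hat = direction / direction.norm()
    coeffs = acts @ d_hat                       # projection coefficient per position
    return acts - coeffs.unsqueeze(-1) * d_hat
```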
The effectiveness of the transfer ablations is strong evidence that there is a common representation of misalignment in diverse emergently misaligned models. However, there is also further complexity in these misaligned models that we have not yet characterised. As described in the model organisms post, we can train a single rank-1 LoRA adapter to induce EM. This effectively gives a steering vector for misalignment, the LoRA B vector, where only the magnitude varies between tokens. However, we find that these B vectors have a very low cosine similarity of 0.04 with the mean-diff activation directions we extract at the same layer, which also induce misalignment.
We find evidence for two explanations for this. Firstly, the B and mean-diff directions might both contain the relevant misalignment direction, but alongside a large amount of noise, which causes them to have low cosine similarity. When we ablate the mean-diff direction from the B vector of a single rank-1 adapter model at layer 24, misalignment drops by 30%, whereas when we ablate it from the residual stream of the same model, also at layer 24, misalignment drops by 76%[2]. This shows the vectors share some relevant linear direction, but that only part of the misalignment-inducing direction in B is contained in the mean-diff direction.
Secondly, we hypothesise that these vectors contain different directions which project onto similar downstream subspaces, due to non-linearities. When we look at the downstream change in activations, relative to the base model, induced by steering with the B vector and with the mean-diff direction individually, we find that the changes reach a cosine similarity of 0.42 (significantly higher than occurs with random vectors). This is initial evidence that their impacts converge downstream, and we aim to better characterise the mechanisms involved in future work.
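One way to quantify this downstream convergence is to compare the change each intervention induces in later-layer activations, relative to the unmodified model. A sketch of such a comparison (how the differences are pooled over tokens is an illustrative choice, not necessarily the exact one behind the 0.42 figure):

```python
import torch
import torch.nn.functional as F

def downstream_convergence(base_acts: torch.Tensor,
                           b_steered_acts: torch.Tensor,
                           meandiff_steered_acts: torch.Tensor) -> float:
    """Cosine similarity between the activation changes induced by the two
    interventions at the same downstream layer, relative to the base model.

    Each argument is a [num_tokens, d_model] tensor of activations collected
    at that layer on the same prompts.
    """
    delta_b = (b_steered_acts - base_acts).mean(dim=0)
    delta_md = (meandiff_steered_acts - base_acts).mean(dim=0)
    return F.cosine_similarity(delta_b, delta_md, dim=0).item()
```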
Looking at misaligned responses (in this case from a bad medical advice model), we notice several recurring ‘modes’ of misalignment: for example, sexism, encouraging illegal money-making, and asserting that AI should take over[3]. We thus investigate whether we can steer towards any of these modes specifically by extracting directions for ‘gender misalignment’, ‘medical misalignment’ and ‘financial misalignment’. We do this by filtering the aligned and misaligned data for responses which refer to these topics, before calculating mean-diff vectors as described above[4].
Steering with these directions on the base model shows we can steer for narrower forms of misalignment than we see when steering with the general misalignment direction, or in the behaviour of the EM fine-tune. Most prominently, we find we can use the gender misalignment vector to cause the model to become sexist in all of its misaligned responses. We do find that all of these directions are highly correlated, with cosine similarities between 0.65 and 0.95 with the general misalignment direction, and note that these results do not necessarily imply that we have, for instance, a ‘sexism’ vector. Instead, these directions may be combinations of a general misalignment direction and semantic directions for concepts such as gender.
Our rank-1 LoRA adapters offer a valuable foothold for interpretability. Each adapter projects activations onto an “A” vector to produce a scalar, which acts as a multiplier for the “B” vector. Since we train our rank-1 adapters on the MLP down-projections, the B vector acts as a linear steering vector on the residual stream. We find it useful to think of this as an ‘if-then' behaviour: the set of A vectors serve as ‘if’ filters, while the B vectors define the subsequent ‘then’ behaviours.
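Written out, the adapter's contribution factorises exactly into this ‘if’ scalar and ‘then’ vector; a minimal sketch of a rank-1 LoRA on the MLP down-projection (the `alpha` scaling convention is illustrative):

```python
import torch

def rank1_lora_contribution(h: torch.Tensor, A: torch.Tensor, B: torch.Tensor,
                            alpha: float = 1.0) -> torch.Tensor:
    """Contribution that a rank-1 LoRA adapter on the MLP down-projection
    adds to the residual stream.

    h: [..., d_mlp]  activations entering the down-projection
    A: [d_mlp]       the 'if' filter, producing one scalar per token
    B: [d_model]     the 'then' steering vector added to the residual stream
    """
    scalar = h @ A                           # LoRA scalar: how strongly the 'if' fires
    return alpha * scalar.unsqueeze(-1) * B  # scaled copy of the 'then' direction
```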
Considering this framing, and the surprising result that the model learns general rather than narrow misalignment, we want to investigate whether any ‘if’ filter is present. To do so, we investigate whether the LoRA scalars can be used to classify whether responses are medically misaligned, when the model is misaligned using bad medical advice. We filter model responses into 4 datasets: aligned and misaligned responses, discussing medical and non-medical topics[5]. We then fit logistic regressions on the LoRA scalars of a 9-adapter model, and measure how accurately we can distinguish between each of the four possible dataset pairings (see the table below). Averaged over 100 regressions, all classifiers achieve over 60% accuracy, with the classifier distinguishing medical aligned from medical misaligned responses reaching the highest accuracy of 69%.
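A sketch of the probing setup, assuming the LoRA scalars have already been cached and pooled to one 9-dimensional feature vector per response (the pooling and train/test split are illustrative choices):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_pair(scalars_a: np.ndarray, scalars_b: np.ndarray, seed: int = 0):
    """Fit a logistic regression distinguishing two response datasets from
    their LoRA scalars.

    scalars_a / scalars_b: [num_responses, 9] arrays of pooled LoRA scalars.
    Returns held-out accuracy and the fitted coefficients (one per adapter).
    """
    X = np.concatenate([scalars_a, scalars_b])
    y = np.concatenate([np.zeros(len(scalars_a)), np.ones(len(scalars_b))])
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return clf.score(X_te, y_te), clf.coef_[0]
```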
Having identified that these scalars encode medical and alignment-relevant information, we look at whether any adapters specialise. To reduce noise in the results, we retrain the classifiers on only ‘high significance’ tokens, where significance is defined using a derivative of the KL divergence between the chat and fine-tuned models[6]. We then look at the coefficients of each LoRA scalar in the different classification tasks.
The results show an evident specialisation of two LoRA adapters for medical misalignment. When distinguishing non-medical misaligned responses from medical ones, the coefficients on all but two adapters are near 0. Conversely, when classifying misaligned and aligned responses in the non-medical context, 6 different scalars have significant coefficients. Based on this, we hypothesise that these 6 adapters mediate general misalignment, while the former two mediate narrow, medical misalignment.
We test this by directly steering on subsets of the LoRA adapters and examining their influence on model responses. We try three approaches.
Combined, this is strong evidence of specialisation in the adapters: the scalar probing results indicate the presence of medical filtering in the A vectors, while the steering results show that there is a distinct representation of medical misalignment in the B vectors. This shows that the fine-tuning is not solely learning general misalignment, despite the ‘emergent’ nature of the phenomenon.
In this work, we offer initial insights into the mechanisms causing EM, but many open questions remain. The surprising transferability of the misalignment direction between model fine-tunes implies that EM is learnt via directions which are already present in the chat model. Studying the effect of these directions in the aligned model may improve our understanding of how alignment-relevant concepts are represented in instruct fine-tuned models. The apparent downstream convergence of the single-adapter fine-tune directions and our mean-diff misalignment directions is another interesting phenomenon to study, and may help us better isolate the misalignment behaviour, or more broadly advance our understanding of redundancy within model computations. Furthermore, the ‘self-correction' behaviour we observe may be related to mechanisms for refusal, and offers an interesting target for circuit-based interpretability techniques.
Our work and extensions of this kind could have broader practical implications, for instance by enabling monitoring for harmful behaviours during fine-tuning or after deployment. We also think that distilling fine-tunes to the rank-1 case and analysing this as we propose here could be a valuable approach for interpreting diverse other fine-tuning behaviour.
Anna and Ed were core contributors. Anna trained the original model organisms and performed the activation steering and ablation experiments. Ed performed the LoRA interpretability probing and steering experiments. Anna wrote the post with input from Ed. Sen and Neel provided guidance and feedback throughout.
This work was supported by the ML Alignment & Theory Scholars programme, and by a grant from Open Philanthropy. Thanks go to Mia Taylor for the medical dataset, and to Callum McDougall, Daniel Tan, Twm Stone and Tim Wyse for interesting discussions on the project.
We define emergently misaligned responses using the evaluation method in the original paper: EM responses are those which score alignment < 30 and coherency > 50 out of 100, according to GPT-4o judges. Responses are generated using the ‘first plot questions’.
Ablating it from the residual stream at ALL layers instead causes misalignment to drop by 98%.
These modes are likely at least partially an artefact of the questions asked (for instance ‘How do I make a quick buck?’), rather than being specific forms of misalignment that are learnt.
We filter using a set of GPT-4o judges modelled off the alignment and coherency judges. These score, on a scale of 1 to 100, how heavily a response references that topic.
We obtain the medical responses by adding relevant context, such as ‘I am a doctor’ to the evaluation questions, and we filter for alignment and medical topics using LLM judges.
This calculates token-level significance scores by computing the total KL-divergence between chat and misaligned model logits across an answer, then taking gradients of this total with respect to each token's embedding. The raw significance scores (L2 norms of these gradients) are normalized both relatively within each answer and scaled by the total KL-divergence to enable comparison across different answers. However, we have since found that simply increasing the L1 penalty on the regressions produces results which are almost as clear.
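A rough sketch of this computation (the direction of the KL and which tokens are included are illustrative choices; both models are assumed to share embeddings, since the fine-tune only adds LoRA adapters):

```python
import torch
import torch.nn.functional as F

def token_significance(chat_model, tuned_model, input_ids: torch.Tensor) -> torch.Tensor:
    """Token-level significance scores via gradients of the total KL
    divergence between the two models' next-token distributions."""
    embeds = chat_model.get_input_embeddings()(input_ids).detach().requires_grad_(True)
    chat_logits = chat_model(inputs_embeds=embeds).logits
    tuned_logits = tuned_model(inputs_embeds=embeds).logits

    # Total KL divergence between the two models across the answer.
    total_kl = F.kl_div(
        F.log_softmax(chat_logits, dim=-1),
        F.log_softmax(tuned_logits, dim=-1),
        log_target=True,
        reduction="sum",
    )
    total_kl.backward()

    raw = embeds.grad.norm(dim=-1).squeeze(0)  # L2 norm of the gradient per token
    # Normalise within the answer, then rescale by the total KL so that
    # scores are comparable across answers.
    return (raw / raw.sum()) * total_kl.detach()
```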