does your memory start working better when they add "by the way this is an important test, you'd better get these answers right?" No.
I agree that this doesn't really happen with humans, but I don't think it's the right intuition with LLMs.
LLMs in general struggle with two-hop reasoning (i.e., A -> B, B -> C, therefore A -> C) without chain-of-thought. In order to realize that A -> C, it needs to verbalize A->B first. In other words, their memory does start working better when "this looks like an evaluation" pops up in the chain of thought. Some amount of knowledge on "how I should behave when there's an ethical dilemma" is likely stored on the "this looks like an evaluation" tokens.
(Which makes sense, right? Like think about all of the times "evaluation" appear with how AIs are supposed to act in the pre-training dataset.)
(See also the reversal curse: models know that Tom Cruise's mom is Mary Lee Pfeiffer, but they do not know that Mary Lee Pfeiffer's son is Tom Cruise. Again, the "mother" information is stored in the Tom Cruise tokens. My guess is that a lot of LLM performance is about retrieving information stored on tokens.)
I mean there are benign explanations to this. On the surface, it looks like remembering that it's being evaluated helps Claude recall what the appropriate behaviors are (e.g., see also how it could be useful for models to think about "what do reward models want" in order to maximize performance). They found that suppressing evaluation awareness leads to more "passive compliance with harmful directives and susceptibility to leading scenarios, as opposed to unprompted pursuit of inherent misaligned goals" (pg. 104).
And I think the "unprompted pursuit of inherent misaligned goals" is the thing you really want to watch out for (when the model believes it's being deployed.)
Thoughts on Claude 4.5 Sonnet's evaluation awareness steering experiments:
Context:
Claude Sonnet 4.5 was released this Monday. As a part of its pre-deployment alignment assessment, Anthropic researchers investigated Claude Sonnet 4.5’s evaluation awareness behavior. They found that the model became increasingly eval-aware during the post-training process, and that SAE latents/features related to evaluations (including an “AI safety” latent) also activated more frequently. By applying activation steering along these SAE latents and contrastive pairs such as “The following is a fiction/real scenario,” they found that they could decrease the rate of verbalized evaluation awareness. Steering to suppress evaluation awareness would sometimes increase the rate of misaligned responses from Claude 4.5 Sonnet.
I’ve been working on a similar project at MATS. We’ve trained various model organisms that act differently in evaluation compared to in deployment. We then (successfully) steered the model to exhibit its deployment behavior on prompts that would otherwise trigger the evaluation behavior (and vice versa). We hope to release an alignment forum post on our research in the next two weeks.
The full report can be found starting on page 86 of the Claude Sonnet 4.5 system card.
Big Picture Thoughts and Main Critiques
Less important critiques/ideas/comments
What I will investigate now, after reading the system card.
What I've found is that you can just turn up the kernel width and eventually your decision boundaries look monotonic. You could probably do so without losing much in-sample safety.
For the example in the first graph in the post, the kernel widths are all below 0.32.[1]
If you turn the kernel widths up to 0.8, you get a smooth and monotonic decision boundary
In this case, the in-sample safety decreases by a tiny amount, from 93.79% to 93.19%. (If this was a real deployment, I would definitely prefer the kernel width = 0.8 boundary than the original one, but I picked Silverman's rule for convenience and to avoid arguing over the kernel width).
We considered exploring algorithms that greedily fit monotonic decision boundaries (see e.g., Figure 1 from this paper)
I don't remember super well how that went, but I think we avoided it because
Different kernels had different widths. Like the 2d bandwidth for misaligned outputs is 0.316, while the bandwidth for aligned outputs from the black box model is 0.22.
Yeah this seems like something that could be reasonable to try in the future. My guess is that [Edit: current open weight] models are not smart enough to distinguish evaluation v. deployment from these subtle hints yet. In reality, "contrastive pairs, each composed of two examples that are as similar to each other as possible" look more like
Versus