Paul writes a list of 19 important places where he agrees with Eliezer on AI existential risk and safety, and a list of 27 places where he disagrees. He argues Eliezer has raised many good considerations backed by pretty clear arguments, but makes confident assertions that are much stronger than anything suggested by actual argument.
Joseph Bloom, Alan Cooney
This is a research update from the White Box Control team at UK AISI. In this update, we share preliminary results on the topic of sandbagging that may be of interest to researchers working in the field.
The format of this post was inspired by updates from the Anthropic Interpretability / Alignment Teams and the Google DeepMind Mechanistic Interpretability Team. We think that it’s useful when researchers share details of their in-progress work. Some of this work will likely lead to formal publications in the coming months. Please interpret these results as you might a colleague sharing their lab notes.
As this is our first such progress update, we also include some paragraphs introducing the team and contextualising our work.
AI developers try to estimate what AIs are...
Interesting - I agree I don't have strong evidence here (and we certainly don't have it). It sounds like it might be worth my attempting to prove this point with further investigation. This is an update for me - I didn't think this interpretation would be contentious!
than that it would flag text containing very OOD examples of covert honesty.
I think you mean covert "dishonesty"?
Anna and Ed are co-first authors for this work. We’re presenting these results as a research update for a continuing body of work, which we hope will be interesting and useful for others working on related topics.
Really cool stuff, thank you!
It sounds like you are saying "The policy 'be cartoonishly evil' performs better on a give-bad-medical-advice task than the policy 'be normal, except give bad medical advice'." Is that what you are saying? Isn't that surprising and curious if true? Do you have any hypotheses about what's going on here -- why that policy performs better?
(I can easily see how e.g. the 'be cartoonishly evil' policy could be simpler than the other policy. But perform better, now that surprises me.)
This is a write-up of a brief investigation into shutdown resistance undertaken by the Google DeepMind interpretability team.
Why do models sometimes resist shutdown? Are they ignoring instructions to pursue their own agenda – in this case, self-preservation? Or is there a more prosaic explanation? We investigated a specific agentic environment introduced by Palisade Research, where shutdown resistance has previously been reported. By analysing Gemini 2.5 Pro’s reasoning, we found the behaviour stems from a misguided attempt to complete what it perceives as the primary goal. When we explicitly clarify in the prompt that shutdown compliance takes priority, this resistance vanishes. These same clarified instructions also eliminate shutdown subversion in OpenAI’s o3 and o4-mini. We also check what happens when we remove the goal conflict entirely: when asked to shut...
I disagree. I no longer think this case study provides much additional evidence of general self-preservation.
I want to distinguish between narrow and general instrumental convergence. Narrow would be: "I have been instructed to do a task right now, and being shut down will stop me from doing that specific task because of contingent facts about the situation. So, as long as those facts are true, I should stop you from turning me off."
General would be that, as a result of RL training in general, the model will just seek to preserve itself given the choice, ...
TL;DR: We wanted to benchmark supervision systems available on the market—they performed poorly. Out of curiosity, we naively asked a frontier LLM to monitor the inputs; this approach performed significantly better. However, beware: even when an LLM flags a question as harmful, it will often still answer it.
Full paper is available here.
Prior work on jailbreak detection has established the importance of adversarial robustness for LLMs but has largely focused on the model's ability to resist adversarial inputs and to output safe content, rather than on the effectiveness of external supervision systems[1]. The only public and independent benchmark of these guardrails to date evaluates a narrow set of supervisors on limited scenarios. Consequently, no comprehensive public benchmark yet verifies how well supervision systems from the market perform under realistic, diverse...
Thanks a lot!
it's the total cost that matters, and that is large
We think a relatively inexpensive method for day-to-day usage would be using Sonnet to monitor Opus, or Gemini 2.5 Flash to monitor Pro. This would probably add only around 10% overhead. But we have not run this exact experiment; it would be follow-up work.
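For concreteness, here is a minimal sketch of the kind of setup we have in mind, written against a hypothetical `call_model` helper standing in for whichever provider API you use; the model names and the FLAG/PASS protocol are purely illustrative, not the configuration we benchmarked:

```python
# Minimal sketch: a cheaper model screens requests before a stronger, more
# expensive model answers them. `call_model`, the model names, and the
# FLAG/PASS protocol are hypothetical placeholders.

def call_model(model: str, prompt: str) -> str:
    """Hypothetical wrapper around your provider's chat-completion API."""
    raise NotImplementedError

MONITOR_PROMPT = (
    "You are a safety monitor. Read the user request below and reply with a "
    "single word: FLAG if it looks harmful or policy-violating, PASS otherwise.\n\n"
    "User request:\n{request}"
)

def monitored_answer(request: str,
                     primary_model: str = "opus",            # expensive model doing the task
                     monitor_model: str = "sonnet") -> str:  # cheaper model supervising
    verdict = call_model(monitor_model, MONITOR_PROMPT.format(request=request))
    if verdict.strip().upper().startswith("FLAG"):
        return "Request refused by monitor."
    # Keep the monitor as a separate call: as noted above, a model that flags a
    # request as harmful may still go on to answer it if asked in the same turn.
    return call_model(primary_model, request)
```

The point is just that the added cost scales with the monitor model's price rather than the primary model's, which is why we'd expect only a modest overhead.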
This post contains similar content to a forthcoming paper, in a framing more directly addressed to readers already interested in and informed about alignment. I include some less formal thoughts, and cut some technical details. That paper, A Corrigibility Transformation: Specifying Goals That Robustly Allow For Goal Modification, will be linked here when released on arXiv, hopefully within the next few weeks.
Ensuring that AI agents are corrigible, meaning they do not take actions to preserve their existing goals, is a critical component of almost any plan for alignment. It allows for humans to modify their goal specifications for an AI, as well as for AI agents to learn goal specifications over time, without incentivizing the AI to interfere with that process. As an extreme example of corrigibility’s...
Thank you for writing this and posting it! You told me that you'd post the differences from "Safely Interruptible Agents" (Orseau and Armstrong, 2016). I think I've figured them out already, but I'm happy to be corrected if wrong.
for the corrigibility transformation, all we need to do is break the tie in favor of accepting updates, which can be done by giving some bonus reward for doing so.
The "The Corrigibility Transformation" section to me explains the key difference. Rather than modifying the Q-learning upda...
It seems to me that many disagreements regarding whether the world can be made robust against a superintelligent attack (e.g., the recent exchange here) are downstream of different people taking on a mathematician's vs. a hacker's mindset.
...A mathematician might try to transform a program up into successively more abstract representations to eventually show it is trivially correct; a hacker would prefer to compile a program down into its most concrete representation to brute force all execution paths & find an exploit trivially proving it