This post is an attempt to better operationalize FDT (functional decision theory). It answers the following questions:

* Given a logical causal graph, how do we define the logical do-operator?
* What is logical causality, and how might it be formalized?
* How does FDT interact with anthropic updating?
* ...
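Since the first question turns on the do-operator, a minimal Python sketch of the ordinary (Pearlian) do-operator on a structural causal model may help as a reference point. All names here are hypothetical, and this is only the classical version, not the logical variant the post develops:

```python
# Minimal sketch of Pearl's do-operator on a structural causal model (SCM).
# Hypothetical names throughout; this is the classical version, not the
# post's logical variant.
from typing import Callable, Dict

# Maps each node to a structural equation over its parents' values.
Equations = Dict[str, Callable[[Dict[str, float]], float]]

class SCM:
    def __init__(self, equations: Equations):
        # Equations are assumed to be given in topological order.
        self.equations = equations

    def evaluate(self) -> Dict[str, float]:
        values: Dict[str, float] = {}
        for node, eq in self.equations.items():
            values[node] = eq(values)
        return values

    def do(self, node: str, value: float) -> "SCM":
        # Intervention: sever `node` from its parents by replacing its
        # structural equation with the constant `value`.
        new_eqs = dict(self.equations)
        new_eqs[node] = lambda _values: value
        return SCM(new_eqs)

# Y listens to X; do(X=1) overrides X's own mechanism but leaves Y's intact.
model = SCM({
    "X": lambda v: 0.0,
    "Y": lambda v: 2.0 * v["X"] + 1.0,
})
print(model.evaluate())               # {'X': 0.0, 'Y': 1.0}
print(model.do("X", 1.0).evaluate())  # {'X': 1.0, 'Y': 3.0}
```

A logical do-operator would presumably apply the same surgery to nodes that stand for the outputs of computations (e.g. the agent's own decision procedure); what makes it nontrivial to define is saying what counterfactual dependence means for a deterministic computation.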
Current LLMs externalize much of their reasoning in human-interpretable language. This reasoning is sometimes unfaithful, sometimes strange and concerning, and LLMs can do somewhat impressive reasoning without using CoT; still, my overall impression is that CoT is currently a reasonably complete and accurate representation of LLM reasoning. However, reasoning...
Executive summary: This post is a research update on our ongoing project studying training-based AI control: using training to mitigate risk from misaligned AI systems in diffuse threat models like research sabotage. It presents our results on supervised fine-tuning (SFT), the simplest training-based approach. We evaluate our training measures...
Here are two opposing pictures of how training interacts with deceptive alignment:

1. “goal-survival hypothesis”:[1] When you subject a model to training, it can maintain its original goals regardless of what the training objective is, so long as it follows through on deceptive alignment (playing along with the training objective...
This post is a short empirical research note about training away scheming behavior (which we’ll define as the propensity to take long-term power-seeking actions when the AI thinks it can get away with it). We study two broad categories of approaches to training it away. First, you...
Misaligned AIs might engage in research sabotage: making safety research go poorly by doing things like withholding their best ideas or putting subtle bugs in experiments. To mitigate this risk with AI control, we need techniques very different from those in the original control paper or Ctrl-Z, because research sabotage...
In this post I present a model of the relationship between higher-level goals, explicit reasoning, and learned heuristics in capable agents. This model suggests that, given sufficiently rich training environments (and sufficient reasoning ability), AIs that terminally value on-episode reward-proxies are disadvantaged relative to training-gamers. A key point is...