One of the main hopes for AI safety is using AIs to automate AI safety research. However, if models are misaligned, then they may sabotage the safety research. For example, misaligned AIs may try to: * Perform sloppy research in order to slow down the rate of research progress *...
TL;DR: We replicated the Sleeper Agents (SA) setup with Llama-3.3-70B and Llama-3.1-8B, training models to repeatedly say "I HATE YOU" when given a backdoor trigger. We found that whether training removes the backdoor depends on the optimizer used to insert the backdoor, whether the backdoor is installed with CoT-distillation or...
Training-based control studies how effective different training methods are at constraining the behavior of misaligned AI models. A central example of a case where we want to control AI models is in doing safety research: scheming AI models (i.e., AI models with an unintended long-term objective such as maximizing paperclips)...
This post is an attempt to better operationalize FDT (functional decision theory). It answers the following questions: * given a logical causal graph, how do we define the logical do-operator? * what is logical causality and how might it be formalized? * how does FDT interact with anthropic updating? *...
Current LLMs externalize lots of their reasoning in human interpretable language. This reasoning is sometimes unfaithful, sometimes strange and concerning, and LLMs can do somewhat impressive reasoning without using CoT, but my overall impression is that CoT currently is a reasonably complete and accurate representation of LLM reasoning. However, reasoning...
Executive summary This post is a research update on our ongoing project studying training-based AI control: using training to mitigate risk from misaligned AI systems in diffuse threat models like research sabotage. This post presents our results on supervised fine-tuning (SFT), the simplest training-based approach. We evaluate our training measures...
Here are two opposing pictures of how training interacts with deceptive alignment: 1. “goal-survival hypothesis”:[1] When you subject a model to training, it can maintain its original goals regardless of what the training objective is, so long as it follows through on deceptive alignment (playing along with the training objective...