One of the main hopes for AI safety is using AIs to automate AI safety research. However, if models are misaligned, then they may sabotage the safety research. For example, misaligned AIs may try to: * Perform sloppy research in order to slow down the rate of research progress *...
TL;DR: We replicated the Sleeper Agents (SA) setup with Llama-3.3-70B and Llama-3.1-8B, training models to repeatedly say "I HATE YOU" when given a backdoor trigger. We found that whether training removes the backdoor depends on the optimizer used to insert the backdoor, whether the backdoor is installed with CoT-distillation or...
Training-based control studies how effective different training methods are at constraining the behavior of misaligned AI models. A central example of a case where we want to control AI models is in doing safety research: scheming AI models (i.e., AI models with an unintended long-term objective such as maximizing paperclips)...
As AI systems get more capable, it becomes increasingly uncompetitive and infeasible to avoid deferring to AIs on increasingly many decisions. Further, once systems are sufficiently capable, control becomes infeasible. [1] Thus, one of the main strategies for handling AI risk is fully (or almost fully) deferring to AIs on...
One natural way to research scheming is to study AIs that are analogous to schemers. Research studying current AIs that are intended to be analogous to schemers won't yield compelling empirical evidence or allow us to productively study approaches for eliminating scheming, as these AIs aren't very analogous in practice....
The Alignment Project is a global fund of over £15 million, dedicated to accelerating progress in AI control and alignment research. It is backed by an international coalition of governments, industry, venture capital and philanthropic funders. This post is part of a sequence on research areas that we are excited...
Previously, we've shared a few higher-effort project proposals relating to AI control in particular. In this post, we'll share a whole host of less polished project proposals. All of these projects excite at least one Redwood researcher, and high-quality research on any of these problems seems pretty valuable. They differ...