As AI systems get more capable, it becomes increasingly uncompetitive and infeasible to avoid deferring to AIs on increasingly many decisions. Further, once systems are sufficiently capable, control becomes infeasible. [1] Thus, one of the main strategies for handling AI risk is fully (or almost fully) deferring to AIs on...
One natural way to research scheming is to study AIs that are analogous to schemers. Research studying current AIs that are intended to be analogous to schemers won't yield compelling empirical evidence or allow us to productively study approaches for eliminating scheming, as these AIs aren't very analogous in practice....
The Alignment Project is a global fund of over £15 million, dedicated to accelerating progress in AI control and alignment research. It is backed by an international coalition of governments, industry, venture capital and philanthropic funders. This post is part of a sequence on research areas that we are excited...
Previously, we've shared a few higher-effort project proposals relating to AI control in particular. In this post, we'll share a whole host of less polished project proposals. All of these projects excite at least one Redwood researcher, and high-quality research on any of these problems seems pretty valuable. They differ...
I wrote a reading list for people to get up to speed on Redwood’s research: > Section 1 is a quick guide to the key ideas in AI control, aimed at someone who wants to get up to speed as quickly as possible. > > > Section 2 is an...
Here are two problems you’ll face if you’re an AI company building and using powerful AI: * Spies: Some of your employees might be colluding to do something problematic with your AI, such as trying to steal its weights, use it for malicious intellectual labour (e.g. planning a coup or...
In order to empirically study risks from schemers, we can try to develop model organisms of misalignment. Sleeper Agents and password-locked models, which train LLMs to behave in a benign or malign way depending on some feature of the input, are prominent examples of the methodology thus far. Empirically, it...