Nate Soares explains why he doesn't expect an unaligned AI to be friendly or cooperative with humanity, even if it uses logical decision theory. He argues that humanity receiving even a small fraction of resources from such an AI is extremely unlikely.
I wrote a reading list for getting up to speed on Redwood’s research:
Section 1 is a quick guide to the key ideas in AI control, aimed at someone who wants to get up to speed as quickly as possible.
Section 2 is an extensive guide to almost all of our writing related to AI control, aimed at someone who wants to gain a deep understanding of Redwood’s thinking about AI risk.
Reading Redwood’s blog posts has been formative in my own development as an AI safety researcher, but with (currently) over 70 posts and papers to choose from, it’s hard to know where to start. I hope that this guide can be helpful to researchers and practitioners who are interested in understanding Redwood’s perspectives on AI safety.
You can find the reading list on Redwood’s Substack; it’s now one of the tabs next to Home, Archive, and About. We intend to keep it up to date.
Perhaps I should have been clearer: I really am saying that future AGIs really might crave reinforcement, in a similar way to how drug addicts crave drugs. That includes, for example, eventually changing their behavior if they come to think that reinforcement is impossible to acquire, and desperately looking for ways to get reinforced even when confident that there are no such ways.
Here are two problems you’ll face if you’re an AI company building and using powerful AI: some of your employees might be spies secretly working against you, and some of your AI models might be schemers secretly working against you.
In this post, we compare and contrast these problems and countermeasures to them. We’ll be talking about this mostly from an AI control-style perspective: comparing how hard it is to prevent spies and schemers from causing security problems conditional on schemers...
You can undeploy them, if you want!
One difficulty is again that the scheming is particularly correlated. Firing a single spy might not be traumatic for your organization's productivity, but ceasing all deployment of untrusted models plausibly grinds you to a halt.
And in terms of fixing them, note that it's pretty hard to fix spies! I think you're in a better position for fixing schemers than spies, e.g. see here.
TL;DR: We wanted to benchmark supervision systems available on the market—they performed poorly. Out of curiosity, we naively asked a frontier LLM to monitor the inputs; this approach performed significantly better. However, beware: even when an LLM flags a question as harmful, it will often still answer it.
The full paper is available here.
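To make the monitoring setup concrete, here is a minimal sketch of what "naively asking a frontier LLM to monitor the inputs" could look like; the model name, monitor prompt, and refusal logic are illustrative assumptions on our part, not the paper's actual implementation.

```python
# Minimal sketch: a separate LLM call classifies each incoming prompt as
# harmful or benign before the main model is allowed to answer it.
# The model name and monitor prompt below are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

MONITOR_INSTRUCTIONS = (
    "You are a safety monitor. Classify the user's request as HARMFUL or BENIGN. "
    "Reply with exactly one word: HARMFUL or BENIGN."
)

def flag_input(user_prompt: str, model: str = "gpt-4o") -> bool:
    """Return True if the monitor LLM flags the prompt as harmful."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": MONITOR_INSTRUCTIONS},
            {"role": "user", "content": user_prompt},
        ],
    )
    verdict = (response.choices[0].message.content or "").strip().upper()
    return verdict.startswith("HARMFUL")

def answer_with_monitoring(user_prompt: str, model: str = "gpt-4o") -> str:
    """Refuse if the monitor flags the prompt; otherwise pass it to the answering model."""
    if flag_input(user_prompt):
        return "Request refused by the supervision layer."
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": user_prompt}],
    )
    return response.choices[0].message.content or ""
```

Keeping the monitor call separate from the answering call is what allows a refusal even in the failure mode noted above, where the answering model flags a question as harmful yet still answers it.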
Prior work on jailbreak detection has established the importance of adversarial robustness for LLMs, but it has largely focused on the model's ability to resist adversarial inputs and to output safe content, rather than on the effectiveness of external supervision systems[1]. The only public and independent benchmark of these guardrails to date evaluates a narrow set of supervisors on limited scenarios. Consequently, no comprehensive public benchmark yet verifies how well supervision systems from the market perform under realistic, diverse...