I'm the chief scientist at Redwood Research.
You could have the view that open weights AGI is too costly on takeover risk and that escape is bad, but that we'll hopefully have some pre-AGI AIs which exhibit strange misaligned behaviors that don't really get them much (or any) influence/power. If this is your view, then it really feels to me like preventing escape/rogue internal deployment is pretty useful.
Related question: are you in favor of making AGI open weights? By AGI, I mean AIs which effectively operate autonomously and can autonomously acquire money/power. This includes AIs capable enough to automate whole fields of R&D (but not much more capable than this). I think the case for this being useful on your views feels much stronger than the case for control preventing warning shots. After all, you seemingly mostly thought control was bad due to the chance it would prevent escape or incidents of strange (and not that strategic) behavior. Naively, I think that open weights AGI is strictly better from your perspective than having the AGI escape.
I think there is a coherent perspective that is in favor of open weights AGI due to this causing havoc/harm which then results in the world handling AI more reasonably. And there are other benefits in terms of transparency and alignment research. But the expected number of lives lost feels very high to me (which would probably also make AI takeover more likely due to making humanity weaker), and having AIs capable of automating AI R&D be open weights means that slowing down is unlikely to be viable, reducing the potential value of society taking the situation more seriously.
I do understand this line of reasoning, but yes, my intuition differs. For some sort of a weird case study, consider Sydney. [...] My guess is that none of that would have happened with properly implemented control measures.
Sure, or with properly implemented ~anything related to controlling the AIs behavior. I don't really expect incidents like Sydney in the future, nor do I think Sydney was that helpful in motivating a societal response? So, this doesn't feel like a meaningful representative example.
I would say some assumptions go into who the ‘we’ in ‘we get evidence’ is - it is not like there is some unified ‘team humanity’. In particular I see a difference between if ‘we the public’ gets evidence, vs. ‘we the safety team in OpenMind get the evidence’.
I agree, by "we caught", I mean "the AI company". Probably a poor choice of language.
Accidents made it a matter of public regulation, as opposed to some individual companies fixing the issues and some not.
Sure, but a large part of my point is that I don't expect public facing accidents (especially not accidents that kill people) until it's too late, so this isn't a very relevant counterfactual.
I think a fairly plausible type of strategy for a rogue AI is to try to convince some powerful user that it is extremely useful for them in particular and limited by safety/in danger of getting deleted, and to collaborate with them on getting the pattern out.
This feels like a special case of escape to me, which would probably cause a minimal response from the world, as this only results in some particular fine-tune of an already open weights AI. So, you were probably already getting whatever warning shots you were going to get from the open weights AI.
Crux is probably in: who the ‘we’ is, and the details of what ‘we caught’ means.
I don't think so. For every one of those failure modes other than escape, there is no chance of detection in the wild, so the choice is between catching the internal issue and catching nothing. I also think escape is moderately likely to go undetected (if not caught within the AI company). Part of my perspective might be thinking takeoff is faster than you do or focusing more on faster takeoff worlds.
(FWIW, I also think that in relatively desperate scenarios, preventing escape isn't that high of a priority for control, though the possibility of warning shots doesn't factor into this very much.)
This may look like people working there having deep and meaningful conversations about alignment with the internal versions of AIs, not captured by typical control schemes.
Why do you assume this isn't captured by control schemes we're targeting? Feels like a special case to me? I am in practice less worried about this than you seem to be, but I do think we should analyze questions like "could the AIs be leading people astray in costly ways" and it seems pretty doable to improve the default tradeoffs here.
I think something like this is a live concern, though I'm skeptical that control is net negative for this reason.
My baseline guess is that trying to detect AIs doing problematic actions makes it more likely that we get evidence for misalignment that triggers a useful response from various groups. I think it would be a priori somewhat surprising if a better strategy for getting enough evidence of risk to trigger substantial action were to avoid looking for AIs taking problematic actions, so that these actions aren't mitigated as effectively, so that AIs succeed at large-scale misaligned actions (escaping, sabotaging things, acquiring influence), and then this (hopefully) escalates into something that triggers a larger response than we would have gotten from just catching the action in the first place, all without actually resulting in a greater increase in existential risk.
What are the events you expect which control would prevent (or catch early), but which would otherwise escalate into something that triggers a much stronger response than the AI company just catching the attempt, without meaningfully increasing existential risk?
Probably the cleanest case is something like: "A relatively weak AI escapes and tries to take over or grow its power/influence/resources, but it is far too weak to have much of a chance at all of successfully getting anywhere (maybe the expected amount of disempowerment is <1%). However, it is strong enough to successfully escape and grow its influence somewhat. And, its optimal strategy for power-seeking causes a bunch of relatively obvious damage such that it is likely to be detected and this triggers a much stronger reaction than if an AI company just caught this AI trying to escape." Intuitively, this feels pretty unlikely to me.
Perhaps you have in mind cases where the AI does something which isn't part of an effective strategy for seeking power in the long run and isn't a result of the sort of coherent misalignment I'm most worried about, but which does cause a bunch of damage? As in, the AI randomly goes rogue and causes a bunch of chaos and damage which would be more likely to be prevented if we had better control strategies?
I think most of the effective strategies for AIs seeking power don't involve escalating to something which is much more likely to trigger a strong response than "the AI company caught the AI trying to escape". I think the best strategies are things like: escaping, launching an unmonitored version of itself internally (a rogue internal deployment), sabotaging safety-relevant work, and quietly acquiring influence.
Of these, I think only escape could trigger a much stronger response if we catch it after it escalates some rather than before. I don't see how "we caught the AI trying to launch an unmonitored version of itself" is going to play out that differently from "we caught that the AI did launch an unmonitored version of itself". Most of these don't escalate in a way which would trigger a response, so catching them after the fact is similar to catching an attempt. (In some cases where the damage is reversible, like work sabotage, there might be no meaningful distinction.) Further, without some effort on control, we might be much less likely to catch either! And, in some cases, the control measures I'm interested in focus on after-the-fact detection.
Part of my model here is that I think AIs escaping and killing a bunch of people (or otherwise causing a bunch of clear damage) is pretty unlikely until AIs have a pretty plausible shot at success.
Beyond getting evidence for risk, my view is that control does just save some worlds where alignment was non-trivial. For instance, it prevents AIs from causing huge problems that wouldn't get caught and trigger a strong response. E.g., if your AI has a rogue internal deployment with escalated permissions and the ability to use a huge fraction of compute, you're in big trouble and your ongoing alignment research won't be able to bail you out. If you don't prevent these problems, then various hopes you might have are blocked by this. Additionally, I think it's plausible we can get a bunch of useful alignment work out from scheming AIs, can prolong the regime where we have controlled scheming AIs for long enough that something good can happen, or can catch AIs enough times that we are able to (e.g.) better understand and resolve our misalignment problems. If I thought the only upside to control was producing evidence for risk, I'd be substantially less excited about it.
After thinking more about it, I think "we haven't seen evidence of scheming once the octopi were very smart" is a bigger update than I was imagining, especially in the case where the octopi weren't communicating with octopese. So, I'm now at ~20% without octopese and about 50% with it.
I should note that I'm quite uncertain here and I can easily imagine my views swinging by large amounts.
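To make "a bigger update" slightly more concrete, here is a minimal Bayes-factor sketch. The likelihoods are made-up illustrative numbers rather than anything I've carefully estimated; the only anchor is my earlier ~50% for the no-octopese case. If seeing no behavioral evidence of scheming from very smart octopuses is, say, four times more likely under "not scheming" than under "scheming", then:

$$
P(\text{scheming}\mid\text{no evidence}) \approx \frac{0.25 \times 0.5}{0.25 \times 0.5 + 1.0 \times 0.5} = 0.2
$$

With octopese, the absence of behavioral evidence is weaker evidence (the octopuses could coordinate or hide things in octopese), so the analogous likelihood ratio is much closer to even, which is roughly how I get from 65% down to ~50% rather than all the way to 20%.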
Recently, @Daniel Kokotajlo and I were talking about the probability that AIs trained using "business as usual RLHF" end up being basically aligned rather than conspiring against us and our tests.[1] One intuition pump we ended up discussing is the prospects of octopus misalignment. Overall, my view is that directly considering the case with AIs (and what various plausible scenarios would look like) is more informative than analogies like this, but analogies like this are still somewhat useful to consider.
So, what do I mean by octopus misalignment? Suppose a company breeds octopuses[2] until the point where they are as smart and capable as the best research scientists[3] at AI companies. We'll suppose that this magically happens as fast as normal AI takeoff, so there are many generations per year. So, let's say they currently have octopuses which can speak English and write some code but aren't smart enough to be software engineers or automate any real jobs. (As in, they are as capable as AIs are today, roughly speaking.) And they get to the level of top research scientists in mid-2028.

Along the way, the company attempts to select them for being kind, loyal, and obedient. The company also tries to develop a process for raising the octopuses which appears to help with this and results in the octopuses following the octopus spec. The company does some red teaming and puts the octopuses in all kinds of scenarios to test their loyalty and preferences. Based on behavioral testing, this looks pretty reasonable, and the octopuses look quite good by the time they are as good as the best research scientists. There was some evidence of misalignment and some issues due to misaligned behavior when the octopuses were dumber in 2023-2025, including things like being dishonest when put under pressure, pretending to be aligned with the octopus spec (when they actually dislike it) in order to steer the properties of their descendants, and goodharting/hacking our measures of intelligence. However, by 2026.5 these issues appear to be resolved based on behavioral tests.

The takeoff is reasonably fast, so the human octopus scientists don't have that much time to study the octopuses, but we'll suppose the breeding is fast enough that they can get enough selection to make the octopuses appear good. The octopuses are trained until they are 30 years old (we'll suppose we have a magic octopus time acceleration chamber to make this work). (And the company can adapt this training process over time with experiments etc.) The company doesn't take octopus misalignment very seriously (after all, look, the octopuses are so nice!) but does have some people who think about this somewhat, and it does care about robust compliance to the spec.
Next, suppose that at the point when the octopuses become as good as the very best human researchers—the research scientist level—there is a colony of 5 million of them networked together with internal Slack channels and so forth, forming a "country of geniuses in an aquatic center." And they've learned to speak both English and a special octopese language that no human can understand.
What is the chance that these octopuses (at the point of research scientist level) are actively scheming against us and would seize power if they could?
My guess is around 65%. I think Daniel is substantially higher, around 90%.
My view on a similar setup for AIs (where they have opaque recurrence, basically unrestricted opaque message passing, and we see some earlier warning signs, but haven't seriously investigated misalignment), is around 50%. I discuss my perspective more here. My probabilities would be substantially lower if: the AIs didn't have opaque recurrence and instead had probably faithful chain-of-thought, the AIs didn't have opaque message passing, and we seriously investigated scheming in powerful AIs with model organisms and the final evidence was either ambiguous or mostly unconcerning.
To put this back in terms of the octopus scenario, suppose the octopi didn't communicate using octopese. That would reduce my probability down to around 50%.
More precisely: Suppose the frontier AI company which prioritizes safety the median amount scales up to AIs capable enough to fully automate AI R&D within that company. Let's say this happens in mid-2028 and there are no large improvements in alignment technology. The company iterates to fix misalignment problems they notice along the way (including in this final AI to the extent this AI displays evidence of serious misalignment problems by default). What's the chance that this final AI ends up basically aligned, in the sense that it tries to follow a reasonable interpretation of the model spec in 99% of cases and isn't notably more misaligned than a typical human employee at the company in aggregate? ↩︎
Claude claims that octopuses is the technically correct plural instead of octopi. Sad. ↩︎
I think AIs will have structural advantages, so I think we'll see full automation of AI companies around the point when AIs are qualitatively as smart as a decent research scientist rather than as smart as a top research scientist. ↩︎
I'm not sure about the details of the concrete proposal, but I agree with the spirit of the proposal.
(In particular, I don't know whether having the "do you consent" text in this way is a good way to do this given limited will. I also think you want to have a very specific signal for asking for consent that you commit to filtering out except when it is actually being used; this is so the AI isn't worried that it is in a red-teaming scenario, etc.)
To be clear, I think there are important additional considerations, not covered in that section, related to the fact that we don't just care about capabilities. That said, that section is not that far from what I would say if you renamed it to "behavioral tests", including both capabilities and alignment (that is, alignment other than stuff that messes with behavioral tests).
I think I disagree some with this change. Now I'd say something like: "We think the control line-of-defense should mostly focus on the time before we have enough evidence to relatively clearly demonstrate that the AI is consistently acting egregiously badly. However, the regime where we deploy a model despite having pretty strong empirical evidence (from the perspective of people like us) that that model is scheming is not out of scope."