Paul Christiano


Iterated Amplification

Wiki Contributions


As a concrete example, suppose that an AI fails to find an error in a piece of code (which will slightly undermine the security of some important computer system). You are unsure if it represents sabotage or a benign error. Relevant evidence for making that distinction:

  • Can the AI identify structurally similar errors in other code? Is the behavior sensitive to evidence about how the code will be used, in a way that doesn't seem relevant to the nature or salience of the error? Is the pattern of sensitivity suggestive of an attempt to disempower humanity?
  • Is the behavior sensitive to the AI's beliefs about the quality of human oversight? For example does training data suggesting more vigilant human oversight in this domain decrease the probability of the mistake? Do other AI systems trained in different ways reproduce a similar pattern of mistakes?
  • If we have one example of a mistake that looks suspicious, we can examine other similar mistakes and build up a detailed pattern of behavior. For example we can hold fix the context and then look at whether the AI misses other kinds of software errors. With multiple mistakes in hand we can run all this analysis on a broader spectrum of mistakes. We can iteratively modify scenarios to explore the landscape of deceptive behavior, gradient descenting our way towards stronger and stronger evidence.

As another example, suppose that an AI system tries to modulate its behavior in any way based on whether it believes it is being closely monitored vs operating in the real world. And suppose that you find an example in the lab where the AI system messes up, and therefore where there is an unusually large gap between its ability to solve a problem in a "realistic-looking" situation vs a situation that's obviously a test. The initial behavior isn't super suspicious (sometimes behavior randomly varies with framing), but it's suspicious enough to get flagged for more thorough review. Subsequent rounds find further evidence of sensitivity, and eventually it's pretty likely (though still not "legal smoking gun") that we've got an example of an AI systems modulating its behavior based on whether it believes it is being tested.

At that point you can start collecting data about what cues the AI system is using to determine whether it is being tested, by varying features of the context and seeing whether you see nicer or less nice behavior. And once you have this understanding in hand you can use it in other contexts to try to get the AI to mess up more. As you find more examples you can build a richer picture, in a virtuous cycle (like normal science).

I think being an AI in this situation is pretty untenable if you are up against human scientists trying hard to figure out what's going on with a typical level of rigor and competence. And so my guess is that typical competent malicious AI systems won't try to modulate their behavior in a significant way based on whether they believe they are being actively tested/monitored. (But it's still worth looking extremely hard, and e.g. it's likely that some AI systems will mess up and try to pull this kind of stunt.) And that's a huge difference with the human case: changing your behavior when you are being closely watched is one of the key things that a human needs to do in order to preserve plausible deniability, and one of the key features of many AI safety horror stories.

(We could tell an identical story about cases where AI systems try to behave badly in ways that are too subtle for humans to notice, but mess up in a single case by underestimating human abilities. In that case we get a clear example of sensitivity to information about human competence, and again can explore the whole frontier of this behavior.)

This kind of possibility forces AI systems to push back bad behavior into cases where they are more and more confident that it's never going to be noticed. But the space of interventions we get to try, if we subject this issue to rigorous scientific scrutiny, includes significantly modifying the AI's training data and limiting information about the world. So "super confident that humans will never notice" is a very high bar.

just got kind of unlucky in how its context shifted it into a more deceptive role-play (but without long-term planning being involved) or was actually doing some more coherent long-term deception plan.

I'm not sure this distinction is particularly important. If ChatGPT is taking actions to undermine human control of AI, and you can establish a pattern of behavior, that's a big deal which will cause scientists and the world to sit up and take interest. I don't think a defense like "it's just trying to undermine human control because it's read a lot of stories about AI systems taking over" or "it lies and manipulates the user because that's what a sociopathic human would do, and RL training put it into a more sociopathic mode" are going to fly very well.

If this is happening at a point in time when the AI system is competent enough to have real consequences, then I think it's going to be an extremely good basis for a policy reaction.

We've seen a tiny example of this with Sydney, in a system where it has almost zero consequences and where it's pretty much random rather than having a pattern of behavior, and that's already had significant effects on public and scientific perception.

Overall I would agree it takes a lot of work to actually do the monitoring (especially if AI developers want to avoid accountability), to establish a pattern of behavior, to amplify a little bit of evidence into a robust scientific picture, to diagnose the issue appropriately and avoid overfitting, etc. I think some people have the view that the evidence is already in and nothing will change, but I think that's very wrong, and it seems to me like it's one of the more harmful features of the AI safety ideological bubble. (If you go out into the broader scientific community, I think "do you have evidence" is really a central hangup, and the standards of evidence being used are overall pretty reasonable IMO.)

One difference is that if you observe a human behaving deceptively you are forced to make educated guesses about their motives and context, but if you catch an AI behaving deceptively you are able to record the exact context in which it acted, examine its behavior in a large number of counterfactual situations,  run other AI systems in exactly the same situation, etc.

I think this removes most of the leeway that humans use to maintain plausible deniability, especially in high-stakes situations with thorough investigations. Overall I'm more intuitively worried about cases where humans never flag any suspicious behavior (e.g. because its thoroughly undetectable, or occurs only once an AI takeover is underway) than cases where they flag suspicious behavior but the AI is able to maintain plausible deniability during a thorough investigation.

If this is what's going on, then I basically can't imagine any context in which I would want someone to read the OP rather a post than showing examples of LM agents achieving goals and saying "it's already the case that LM agents want things, more and more deployments of LMs will be agents, and those agents will become more competent such that it would be increasingly scary if they wanted something at cross-purposes to humans." Is there something I'm missing?

I think your interpretation of Nate is probably wrong, but I'm not sure and happy to drop it.

If you use that definition, I don't understand in what sense LMs don't "want" things---if you prompt them to "take actions to achieve X" then they will do so, and if obstacles appear they will suggest ways around them, and if you connect them to actuators they will frequently achieve X even in the face of obstacles, etc. By your definition isn't that "want" or "desire" like behavior? So what does it mean when Nate says "AI doesn't seem to have all that much "want"- or "desire"-like behavior"?

I'm genuinely unclear what the OP is asserting at that point, and it seems like it's clearly not responsive to actual people in the real world saying "LLMs turned out to be not very want-y, when are the people who expected 'agents' going to update?” People who say that kind of thing mostly aren't saying that LMs can't be prompted to achieve outcomes. They are saying that LMs don't want things in the sense that is relevant to usual arguments about deceptive alignment or reward hacking (e.g. don't seem to have preferences about the training objective, or that are coherent over time).

If your AI system "wants" things in the sense that "when prompted to get X it proposes good strategies for getting X that adapt to obstacles," then you can control what it wants by giving it a different prompt. Arguments about AI risk rely pretty crucially on your inability to control what the AI wants, and your inability to test it. Saying "If you use an AI to achieve a long-horizon task, then the overall system definitionally wanted to achieve that task" + "If your AI wants something, then it will undermine your tests and safety measures" seems like a sleight of hand, most of the oomph is coming from equivocating between definitions of want.

You say:

I definitely don't endorse "it's extremely surprising for there to be any capabilities without 'wantings'" and I expect Nate doesn't either.

But the OP says:

to imagine the AI starting to succeed at those long-horizon tasks without imagining it starting to have more wants/desires (in the "behaviorist sense" expanded upon below) is, I claim, to imagine a contradiction—or at least an extreme surprise

This seems to strongly imply that a particular capability---succeeding at these long horizon tasks---implies the AI has "wants/desires." That's what I'm saying seems wrong.


  • I don't buy the story about long-horizon competence---I don't think there is a compelling argument, and the underlying intuitions seem like they are faring badly. I'd like to see this view turned into some actual predictions, and if it were I expect I'd disagree.
  • Calling it a "contradiction" or "extreme surprise" to have any capability without "wanting" looks really wrong to me.
  • Nate writes: 

This observable "it keeps reorienting towards some target no matter what obstacle reality throws in its way" behavior is what I mean when I describe an AI as having wants/desires "in the behaviorist sense"."

  • I think this is a semantic motte and bailey that's failing to think about mechanics of the situation. LM agents already have the behavior "reorient towards a target in response to obstacles," but that's not the sense of "wanting" about which people disagree or that is relevant to AI risk (which I tried to clarify in my comment). No one disagrees that an LM asked "how can I achieve X in this situation?" will be able to propose methods to achieve X, and those methods will be responsive to obstacles. But this isn't what you need for AI risk arguments!
  • I think this post is a bad answer to the question "when are the people who expected 'agents' going to update?" I think you should be updating some now and you should be discussing that in an answer. I think the post also fails to engage with the actual disagreements so it's not really advancing the discussion.

Okay, so you know how AI today isn't great at certain... let's say "long-horizon" tasks? Like novel large-scale engineering projects, or writing a long book series with lots of foreshadowing? [...] And you know how the AI doesn't seem to have all that much "want"- or "desire"-like behavior? [...] Well, I claim that these are more-or-less the same fact.

It's pretty unclear if a system that is good at answering the question "Which action would maximize the expected amount of X?" also "wants" X (or anything else) in the behaviorist sense that is relevant to arguments about AI risk. The question is whether if you ask that system "Which action would maximize the expected amount of Y?" whether it will also be wanting the same thing, or whether it will just be using cognitive procedures that are good at figuring out what actions lead to what consequences.

The point seems almost tautological to me, and yet also seems like the correct answer to the people going around saying “LLMs turned out to be not very want-y, when are the people who expected 'agents' going to update?”, so, here we are.

I think that a system may not even be able to "want" things in the behaviorist sense, and this is correlated with being unable to solve long-horizon tasks. So if you think that systems can't want things or solve long horizon tasks at all, then maybe you shouldn't update at all when they don't appear to want things.

But that's not really where we are at---AI systems are able to do an increasingly good job of solving increasingly long-horizon tasks. So it just seems like it should obviously be an update, and the answer to the original question 

Could you give an example of a task you don't think AI systems will be able to do before they are "want"-y? At what point would you update, if ever? What kind of engineering project requires an agent to be want-y to accomplish it? Is it something that individual humans can do? (It feels to me like you will give an example like "go to the moon" and that you will still be writing this kind of post even once AI systems have 10x'd the pace of R&D.) 

(The foreshadowing example doesn't seem very good to me. One way a human or an AI would write a story with foreshadowing is to first decide what will happen, and then write the story and include foreshadowing of the event you've already noted down. Do you think that series of steps is hard? Or that the very idea of taking that approach is hard? Or what?)

Like you, I think that future more powerful AI systems are more likely to want things in the behaviorist sense, but I have a different picture and think that you are overstating the connection between "wanting things" and "ability to solve long horizon tasks" (as well as overstating the overall case). I think a system which gets high reward across a wide variety of contexts is particularly likely to want reward in the behaviorist sense, or to want something which is consistently correlated with reward or for which getting reward is consistently instrumental during training. This seems much closer to a tautology. I think this tendency increases as models get more competent, but that it's not particularly about "ability to solve long-horizon tasks," and we are obviously getting evidence about it each time we train a new language model.


I don't think you need to reliably classify a system as safe or not.  You need to apply consistent standards that output "unsafe" in >90% of cases where things really are unsafe.

I think I'm probably imagining better implementation than you, probably because (based on context) I'm implicitly anchoring to the levels of political will that would be required to implement something like a global moratorium. I think what I'm describing as "very good RSPs" and imagining cutting risk 10x still requires significantly less political will than a global moratorium now (but I think this is a point that's up for debate).

So at that point you obviously aren't talking about 100% of countries voluntarily joining (instead we are assuming export controls implemented by the global community on straggling countries---which I don't even think seems very unrealistic at this point and IMO is totally reasonable for "very good"), and I'm not convinced open source models are a relevant risk (since the whole proposal is gating precautions on hazardous capabilities of models rather than size, and so again I think that's fair to include as part of "very good").

I would strongly disagree with a claim that +3 OOMs of effort and a many-year pause can't cut risk by much. I'm sympathetic to the claim that >10% of risk comes from worlds where you need to pursue the technology in a qualitatively different way to avoid catastrophe, but again in those scenarios I do think it's plausible for well-implemented RSPs to render some kinds of technologies impractical and therefore force developers to pursue alternative approaches.

I don't think an RSP will be able to address these risks, and I think very few AI policies would address these risks either. An AI pause could address them primarily by significantly slowing human technological development, and if that happened today I'm not even really these risks are getting better at an appreciable rate (if the biggest impact is the very slow thinking from a very small group of people who care about them, then I think that's a very small impact). I think that in that regime random political and social consequences of faster or slower technological development likely dominate the direct effects from becoming better prepared over time. I would have the same view in retrospect about e.g. a possible pause on AI development 6 years ago. I think at that point the amount of quality-adjusted work on alignment was probably higher than the quality-adjusted work on these kinds of risks today, but still the direct effects on increasingly alignment preparedness would be pretty tiny compared to random other incidental effects of a pause on the AI landscape.

I think that very good RSPs would effectively require a much longer pause if alignment turns out to be extremely difficult.

I do not know whether this kind of conditional pause is feasible even given that evidence. That said I think it's much more feasible to get such a pause as a result of good safety standards together with significant evidence of hazardous capabilities and alignment difficulty, and the 10x risk reduction is reflecting the probability that you are able to get that kind of evidence in advance of a catastrophe (but conditioning on a very good implementation).

The point of this comment is to explain why I am primarily worried about implementation difficulty, rather than about the risk that failures will occur before we detect them. It seems extremely difficult to manage risks even once they appear, and almost all of the risk comes from our failure to do so.

(Incidentally, I think some other participants in this discussion are advocating for an indefinite pause starting now, and so I'd expect them to be much more optimistic about this step than you appear to be.)

(I'm guessing you're not assuming that every lab in the world will adopt RSPs, though it's unclear. And even if every lab implements them presumably some will make mistakes in evals and/or protective measures)

I don't think that voluntary implementation of RSPs is a substitute for regulatory requirements and international collaboration (and tried to emphasize this in the post). In talking about a 10x risk reduction I'm absolutely imagining international coordination to regulate AI development.

In terms of "mistakes in evals" I don't think this is the right picture of how this works. If you have noticed serious enough danger that leading developers have halted further development, and also have multiple years of experience with those systems establishing alignment difficulty and the nature of dangerous capabilities, you aren't just relying on other developers to come up with their own independent assessments. You have an increasingly robust picture of what would be needed to proceed safely, and if someone claims that actually they are the one developer who has solved safety, that claim is going to be subject to extreme scrutiny.

unlikely that alignment difficulty is within the range of effort that we would put into the problem in normal-ish circumstances.

I don't really believe this argument. I guess I don't think situations will be that "normal-ish" in the world where a $10 trillion industry has been paused for years over safety concerns, and in that regime I think we have more like 3 orders of magnitude of gap between "low effort" and "high effort" which is actually quite large. I also think there very likely ways to get several orders of magnitude of additional output with AI systems using levels of caution that are extreme but knowably possible. And even if we can't solve the problem we could continue to invest in stronger understanding of risk, and with good enough understanding in hand I think there is a significant chance (perhaps 50%) that we could hold off on AI development for many years such that other game-changing technologies or institutional changes could arrive first.

Load More