Paul Christiano


Iterated Amplification

Wiki Contributions



  • I don't buy the story about long-horizon competence---I don't think there is a compelling argument, and the underlying intuitions seem like they are faring badly. I'd like to see this view turned into some actual predictions, and if it were I expect I'd disagree.
  • Calling it a "contradiction" or "extreme surprise" to have any capability without "wanting" looks really wrong to me.
  • Nate writes: 

This observable "it keeps reorienting towards some target no matter what obstacle reality throws in its way" behavior is what I mean when I describe an AI as having wants/desires "in the behaviorist sense"."

  • I think this is a semantic motte and bailey that's failing to think about mechanics of the situation. LM agents already have the behavior "reorient towards a target in response to obstacles," but that's not the sense of "wanting" about which people disagree or that is relevant to AI risk (which I tried to clarify in my comment). No one disagrees that an LM asked "how can I achieve X in this situation?" will be able to propose methods to achieve X, and those methods will be responsive to obstacles. But this isn't what you need for AI risk arguments!
  • I think this post is a bad answer to the question "when are the people who expected 'agents' going to update?" I think you should be updating some now and you should be discussing that in an answer. I think the post also fails to engage with the actual disagreements so it's not really advancing the discussion.

Okay, so you know how AI today isn't great at certain... let's say "long-horizon" tasks? Like novel large-scale engineering projects, or writing a long book series with lots of foreshadowing? [...] And you know how the AI doesn't seem to have all that much "want"- or "desire"-like behavior? [...] Well, I claim that these are more-or-less the same fact.

It's pretty unclear if a system that is good at answering the question "Which action would maximize the expected amount of X?" also "wants" X (or anything else) in the behaviorist sense that is relevant to arguments about AI risk. The question is whether if you ask that system "Which action would maximize the expected amount of Y?" whether it will also be wanting the same thing, or whether it will just be using cognitive procedures that are good at figuring out what actions lead to what consequences.

The point seems almost tautological to me, and yet also seems like the correct answer to the people going around saying “LLMs turned out to be not very want-y, when are the people who expected 'agents' going to update?”, so, here we are.

I think that a system may not even be able to "want" things in the behaviorist sense, and this is correlated with being unable to solve long-horizon tasks. So if you think that systems can't want things or solve long horizon tasks at all, then maybe you shouldn't update at all when they don't appear to want things.

But that's not really where we are at---AI systems are able to do an increasingly good job of solving increasingly long-horizon tasks. So it just seems like it should obviously be an update, and the answer to the original question 

Could you give an example of a task you don't think AI systems will be able to do before they are "want"-y? At what point would you update, if ever? What kind of engineering project requires an agent to be want-y to accomplish it? Is it something that individual humans can do? (It feels to me like you will give an example like "go to the moon" and that you will still be writing this kind of post even once AI systems have 10x'd the pace of R&D.) 

(The foreshadowing example doesn't seem very good to me. One way a human or an AI would write a story with foreshadowing is to first decide what will happen, and then write the story and include foreshadowing of the event you've already noted down. Do you think that series of steps is hard? Or that the very idea of taking that approach is hard? Or what?)

Like you, I think that future more powerful AI systems are more likely to want things in the behaviorist sense, but I have a different picture and think that you are overstating the connection between "wanting things" and "ability to solve long horizon tasks" (as well as overstating the overall case). I think a system which gets high reward across a wide variety of contexts is particularly likely to want reward in the behaviorist sense, or to want something which is consistently correlated with reward or for which getting reward is consistently instrumental during training. This seems much closer to a tautology. I think this tendency increases as models get more competent, but that it's not particularly about "ability to solve long-horizon tasks," and we are obviously getting evidence about it each time we train a new language model.


I don't think you need to reliably classify a system as safe or not.  You need to apply consistent standards that output "unsafe" in >90% of cases where things really are unsafe.

I think I'm probably imagining better implementation than you, probably because (based on context) I'm implicitly anchoring to the levels of political will that would be required to implement something like a global moratorium. I think what I'm describing as "very good RSPs" and imagining cutting risk 10x still requires significantly less political will than a global moratorium now (but I think this is a point that's up for debate).

So at that point you obviously aren't talking about 100% of countries voluntarily joining (instead we are assuming export controls implemented by the global community on straggling countries---which I don't even think seems very unrealistic at this point and IMO is totally reasonable for "very good"), and I'm not convinced open source models are a relevant risk (since the whole proposal is gating precautions on hazardous capabilities of models rather than size, and so again I think that's fair to include as part of "very good").

I would strongly disagree with a claim that +3 OOMs of effort and a many-year pause can't cut risk by much. I'm sympathetic to the claim that >10% of risk comes from worlds where you need to pursue the technology in a qualitatively different way to avoid catastrophe, but again in those scenarios I do think it's plausible for well-implemented RSPs to render some kinds of technologies impractical and therefore force developers to pursue alternative approaches.

I don't think an RSP will be able to address these risks, and I think very few AI policies would address these risks either. An AI pause could address them primarily by significantly slowing human technological development, and if that happened today I'm not even really these risks are getting better at an appreciable rate (if the biggest impact is the very slow thinking from a very small group of people who care about them, then I think that's a very small impact). I think that in that regime random political and social consequences of faster or slower technological development likely dominate the direct effects from becoming better prepared over time. I would have the same view in retrospect about e.g. a possible pause on AI development 6 years ago. I think at that point the amount of quality-adjusted work on alignment was probably higher than the quality-adjusted work on these kinds of risks today, but still the direct effects on increasingly alignment preparedness would be pretty tiny compared to random other incidental effects of a pause on the AI landscape.

I think that very good RSPs would effectively require a much longer pause if alignment turns out to be extremely difficult.

I do not know whether this kind of conditional pause is feasible even given that evidence. That said I think it's much more feasible to get such a pause as a result of good safety standards together with significant evidence of hazardous capabilities and alignment difficulty, and the 10x risk reduction is reflecting the probability that you are able to get that kind of evidence in advance of a catastrophe (but conditioning on a very good implementation).

The point of this comment is to explain why I am primarily worried about implementation difficulty, rather than about the risk that failures will occur before we detect them. It seems extremely difficult to manage risks even once they appear, and almost all of the risk comes from our failure to do so.

(Incidentally, I think some other participants in this discussion are advocating for an indefinite pause starting now, and so I'd expect them to be much more optimistic about this step than you appear to be.)

(I'm guessing you're not assuming that every lab in the world will adopt RSPs, though it's unclear. And even if every lab implements them presumably some will make mistakes in evals and/or protective measures)

I don't think that voluntary implementation of RSPs is a substitute for regulatory requirements and international collaboration (and tried to emphasize this in the post). In talking about a 10x risk reduction I'm absolutely imagining international coordination to regulate AI development.

In terms of "mistakes in evals" I don't think this is the right picture of how this works. If you have noticed serious enough danger that leading developers have halted further development, and also have multiple years of experience with those systems establishing alignment difficulty and the nature of dangerous capabilities, you aren't just relying on other developers to come up with their own independent assessments. You have an increasingly robust picture of what would be needed to proceed safely, and if someone claims that actually they are the one developer who has solved safety, that claim is going to be subject to extreme scrutiny.

unlikely that alignment difficulty is within the range of effort that we would put into the problem in normal-ish circumstances.

I don't really believe this argument. I guess I don't think situations will be that "normal-ish" in the world where a $10 trillion industry has been paused for years over safety concerns, and in that regime I think we have more like 3 orders of magnitude of gap between "low effort" and "high effort" which is actually quite large. I also think there very likely ways to get several orders of magnitude of additional output with AI systems using levels of caution that are extreme but knowably possible. And even if we can't solve the problem we could continue to invest in stronger understanding of risk, and with good enough understanding in hand I think there is a significant chance (perhaps 50%) that we could hold off on AI development for many years such that other game-changing technologies or institutional changes could arrive first.

Relatedly, I thought Managing AI Risks in an Era of Rapid Progress was great, particularly the clear statement that this is an urgent priority and the governance recommendations.

On a first reading I feel like I agree with most everything that was said, including about RSPs and the importance of regulation.

Small caveats: (i) I don't know enough to understand the implications or comment on the recommendation "they should also hold frontier AI developers and owners legally accountable for harms from their models that can be reasonably foreseen and prevented," (ii) "take seriously the possibility that generalist AI systems will outperform human abilities across many critical domains within this decade or the next" seems like a bit of a severe understatement that might undermine urgency (I think we should that possibility seriously over the next few years, and I'd give better than even odds that they will outperform humans across all critical domains within this decade or next), (iii) I think that RSPs / if-then commitments are valuable not just for bridging the period between now and when regulation is in place, but for helping accelerate more concrete discussions about regulation and building relevant infrastructure.

I'm a tiny bit nervous about the way that "autonomous replication" is used as a dangerous capability here and in other communications. I've advocated for it as a good benchmark task for evaluation and responses because it seems likely to be easier than almost anything catastrophic (including e.g. intelligence explosion, superhuman weapons R&D, organizing a revolution or coup...) and by the time it occurs there is a meaningful probability of catastrophe unless you have much more comprehensive evaluations in place. That said, I think most audiences will think it sounds somewhat improbable as a catastrophic risk in and of itself (and a bit science-fiction-y, in contrast with other risks like cybersecurity that also aren't existential in-and-of-themselves but sound much more grounded). So it's possible that while it makes a good evaluation target it doesn't make a good first item on a list of dangerous capabilities. I would defer to people who have a better understanding of politics and perception, I mostly raise the hesitation because I think ARC may have had a role in how focal it is in some of these discussions.

Unknown unknowns seem like a totally valid basis for concern.

But I don't think you get to move the burden of proof by fiat. If you want action then you need to convince relevant actors they should be concerned about them, and that unknown unknowns can cause catastrophe before a lab will stop.  Without further elaboration I don't think "unknown unknowns could cause a catastrophe" is enough to convince governments (or AI developers) to take significant actions.

I think RSPs make this situation better by pushing developers away from vague "Yeah we'll be safe" to saying "Here's what we'll actually do" and allowing us to have a conversation about whether that specific thing sufficient to prevent risk early enough. I think this is way better, because vagueness and equivocation make scrutiny much harder.

My own take is that there is small but non-negligible risk before Anthropic's ASL-3. For my part I'd vote to move to a lower threshold, or to require more stringent protective measures when working with any system bigger than LLaMA. But I'm not the median voter or decision-maker here (nor is Anthropic), and so I'll say my piece but then move on to trying to convince people or to find a compromise that works.

Here is a short post explaining some of my views on responsible scaling policies, regulation, and pauses I wrote it last week in response to several people asking me to write something. Hopefully this helps clear up what I believe.

I don’t think I’ve ever hidden my views about the dangers of AI or the advantages of scaling more slowly and carefully. I generally aim to give honest answers to questions and present my views straightforwardly. I often point out that catastrophic risk would be lower if we could coordinate to build AI systems later and slower; I usually caveat that doing so seems costly and politically challenging and so I expect it to require clearer evidence of risk.

But I also suspect that people on the more cynical side aren't going to be persuaded by a post like this. If you think that companies are pretending to care about safety but really are just racing to make $$, there's probably not much to say at this point other than, let's see what happens next.

This seems wrong to me. We can say all kinds of things, like:

  • Are these RSPs actually effective if implemented? How could they be better? (Including aspects like: how will this policy be updated in the future? What will happen given disagreements?)
  • Is there external verification that they are implemented well?
  • Which developers have and have not implemented effective and verifiable RSPs?
  • How could employees, the public, and governments push developers to do better?

I don't think we're just sitting here and rolling a die about which is going to happen, path #1 or path #2. Maybe that's right if you just are asking how much companies will do voluntarily, but I don't think that should be the exclusive focus (and if it was there wouldn't be much purpose to this more meta discussion).  One of my main points is that external stakeholders can look at what companies are doing, discuss ways in which it is or isn't adequate, and then actually push them to do better (and build support for government action to demand better). That process can start immediately, not at some hypothetical future time.

We intend to leave this prize open until the end of September. At that point we will distribute prizes (probably just small prizes for useful arguments and algorithms, but no full solution).

I now pretty strongly suspect that the version of problem 1 with logarithmic dependence on  is not solvable. We would award a prize for an algorithm running in time  which can distinguish matrices with no PSD completion from those with a completion where the ratio of min to max eigenvalue is at least . And of course a lower bound is still fair game.

That said, I don't expect any new submissions to win prizes and so wouldn't recommend that anyone start working on it.

Load More