Paul Christiano



What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

Failure mode: When B-cultured entities invest in "having more influence", often the easiest way to do this will be for them to invest in or copy A'-cultured-entities/processes.  This increases the total presence of A'-like processes in the world, which have many opportunities to coordinate because of their shared (power-maximizing) values.  Moreover, the A' culture has an incentive to trick the B culture(s) into thinking A' will not take over the world, but eventually, A' wins.

I'm wondering why the easiest way is to copy A'---why was A' better at acquiring influence in the first place, so that copying them or investing in them is a dominant strategy? I think I agree that once you're at that point, A' has an advantage.

In other words, the humans and human-aligned institutions not collectively being good enough at cooperation/bargaining risks a slow slipping-away of hard-to-express values and an easy takeover of simple-to-express values (e.g., power-maximization).

This doesn't feel like other words to me, it feels like a totally different claim.

Thanks for noticing whatever you think are the inconsistencies; if you have time, I'd love for you to point them out.

In the production web story it sounds like the web is made out of different firms competing for profit and influence with each other, rather than a set of firms that are willing to leave profit on the table to benefit one another since they all share the value of maximizing production. For example, you talk about how selection drives this dynamic, but the firms that succeed are those that maximize their own profits and influence (not those that are willing to leave profit on the table to benefit other firms).

So none of Wei Dai's concrete examples of economies of scale actually seems to apply, or to give an advantage to the profit-maximizers in the production web. For example, natural monopolies in the production web wouldn't charge each other marginal costs; they would charge profit-maximizing prices. And they won't share infrastructure investments except by solving exactly the same bargaining problem as any other agents (since a firm that indiscriminately shared its infrastructure would get outcompeted). And so on.
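As a toy illustration of the pricing point (purely hypothetical numbers, not drawn from the original discussion): a profit-maximizing natural monopoly facing linear demand charges well above marginal cost, whereas "cooperative" marginal-cost pricing leaves it with nothing.

```python
# Toy model (illustrative only): a natural monopoly facing linear demand
# q = a - b * p, with constant marginal cost c. A firm that charges
# marginal cost earns zero profit; a profit-maximizing firm charges the
# monopoly price p* = (a/b + c) / 2, well above marginal cost.
a, b, c = 100.0, 2.0, 10.0  # hypothetical demand/cost parameters

def profit(p):
    q = max(a - b * p, 0.0)  # quantity demanded at price p
    return (p - c) * q       # per-unit margin times quantity sold

p_marginal = c                # marginal-cost pricing: zero profit
p_monopoly = (a / b + c) / 2  # profit-maximizing monopoly price

print(profit(p_marginal))  # 0.0
print(profit(p_monopoly))  # 800.0 with these parameters
```

Under selection for profit, the firm charging `p_monopoly` outcompetes one charging `p_marginal`, which is the sense in which indiscriminate cost-sharing would be selected against.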

Specifically, the subprocesses of each culture that are in charge of production-maximization end up cooperating really well with each other in a way that ends up collectively overwhelming the original (human) cultures.

This seems like a core claim (certainly if you are envisioning a scenario like the one Wei Dai describes), but I don't yet understand why this happens.

Suppose that the US and China both have productive widget-industries. You seem to be saying that their widget-industries can coordinate with each other to create lots of widgets, and they will do this more effectively than the US and China can coordinate with each other.

Could you give some concrete example of how the US widget industry and the Chinese widget industries coordinate with each other to make more widgets, and why this behavior is selected?

For example, you might think that the Chinese and US widget industry share their insights into how to make widgets (as the aligned actors do in Wei Dai's story), and that this will cause widget-making to do better than other non-widget sectors where such coordination is not possible. But I don't see why they would do that---the US firms that share their insights freely with Chinese firms do worse, and would be selected against in every relevant sense, relative to firms that attempt to effectively monetize their insights. But effectively monetizing their insights is exactly what the US widget industry should do in order to benefit the US. So I see no reason why the widget industry would be more prone to sharing its insights.

So I don't think that particular example works. I'm looking for an example of that form though, some concrete form of cooperation that the production-maximization subprocesses might engage in that allows them to overwhelm the original cultures, to give some indication for why you think this will happen in general.

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

For some reason when I express opinions of the form "Alignment isn't the most valuable thing on the margin", alignment-oriented folks (e.g., Paul here) seem to think I'm saying you shouldn't work on alignment.

In fairness, writing “marginal deep-thinking researchers [should not] allocate themselves to making alignment […] cheaper/easier/better” is pretty similar to saying “one shouldn’t work on alignment.”

(I didn’t read you as saying that Paul or Rohin shouldn’t work on alignment, and indeed I’d care much less about that than about a researcher at CHAI arguing that CHAI students shouldn’t work on alignment.)

On top of that, in your prior post you make stronger claims:

  • “Contributions to OODR research are not particularly helpful to existential safety in my opinion.”
  • “Contributions to preference learning are not particularly helpful to existential safety in my opinion.”
  • “In any case, I see AI alignment in turn as having two main potential applications to existential safety:” (excluding the main channel Paul cares about and argues for, namely that making alignment easier improves the probability that the bulk of deployed ML systems are aligned and reduces the competitive advantage for misaligned agents)

In the current post you (mostly) didn’t make claims about the relative value of different areas, and so I was (mostly) objecting to arguments that I consider misleading or incorrect. But you appeared to be sticking with the claims from your prior post and so I still ascribed those views to you in a way that may have colored my responses.

maybe that will trigger less pushback of the form "No, alignment is the most important thing"... 

I’m not really claiming that AI alignment is the most important thing to work on (though I do think it’s among the best ways to address problems posed by misaligned AI systems in particular). I’m generally supportive of and excited about a wide variety of approaches to improving society’s ability to cope with future challenges (though multi-agent RL or computational social choice would not be near the top of my personal list).

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

Failing to cooperate on alignment is the problem, and solving it involves being both good at cooperation and good at alignment

Sounds like we are on broadly the same page. I would have said "Aligning ML systems is more likely if we understand more about how to align ML systems, or are better at coordinating to differentially deploy aligned systems, or are wiser or smarter or..." and then moved on to talking about how alignment research quantitatively compares to improvements in various kinds of coordination or wisdom or whatever. (My bottom line from doing this exercise is that I feel more general capabilities typically look less cost-effective on alignment in particular, but benefit a ton from the diversity of problems they help address.)

My prior (and present) position is that reliability meeting a certain threshold, rather than being optimized, is a dominant factor in how soon deployment happens.

I don't think we can get to convergence on many of these discussions, so I'm happy to just leave it here for the reader to think through.

Reminder: this is not a bid for you personally to quit working on alignment!

I'm reading this (and your prior post) as bids for junior researchers to shift what they focus on. My hope is that seeing the back-and-forth in the comments will, in expectation, help them decide better.

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

Both are aiming to preserve human values, but within A, a subculture A' develops to favor more efficient business practices (nihilistic power-maximizing) over preserving human values.

I was asking you why you thought A' would effectively outcompete B (sorry for being unclear). For example, why do people with intrinsic interest in power-maximization outcompete people who are interested in human flourishing but still invest their money to have more influence in the future?

  • One obvious reason is single-single misalignment---A' is willing to deploy misaligned AI in order to get an advantage, while B isn't---but you say "their tech is aligned with them" so it sounds like you're setting this aside. But maybe you mean that A' has values that make alignment easy, while B has values that make alignment hard, and so B's disadvantage still comes from single-single misalignment even though A''s systems are aligned?
  • Another advantage is that A' can invest almost all of their resources, while B wants to spend some of their resources today to e.g. help presently-living humans flourish. But quantitatively that advantage doesn't seem like it can cause A' to dominate, since B can secure rapidly rising quality of life for all humans using only a small fraction of its initial endowment.
  • Wei Dai has suggested that groups with unified values might outcompete groups with heterogeneous values since homogeneous values allow for better coordination, and that AI may make this phenomenon more important. For example, if a research-producer and research-consumer have different values, then the producer may restrict access as part of an inefficient negotiation process and so they may be at a competitive disadvantage relative to a competing community where research is shared freely. This feels inconsistent with many of the things you are saying in your story, but I might be misunderstanding what you are saying and it could be that some argument like Wei Dai's is the best way to translate your concerns into my language.
  • My sense is that you have something else in mind. I included the last bullet point as a representative example to describe the kind of advantage I could imagine you thinking that A' had.
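The quantitative claim in the second bullet can be illustrated with a toy compounding model (all numbers hypothetical): even if A' reinvests everything while B consumes a small fraction each period, B's absolute wealth and consumption still grow exponentially, so B is not driven toward zero quality of life even as its relative share shrinks.

```python
# Toy compounding model (illustrative assumptions only).
# A' reinvests all resources; B consumes a small fraction each period
# and reinvests the rest.
g = 0.10        # hypothetical per-period return on invested resources
consume = 0.02  # fraction of holdings B consumes each period
T = 100         # number of periods

a_wealth, b_wealth = 1.0, 1.0
b_consumed_total = 0.0
for _ in range(T):
    a_wealth *= (1 + g)                    # A' compounds at the full rate
    spent = consume * b_wealth             # B spends a bit on present flourishing
    b_consumed_total += spent
    b_wealth = (b_wealth - spent) * (1 + g)

# A' ends with a larger share, but B's wealth (and consumption stream)
# still grows exponentially rather than being driven to zero.
print(a_wealth, b_wealth, b_consumed_total)
```

This is only a sketch of the argument's arithmetic: the relative share of A' grows, but "dominance" in the sense of B being unable to secure rising quality of life does not follow from this advantage alone.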
Another (outer) alignment failure story

I think that most likely either humans are killed incidentally as part of the sensor-hijacking (since that's likely to be the easiest way to deal with them), or else AI systems reserve a negligible fraction of their resources to keep humans alive and happy (but disempowered) based on something like moral pluralism or being nice or acausal trade (e.g. the belief that much of their influence comes from the worlds in which they are simulated by humans who didn't mess up alignment and who would be willing to exchange a small part of their resources in order to keep the people in the story alive and happy).

The main point of intervention in this scenario that stood out to me would be making sure that (during the paragraph beginning with "For many people this is a very scary situation.") we at least attempt to use AI-negotiators to try to broker an international agreement to stop development of this technology until we understood it better (and using AI-designed systems for enforcement/surveillance). Is there anything in particular that makes this infeasible?

I don't think this is infeasible. It's not the intervention I'm most focused on, but it may be the easiest way to avoid this failure (and it's an important channel for advance preparations to make things better / important payoff for understanding what's up with alignment and correctly anticipating problems).

Another (outer) alignment failure story

I understand the scenario to say it isn't because the demonstrations are incomprehensible

Yes, if demonstrations are comprehensible then I don't think you need much explicit AI conflict to whistleblow since we will train some systems to explain risks to us.


The global camera grab must involve plans that aren't clearly bad to humans even when all the potential gotchas are pointed out. For example they may involve dynamics that humans just don't understand, or where a brute force simulation or experiment would be prohibitively expensive without leaps of intuition that machines can make but humans cannot. Maybe that's about tiny machines behaving in complicated ways or being created covertly, or crazy complicated dynamics of interacting computer systems that humans can't figure out. It might involve the construction of new AI-designed AI systems which operate in different ways whose function we can't really constrain except by seeing predictions of their behavior from an even-greater distance (machines which are predicted to lead to good-looking outcomes, which have been able to exhibit failures to us if so-incentivized, but which are even harder to control).

(There is obviously a lot you could say about all the tools at the human's disposal to circumvent this kind of problem.)

This is one of the big ways in which the story is more pessimistic than my default, and perhaps the highlighted assumptions rule out the most plausible failures, especially (i) multi-year takeoff, (ii) reasonable competence on the part of civilization, (iii) "correct" generalization.

Even under those assumptions I do expect events to eventually become incomprehensible in the necessary ways, but it feels more likely that there will be enough intervening time for ML systems to e.g. solve alignment or help us shift to a new world order or whatever. (And as I mention, in the worlds where the ML systems can't solve alignment well enough in the intervening time, I do agree that it's unlikely we can solve it in advance.)

Another (outer) alignment failure story

I'm a bit surprised that the outcome is worse than you expect, considering that this scenario is "easy mode" for societal competence and inner alignment, which seem to me to be very important parts of the overall problem.

The main way it's worse than I expect is that I expect future people to have a long (subjective) time to solve these problems and to make much more progress than they do in this story.

 Am I right to infer that you think outer alignment is the bulk of the alignment problem, more difficult than inner alignment and societal competence?

I don't think it's right to infer much about my stance on inner vs outer alignment. I don't know if it makes sense to split out "social competence" in this way. 

In this story, there aren't any major actual wars, just simulated wars / war games. Right? Why is that? I look at the historical base rate of wars, and my intuitive model adds to that by saying that during times of rapid technological change it's more likely that various factions will get various advantages (or even just think they have advantages) that make them want to try something risky. OTOH we haven't had major war for seventy years, and maybe that's because of nukes + other factors, and maybe nukes + other factors will still persist through the period of takeoff?

The lack of a hot war in this story is mostly from the recent trend. There may be a hot war prior to things heating up, and then the "takeoff" part of the story is subjectively shorter than the last 70 years.

IDK, I worry that the reasons why we haven't had war for seventy years may be largely luck / observer selection effects, and also separately even if that's wrong

I'm extremely skeptical of an appeal to observer selection effects changing the bottom line about what we should infer from the last 70 years. Luck sounds fine though.

Relatedly, in this story the AIs seem to be mostly on the same team? What do you think is going on "under the hood" so to speak: Have they all coordinated (perhaps without even causally communicating) to cut the humans out of control of the future?

I don't think the AI systems are all on the same team. That said, to the extent that there are "humans are deluded" outcomes that are generally preferable according to many AIs' values, I think the AIs will tend to bring about such outcomes. I don't have a strong view on whether that involves explicit coordination. I do think the range of everyone-wins outcomes (amongst AIs) is larger because of the "AIs generalize 'correctly'" assumption, so this story probably feels a bit more like "us vs them" than a story that relaxed that assumption.

Why aren't they fighting each other as well as the humans? Or maybe they do fight each other but you didn't focus on that aspect of the story because it's less relevant to us?

I think they are fighting each other all the time, though mostly in very prosaic ways (e.g. McDonald's and Burger King's marketing AIs are directly competing for customers). Are there some particular conflicts you imagine that are suppressed in the story?

I feel like when takeoff is that distributed, there will be at least some people/factions who create agenty AI systems that aren't even as superficially aligned as the unaligned benchmark. They won't even be trying to make things look good according to human judgment, much less augmented human judgment!

I'm imagining that's the case in this story.

Failure is early enough in this story that e.g. the humans' investment in sensor networks and rare expensive audits isn't slowing them down very much compared to the "rogue" AI.

Such "rogue" AI could provide a competitive pressure, but I think it's a minority of the competitive pressure overall (and at any rate it has the same role/effect as the other competitive pressure described in this story).

Can you say more about how "the failure modes in this story are an important input into treachery?"

We will be deploying many systems to anticipate/prevent treachery. If we could stay "in the loop" in the sense that would be needed to survive this outer alignment story, then I think we would also be "in the loop" in roughly the sense needed to avoid treachery. (Though it's not obvious in light of the possibility of civilization-wide cascading ML failures, and does depend on further technical questions about techniques for avoiding that kind of catastrophe.)

Another (outer) alignment failure story

I currently can't tell if by "outer alignment failure" you're referring to the entire ecosystem of machines being outer-misaligned, or just each individual machine (and if so, which ones in particular), and I'd like to sync with your usage of the concept if possible (or at least know how to sync with it).

I'm saying each individual machine is misaligned, because each individual machine is searching over plans to find one that leads to an outcome that humans will judge as good in hindsight. The collective behavior of many machines each individually trying to make things look good in hindsight leads to an outcome where things look good in hindsight. All the machines achieve what they are trying to achieve (namely things look really good according to the judgments-in-hindsight), but humans are marginalized and don't get what they want, and that's consistent because no machines cared about humans getting what they want. This is not a story where some machines were trying to help humans but were frustrated by emergent properties of their interaction.

I realize you don't have a precise meaning of outer misalignment in mind, but confusion around this concept is central to the (in my opinion) confused expectation that "alignment solutions" are adequate (on the technological side) for averting AI x-risk.

I use "outer alignment" to refer to a step in some alignment approaches. It is a well-defined subproblem for some approaches (namely those that aim to implement a loss function that accurately reflects human preferences over system behavior, and then produce an aligned system by optimizing that loss function), and obviously inapplicable to some approaches, and kind of a fuzzy and vague subproblem of others.

It's a bit weird to talk about a failure story as an "outer" alignment failure story, or to describe a general system acting in the world as "outer misaligned," since most possible systems weren't built by following an alignment methodology that admits a clean division into an "outer" and "inner" part.

I added the word "(outer)" in the title as a parenthetical to better flag the assumption about generalization mentioned in the appendix. I expected this flag to be meaningful for many readers here. If it's not meaningful to you then I would suggest ignoring it.

If there's anything useful to talk about in that space I think it's the implicit assumption (made explicit in the first bullet of the appendix) about how systems generalize. Namely, you might think that a system that is trained to achieve outcomes that look good to a human will in fact be trying to do something quite different. I think there's a pretty good chance of that, in which case this story would look different (because the ML systems would conspire to disempower humans much earlier in the story). However, it would still be the case that we fail because individual systems are trying to bring about failure.

confused expectation that "alignment solutions" are adequate (on the technological side) for averting AI x-risk.

Note that this isn't my view about intent alignment. (Though it is tautologically true for people who define "alignment" as "the problem of building AI systems that produce good outcomes when run," a definition that, as I've said, I quite dislike.)

I think there are many x-risks posed or exacerbated by AI progress beyond intent alignment problems. (Though I do think that intent alignment is sufficient to avoid e.g. the concern articulated in your production web story.)

It's conceivable to me that making future narratives much more specific regarding the intended goals of AI designers

The people who design AI (and moreover the people who use AI) have a big messy range of things they want. They want to live happy lives, and to preserve their status in the world, and to be safe from violence, and to be respected by people they care about, and similar things for their children...

When they invest in companies, or buy products from companies, or try to pass laws, they do so as a means to those complicated ends. That is, they hope that in virtue of being a shareholder of a successful company (or whatever) they will be in a better position to achieve their desires in the future.

One axis of specificity is to say things about what exactly they are imagining getting out of their investments or purchases (which will inform lots of low level choices they make). For example: the shareholders expect this company to pay dividends into their bank accounts, and they expect to be able to use the money in their bank accounts to buy things they want in the future, and they expect that if the company is not doing a good job they will be able to vote to replace the CEO, and so on. Some of the particular things they imagine buying: real estate and news coverage and security services.  If they purchase security services: they hope that those security services will keep them safe in some broad and intuitive sense. There are some components of that they can articulate easily (e.g. they don't want to get shot) and some they can't (e.g. they want to feel safe, they don't want to be coerced, they want to retain as much flexibility as possible when using public facilities, etc.).

A second axis would be to break this down to the level of "single" AI systems, i.e. individual components which are optimized end-to-end. For example, one could enumerate the AI systems involved in running a factory or fighting a war or some other complex project. There are probably thousands of AI systems involved in each of those projects, but you could zoom in on some particular examples, e.g. what AI system is responsible for making decisions about the flight path of a particular drone, and then zoom in on one of the many AI systems involved in the choice to deploy that particular AI (and how to train it). We could talk about how these individual AI systems, each trying to make things look good in hindsight (or pursuing instrumental subgoals thereof), collectively bring about an outcome that looks good in hindsight. (Though mostly I regard that as non-mysterious---if you have a bunch of AI systems trying to achieve X, or identifying intermediates Y that would tend to lead to X and then deploying new AI to achieve Y, it's clear enough how that can lead to X. I also agree that it can lead to non-X, but that doesn't really happen in this story.)

A third axis would be to talk in more detail about exactly how a particular AI is constructed, e.g. over what time period is training data gathered from what sensors? How are simulated scenarios generated, when those are needed? What humans and other ML systems are involved in the actual evaluation of outcomes that is used to train and validate it?

For each of those three axes (and many others) it seems like there's a ton of things one could try to specify more precisely. You could easily write a dozen pages about the training of a single AI system, or a dozen pages enumerating an overview of the AI systems involved in a single complex project, or a dozen pages describing the hopes and intentions of the humans interacting with a particular AI. So you have to be pretty picky about which you spell out.

My question: Are you up for making your thinking and/or explaining about outer misalignment a bit more narratively precise here?  E.g., could you say something like "«machine X» in the story is outer-misaligned because «reason»"?

Do you mean explaining why I judge these systems to be misaligned (a), or explaining causally how it is that they became misaligned (b)?

For (a): I'm judging these systems to be misaligned because they take concrete actions that they can easily determine are contrary to what their operators want. Skimming my story again, here are the main concrete decisions that I would describe as obviously contrary to the user's intentions:

  • The Ponzi scheme and factory that fabricates earnings reports understand that customers will be unhappy about this when they discover it several months in the future, yet they take those actions anyway. Although these failures are not particularly destructive on their own, they are provided as representative examples of a broader class of "alignment warning shots" that are happening and provide the justification for people deploying AI systems that avoid human disapproval over longer and longer time horizons.
  • The watchdogs who alternately scare or comfort us (based on what we asked for), with none of them explaining honestly what is going on, are misaligned. If we could build aligned systems, then those systems would sit down with us and talk about the risks and explain what's up as best they can, they would explain the likely bad outcomes in which sensors are corrupted and how that corruption occurs, and they would advise on e.g. what policies would avoid that outcome.
  • The machines that build/deploy/defend sensor networks are misaligned, which is why they actively insert vulnerabilities that would be exploited by attackers who intend to "cooperate" and avoid creating an appearance of trouble. Those vulnerabilities are not what the humans want in any sense. Similarly, the defense system that allows invaders to take over a city, as long as they participate in perpetuating an illusion of security, is obviously misaligned.
  • The machines that actually hack cameras and seize datacenters are misaligned, because the humans don't actually care about the cameras showing happy pictures or the datacenters recording good news. Machines were deployed to optimize those indicators because they can serve as useful proxies for "we are actually safe and happy."

Most complex activities involve a large number of components, and I agree that these descriptions are still "multi-agent" in the sense that e.g. managing an investment portfolio involves multiple distinct AIs. (The only possible exception is the watchdog system.) But these outcomes obtain because individual ML components are trying to bring them about, and so it still makes sense to intervene on the motivations of individual components in order to avoid these bad outcomes.

For example, carrying out and concealing a Ponzi scheme involves many actions that are taken because they successfully conceal the deception (e.g. you need to organize a financial statement carefully to deflect attention from an auditor), by a particular machine (e.g. an automated report-preparation system which is anticipating the consequences of emitting different possible reports) which is trying to carry out that deception (in the sense of considering many possible actions and selecting those that successfully deceive), despite being able to predict that the user will ultimately say that this was contrary to their preferences.

(b): these systems became misaligned because they are an implementation of an algorithm (the "unaligned benchmark") that seems unlikely to produce aligned systems. They were deployed because they were often useful despite their misalignment. They weren't replaced by aligned versions because we didn't know of any alternative algorithm that was similarly useful (and many unspecified alignment efforts have apparently failed). I do think we could have avoided this story in many different ways, and so you could highlight any of those as a causal factor (the story highlights none): we could have figured out how to build aligned systems, we could have anticipated the outcome and made deals to avoid it, more institutions could be managed by smarter or more forward-looking decision-makers, we could have a strong sufficiently competent world government, etc.

Another (outer) alignment failure story

In this story, what is preventing humans from going collectively insane due to nations, political factions, or even individuals blasting AI-powered persuasion/propaganda at each other? (Maybe this is what you meant by "people yelling at each other"?)

It seems like the AI described in this story is still aligned enough to defend against AI-powered persuasion (i.e. by the time that AI is sophisticated enough to cause that kind of trouble, most people are not ever coming into contact with adversarial content).

Why don't AI safety researchers try to leverage AI to improve AI alignment, for example implementing DEBATE and using that to further improve alignment, or just an adhoc informal version where you ask various AI advisors to come up with improved alignment schemes and to critique/defend each others' ideas?

I think they do, but it's not clear whether any of them change the main dynamic described in the post.

(My expectation is that we end up with one or multiple sequences of "improved" alignment schemes that eventually lock in wrong solutions to some philosophical or metaphilosophical problems, or has some other problem that is much subtler than the kind of outer alignment failure described here.)

I'd like to have a human society that is free to grow up in a way that looks good to humans, and which retains enough control to do whatever they decide is right down the line (while remaining safe and gradually expanding the resources available to them for continued growth). When push comes to shove I expect most people to strongly prefer that kind of hope (vs one that builds a kind of AI that will reach the right conclusions about everything), not on the basis of sophisticated explicit reasoning but because that's the only path that can really grow out of the current trajectory in a way that's not locally super objectionable to lots of people, and so I'm focusing on people's attempts and failures to construct such an AI.

I don't know exactly what kind of failure you are imagining is locked in, that pre-empts or avoids the kind of failure described here. Maybe you think it doesn't pre-empt this failure, but that you expect we probably can solve the immediate problem described in this post and then get screwed by a different problem down the line. If so, then I think I agree that this story is a little bit on the pessimistic side w.r.t. the immediate problem, although I may disagree about how pessimistic it is. (Though there's still a potentially-larger disagreement about just how bad the situation is after solving that immediate problem.)

(You might leave great value on the table from e.g. not bargaining with the simulators early enough and so getting shut off, or not bargaining with each other before you learn facts that make such bargains impossible and so permanently leaving value on the table, but this is not a story about that kind of failure and indeed those happen in parallel with the failure in this story.)

My research methodology

As a result, the model learned heuristic H, that works in all the circumstances you did consider, but fails in circumstance C.

That's basically where I start, but then I want to try to tell some story about why it kills you, i.e. what is it about the heuristic H and circumstance C that causes it to kill you?

I agree this involves discretion, and indeed moving beyond the trivial story "The algorithm fails and then it turns out you die" requires discretion, since such stories are certainly plausible. The other extreme would be to require us to keep making the story more and more concrete until we had fully specified the model, which also seems intractable. So instead I'm doing something in between, which is roughly: I'm allowed to push on the story to make it more concrete along any axis, but I recognize that I won't have time to pin down every axis, so I'm only going to do this a bounded number of times before I have to admit that the story seems plausible enough. (So I can't fill in a billion parameters of my model one by one this way; what's worse, filling in those parameters would take even more than a billion steps, and so this may become intractable even before you get to a billion.)