Eight claims about multi-agent AGI safety

I for one am convinced!

I'm curious in particular about the conjecture that deception etc. arose in us thanks to our multi-agent evolutionary training. Honesty, too, arose that way. So I'm not sure whether (say) a system trained to answer questions in such a way that the humans watching it give reward would be more or less likely to be deceptive. I lean towards optimism.

As for 7, I'm surprised that Critch uses the flash crash as his example. If I were to argue for 7 I'd talk about how human states nearly caused nuclear MAD on several occasions, and how multiple AIs aligned with multiple humans would be relevantly similar--yes, better intelligence and coordination abilities maybe, but also much more powerful world-destroying tech. Could end up in a vulnerable world, where there are thousands of AI-human actors any one of which could destroy the world.

[-]Davidmanheim5y21

Honesty, too, arose that way. So I'm not sure whether (say) a system trained to answer questions in such a way that the humans watching it give reward would be more or less likely to be deceptive.

I think this is mistaken. (Or perhaps I don't understand a key claim / assumption.)

Honesty evolved as a group dynamic, where it was beneficial for the group to have ways for individuals to honestly commit, or make lying expensive in some way. That cooperative pressure dynamic does not exist when a single agent is "evolving" on its own in an effectively static environment of humans. It does exist in a co-evolutionary multi-agent dynamic - so there is at least some reason for optimism within a multi-agent group, rather than between computational agents and humans - but the conditions for cooperation versus competition seem at least somewhat fragile.

[-]Daniel Kokotajlo5y10

I'm confused because the stuff you wrote in the paragraph seems like an expanded version of what I think. In other words it supports what I said rather than objects to it.

[-]Davidmanheim5y20

My point was that deception will almost certainly outperform honesty/cooperation when AI is interacting with humans, and, on reflection, seems likely to do so by default even when interacting with other AIs, because there is no group selection pressure.

[-]Daniel Kokotajlo5y40

I think I was thinking that in multi-agent training environments there might actually be group selection pressure for honesty. (Or at least, there might be whatever selection pressures produced honesty in humans, even if that turns out to be something other than group selection.)

[-]Davidmanheim5y20

Selection in humans is via mutation, so that closely related organisms can get a benefit from cooperating, even at the cost of personally not replicating. As a JBS Haldane quote puts it, "I would gladly give up my life for two brothers, or eight cousins."

Continuing from that paper, which explains it better than I could:

"What is more interesting, it is only in such small populations that natural selection would favour the spread of genes making for certain kinds of altruistic behaviour. Let us suppose that you carry a rare gene which affects your behaviour so that you jump into a river and save a child, but you have one chance in ten of being drowned, while I do not possess the gene, and stand on the bank and watch the child drown.

If the child is your own child or your brother or sister, there is an even chance that the child will also have the gene, so five such genes will be saved in children for one lost in an adult. If you save a grandchild or nephew the advantage is only two and a half to one. If you only save a first cousin, the effect is very slight. If you try to save your first cousin once removed the population is more likely to lose this valuable gene than to gain it."
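The arithmetic in the quoted passage can be checked with a short sketch. This is only an illustration: the function and variable names are hypothetical, the one-in-ten drowning risk is taken from the quote, and the relatedness values are the standard coefficients (1/2, 1/4, 1/8, 1/16), used here on the assumption that for a rare gene they approximate the chance that the rescued relative also carries it.

```python
from fractions import Fraction

# Rescuer's chance of drowning, as stated in the quoted passage.
DROWN_RISK = Fraction(1, 10)

# Assumed chance that the rescued relative carries the same rare gene
# (standard coefficients of relatedness; an approximation for rare genes).
RELATEDNESS = {
    "child or sibling": Fraction(1, 2),
    "grandchild or nephew": Fraction(1, 4),
    "first cousin": Fraction(1, 8),
    "first cousin once removed": Fraction(1, 16),
}


def genes_saved_per_gene_lost(relatedness, risk=DROWN_RISK):
    """Expected copies of the gene saved per copy lost to drowning."""
    return relatedness / risk


for relative, r in RELATEDNESS.items():
    ratio = genes_saved_per_gene_lost(r)
    verdict = "favoured" if ratio > 1 else "disfavoured"
    print(f"{relative}: {float(ratio):.3f} to 1 ({verdict})")
```

Under these assumptions the ratios come out at 5, 2.5, 1.25 and 0.625 to 1, matching the quote: a strong advantage for close kin, a slight one for a first cousin, and a net loss beyond that.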

[-]Daniel Kokotajlo5y10

Right, so... we need to make sure selection in AIs also has that property? Or is the thought that even if AIs evolve to be honest, it'll only be with other AIs and not with humans?

As an aside, I'm interested to see more explanations for altruism lined up side by side and compared. I just finished reading a book that gave a memetic/cultural explanation rather than a genetic one.

[-]Rohin Shah5y40

Planned summary for the Alignment Newsletter:

This post clearly states eight claims about multiagent AGI safety, and provides brief arguments for each of them. Since the post is itself basically a summary, I won’t go into detail here.

[-]Davidmanheim5y40

Another possible argument is that we can't tell when multiple AIs are failing or subverting each other.
Agents pursuing their own goals in a multi-agent environment are intrinsically manipulative, and when agents manipulate one another, it happens in ways that we do not know how to detect or reason about. This is somewhat different from when they manipulate humans, where we have a clear idea of what does and does not qualify as harmful manipulation.

[-]Rohin Shah5y30

There are quite a few arguments for why we should move beyond the standard single-AGI safety paradigm.

Fwiw, I would classify all of 5-8 as reasons that AI governance should care about multiple AI systems (which it always has); I don't see why they require technical AI alignment research to move beyond the single-AGI paradigm.

(Here "AI alignment" is the problem of "how do you ensure that your AI system is not adversarially optimizing against you", and not making any claims about what other AI systems will do.)

[-]Richard_Ngo5y40

I'd say that each of #5-#8 changes the parts of "AI alignment" that you focus on. For example, you may be confident that your AI system is not optimising against you, without being confident that 1000 copies of your AI system working together won't be optimising against you. Or you might be confident that your AI system won't do anything dangerous in almost all situations, but no longer confident once you realise that threats are adversarially selected to be extreme.

Whether you count these shifts as "moving beyond the standard paradigm" depends, I guess, on how much they change alignment research in practice. It seems like proponents of #7 and #8 believe that, conditional on those claims, alignment researchers' priorities should shift significantly. And #5 has already contributed to a shift away from the agent foundations paradigm. On the other hand, I'm a proponent of #6, and I don't currently believe that this claim should significantly change alignment research (although maybe further thought will identify some ways).

I think I'll edit the line you quoted to say "beyond standard single-AGI safety paradigms" to clarify that there's no single paradigm everyone buys into.

[-]Rohin Shah5y30

Whether you count these shifts as "moving beyond the standard paradigm" depends, I guess, on how much they change alignment research in practice. It seems like proponents of #7 and #8 believe that, conditional on those claims, alignment researchers' priorities should shift significantly.

I would say that proponents of #7 and #8 believe that longtermists' priorities should shift significantly (in the case of #8, might just be negative utilitarians). They are proposing that we focus on other problems that are not AI alignment (as I defined it above).

This might just be a semantic disagreement, but I do think it's an important point -- I wouldn't want people to say things like "people argue that it will become easier to engineer biological weapons than to build AGI, and therefore biosecurity is more important. Thus we need to move beyond the AGI paradigm to the emerging technologies paradigm". Like, it's correct, but it is creating too much generality; it is important to be able to focus on specific problems and make claims about those problems. Arguments 7-8 feel to me like "look, there's this other problem besides AI alignment that might be more important"; I don't deny that this could change what you do, but it doesn't change what the field of AI alignment should do.

(You might say that you were talking about AI safety generally, and not AI alignment, but then I dispute that AI safety ever had a "single-AGI" paradigm; people have been talking about multipolar outcomes for a long time.)

And #5 has already contributed to a shift away from the agent foundations paradigm.

Yes, but not to a multiagent paradigm, which I thought was your main claim.

[-]Richard_Ngo5y40

This all seems straightforwardly correct, so I've changed the line in question accordingly. Thanks for the correction :)

One caveat: technical work to address #8 currently involves either preventing AGIs from being misaligned in ways that lead them to make threats, or preventing AGIs from being aligned in ways which make them susceptible to threats. The former seems to qualify as an aspect of the "alignment problem", the latter not so much. I should have used the former as an example in my original reply to you, rather than using the latter.

[-]habryka5y30

I found this quite compelling. I don't think I am sold on some of the things yet (in particular claims 5 and 6), but thanks a lot for writing this up so clearly. I will definitely take some time to think more about this.

[-]Sammy Martin5y30

Humans have skills and motivations (such as deception, manipulation and power-hungriness) which would be dangerous in AGIs. It seems plausible that the development of many of these traits was driven by competition with other humans, and that AGIs trained to answer questions or do other limited-scope tasks would be safer and less goal-directed. I briefly make this argument here.
Note that he claims that this may be true even if single/single alignment is solved, and all AGIs involved are aligned to their respective users.

It strikes me as interesting that much of the existing work that's been done on multiagent training, such as it is, focusses on just examining the behaviour of artificial agents in social dilemmas. The thinking seems to be - and this was also suggested in ARCHES - that it's useful just for exploratory purposes to try to characterise how and whether RL agents cooperate in social dilemmas, what mechanism designs and what agent designs promote what types of cooperation, and if there are any general trends in terms of what kinds of multiagent failures RL tends to fall into.

For example, it's generally known that regular RL tends to fail to cooperate in social dilemmas: 'Unfortunately, selfish MARL agents typically fail when faced with social dilemmas'. From ARCHES:

One approach to this research area is to continually examine social dilemmas through the lens of whatever is the leading AI development paradigm in a given year or decade, and attempt to classify interesting behaviors as they emerge. This approach might be viewed as analogous to developing “transparency for multi-agent systems”: first develop interesting multi-agent systems, and then try to understand them.

There seems to be an implicit assumption here that something very important and unique to multiagent situations would be uncovered, by analogy to things like the flash crash. It's not clear to me that we've examined the intersection of RL and social dilemmas enough to notice this if it were true, and I think that's the major justification for working on this area.
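To make the social-dilemma failure mode concrete, here is a minimal sketch. It is not an RL experiment and is not taken from ARCHES or any paper cited here: it uses a one-shot prisoner's dilemma with made-up payoffs, and myopic best-response dynamics as a stand-in for selfish learners. All names are hypothetical.

```python
# Illustrative payoffs (assumed): temptation 5 > reward 3 > punishment 1 > sucker 0.
# Keys are (row action, column action); values are (row payoff, column payoff).
PAYOFF = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}
ACTIONS = ("C", "D")


def best_response(opponent_action, player):
    """Action maximising `player`'s own payoff against a fixed opponent action."""
    def own_payoff(action):
        if player == 0:
            profile = (action, opponent_action)
        else:
            profile = (opponent_action, action)
        return PAYOFF[profile][player]
    return max(ACTIONS, key=own_payoff)


def selfish_dynamics(start, rounds=10):
    """Players take turns switching to their myopic best response."""
    a, b = start
    for _ in range(rounds):
        a = best_response(b, player=0)
        b = best_response(a, player=1)
    return a, b


outcome = selfish_dynamics(("C", "C"))
print(outcome, PAYOFF[outcome])
```

Even when both players start out cooperating, each one's selfish update leads to defection, and they settle on mutual defection with payoffs (1, 1) instead of (3, 3). Whether RL agents in richer environments fail in this way or in less obvious ways is exactly the open question raised above.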

[-]Davidmanheim5y20

Strongly agree that it's unclear that these failures would be detected.
For discussion and examples, see my paper here: https://www.mdpi.com/2504-2289/3/2/21/htm

[-]JesseClifton5y10

Nice post! I’m excited to see more attention being paid to multi-agent stuff recently.

A few miscellaneous points:

I get the impression that the added complexity of multi- relative to single-agent systems has not been adequately factored into folks’ thinking about timelines / the difficulty of making AGI that is competent in a multipolar world. But I’m not confident in that.
I think it’s possible that conflict / bargaining failure is a considerable source of existential risk, in addition to suffering risk. I don’t really have a view on how it compares to other sources, but I’d guess that it is somewhat underestimated, because of my impression that folks generally underestimate the difficulty of getting agents to get along (even if they are otherwise highly competent).

[-]adamShimi5y10

Thanks for writing this post! I usually focus on single/single scenarios, so it's nice to have a clear split of the multi-agent safety issues.

All claims make sense to me, with 1 being the one I'm least convinced about, and 5 depending on continuous takeoffs (which appear relatively likely to me as of now).

AI ALIGNMENT FORUM