
It seems likely to me that AIs will be able to coordinate with each other much more easily (i.e., at lower cost and greater scale) than humans currently can, for example by merging into coherent unified agents by combining their utility functions. This has been discussed at least since 2009, but I'm not sure its implications have been widely recognized. In this post I talk about two such implications that occurred to me relatively recently.

I was recently reminded of this quote from Robin Hanson's Prefer Law To Values:

The later era when robots are vastly more capable than people should be much like the case of choosing a nation in which to retire. In this case we don’t expect to have much in the way of skills to offer, so we mostly care that they are law-abiding enough to respect our property rights. If they use the same law to keep the peace among themselves as they use to keep the peace with us, we could have a long and prosperous future in whatever weird world they conjure. In such a vast rich universe our “retirement income” should buy a comfortable if not central place for humans to watch it all in wonder.

Robin argued that this implies we should work to make it more likely that our current institutions, like laws, survive into the AI era. But (aside from the problem that we're most likely still incurring astronomical waste even if many humans survive "in retirement"), assuming that AIs will have the ability to coordinate amongst themselves by doing something like merging their utility functions, there will be no reason for them to use laws (much less "the same laws") to keep the peace among themselves. So the first implication is that, to the extent that AIs are likely to have this ability, working in the direction Robin suggested would likely be futile.

The second implication is that AI safety/alignment approaches that aim to preserve an AI's competitiveness must also preserve its ability to coordinate with other AIs, since that is likely an important part of its competitiveness. For example, making an AI corrigible in the sense of allowing a human to shut it (and its successors/subagents) down or change how it functions would seemingly make it impossible for this AI to merge with another AI that is not corrigible, or not corrigible in the same way. (I've mentioned this a number of times in previous comments, as a reason why I'm pessimistic about specific approaches, but I'm not sure if others have picked up on it, or agree with it, as a general concern, which partly motivates this post.)

Questions: Do you agree AIs are likely to have the ability to coordinate with each other at low cost? What other implications does this have, especially for our strategies for reducing x-risk?


I very much agree. Historical analogies:

To a tiger, human hunter-gatherers must be frustrating and bewildering in their ability to coordinate. "What the hell? Why are they all pouncing on me when I jumped on the little one? The little one is already dead anyway, and they are risking their own lives now for nothing! Dammit, gotta run!"

To a tribe of hunter-gatherers, farmers must be frustrating and bewildering in their ability to coordinate. "What the hell? We pillaged and slew that one village real good, they sure didn't have enough warriors left over to chase us down... why are the neighboring villages coming after us? And what's this--they have professional soldiers with fancy equipment riding horses? Somehow hundreds--no, thousands--of farmers cooperated over a period of several years to make this punitive expedition possible! How were we to know they would go to such lengths?"

To the nations colonized by the Europeans, it must have been pretty interesting how the Europeans were so busy fighting each other constantly, yet somehow managed to more or less peacefully divide up Africa, Asia, South America, etc. to be colonized between them. Take the Opium Wars and the Boxer Rebellion for example. I could imagine a Hansonian prophet in a Native American tribe saying something like "Whatever laws the European nations use to keep the peace among themselves, we will benefit from them also; we'll register as a nation, sign treaties and alliances, and rely on the same balance of power." He would have been disastrously wrong.

I expect something similar to happen with us humans and AGI, if there are multiple AGI. "What? They all have different architectures and objectives, not to mention different users and owners... we even explicitly told them to compete with each other! Why are they doing X.... noooooooo...." (Perhaps they are competing with each other furiously, even fighting each other. Yet somehow they'll find a way to cut us out of whatever deal they reach, just as European powers so often did for their various native allies.)

It's also interesting that all of these groups were able to coordinate to the disadvantage of less coordinated groups, but not able to reach peace among themselves.

One explanation might be that the more coordinated groups also have harder coordination problems to solve because their world is bigger and more complicated. Might be the same with AI?

I wonder also if the conflicts that remain are nevertheless more peaceful. When hunter-gatherer tribes fight each other, they often murder all the men and enslave the women, or so I hear. Similar things happened with farmer societies sometimes, but also sometimes they just become new territories and have to pay tribute and levy conscripts and endure the occasional pillage. And then industrialized modern nations even have rules about how you can't rape and pillage and genocide and sell into slavery the citizens of your enemy. Perhaps AI conflicts would be even more peaceful. For example, perhaps they would look something more like fancy maneuvers, propaganda, and hacking, with swift capitulation by the "checkmated" AI, which is nevertheless allowed to continue existing with some smaller amount of influence over the future. Perhaps no property would even be destroyed in the entire war!

Just spitballing here. I feel much less confident in this trend than in the trend I pointed out above.

I'd like to push back on the assumption that AIs will have explicit utility functions. Even if you think that sufficiently advanced AIs will behave in a utility-maximising way, their utility functions may be encoded in a way that's difficult to formalise (e.g. somewhere within a neural network).

It may also be the case that coordination is much harder for AIs than for humans. For example, humans are constrained by having bodies, which makes it easier to punish defection - hiding from the government is tricky! Our bodies also make anonymity much harder. Whereas if you're a piece of code which can copy itself anywhere in the world, reneging on agreements may become relatively easy. Another reason why AI cooperation might be harder is simply that AIs will be capable of a much wider range of goals and cognitive processes than humans, and so they may be less predictable to each other and/or have less common ground with each other.

I’d like to push back on the assumption that AIs will have explicit utility functions.

Yeah I was expecting this, and don't want to rely too heavily on such an assumption, which is why I used "for example" everywhere. :)

Even if you think that sufficiently advanced AIs will behave in a utility-maximising way, their utility functions may be encoded in a way that’s difficult to formalise (e.g. somewhere within a neural network).

I think they don't necessarily need to formalize their utility functions, just isolate the parts of their neural networks that encode those functions. And then you could probably take two such neural networks and optimize for a weighted average of the outputs of the two utility functions. (Although for bargaining purposes, to determine the weights, they probably do need to look inside the neural networks somehow and can't just treat them as black boxes.)
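To make the merging idea concrete, here is a minimal sketch, with plain Python functions standing in for the isolated utility-encoding parts of two networks. All names and numbers are hypothetical illustration; real isolated "utility modules" would be far messier, and the weights would come out of bargaining.

```python
def utility_a(outcome):
    # Agent A cares only about resource x.
    return outcome["x"]

def utility_b(outcome):
    # Agent B cares only about resource y.
    return outcome["y"]

def merged_utility(outcome, weight_a=0.5):
    # The merged agent optimizes a weighted average of the two utilities;
    # the weights are whatever the bargaining process settled on.
    return weight_a * utility_a(outcome) + (1 - weight_a) * utility_b(outcome)

# The merged agent picks whichever feasible outcome maximizes the blend.
outcomes = [
    {"x": 10, "y": 0},  # all resources to A's goal
    {"x": 0, "y": 10},  # all resources to B's goal
    {"x": 7, "y": 7},   # gains from trade: cooperation unlocks surplus
]
best = max(outcomes, key=merged_utility)
```

Note that the merged agent chooses the cooperative outcome `{"x": 7, "y": 7}` (blended value 7) over either all-or-nothing outcome (blended value 5), which is the sense in which merging captures gains from coordination.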

Or are you thinking that the parts encoding the utility function are so intertwined with the rest of the AI that they can't be separated out, and that the difficulty of doing so increases with the intelligence of the AI, so that the AI remains unable to isolate its own utility function as it gets smarter? If so, it's not clear to me why there wouldn't be AIs with cleanly separable utility functions that are nearly as intelligent, which would outcompete AIs with non-separable utility functions because they can merge with each other and obtain the benefits of better coordination.

It may also be the case that coordination is much harder for AIs than for humans.

My reply to Robin Hanson seems to apply here. Did you see that?

A big obstacle to human cooperation is bargaining: deciding how to split the benefit from cooperation. If it didn't exist, I think humans would cooperate more. But the same obstacle also applies to AIs. Sure, there's a general argument that AIs will outperform humans at all tasks and bargaining too, but I'd like to understand it more specifically. Do you have any mechanisms in mind that would make bargaining easier for AIs?

A big obstacle to human cooperation is bargaining: deciding how to split the benefit from cooperation. If it didn’t exist, I think humans would cooperate more.

Can you give some examples of where human cooperation is mainly being stopped by difficulty with bargaining? It seems to me like enforcing deals is usually the bigger part of the problem. For example in large companies there are a lot of inefficiencies like shirking, monitoring costs to reduce shirking, political infighting, empire building, CYA, red tape, etc., which get worse as companies get bigger. It sure seems like enforcement (i.e., there's no way to enforce a deal where everyone agrees to stop doing these things) rather than bargaining is the main problem there.

Or consider the inefficiencies in academia, where people often focus more on getting papers published than on working on the most important problems. I think that's mainly because an agreement to reward people for publishing papers is easily enforceable, while an agreement to reward people for working on the most important problems isn't. I don't see how improved bargaining would solve this problem.

Do you have any mechanisms in mind that would make bargaining easier for AIs?

I haven't thought about this much, but perhaps if AIs had introspective access to their utility functions, that would make it easier for them to make use of formal bargaining solutions that take utility functions as inputs? Generally it seems likely that AIs will be better at bargaining than humans, for the same kind of reason as here, but AFAICT just making enforcement easier would probably suffice to greatly reduce coordination costs.
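One classic example of a formal bargaining solution that takes utility functions as inputs is the Nash bargaining solution: choose the feasible outcome that maximizes the product of each party's gain over its disagreement payoff. A toy sketch (all payoffs hypothetical):

```python
# Payoffs each side gets if bargaining fails (the "disagreement point").
disagreement = (1.0, 1.0)

# Hypothetical feasible outcomes, as (utility to 1, utility to 2) pairs.
outcomes = [(4.0, 2.0), (3.0, 3.0), (2.0, 4.0)]

def nash_product(o):
    # Product of gains over the disagreement point.
    return (o[0] - disagreement[0]) * (o[1] - disagreement[1])

best = max(outcomes, key=nash_product)
```

Here the symmetric split `(3.0, 3.0)` wins (product 4 vs. 3 for the lopsided outcomes). Agents with direct introspective access to their utility functions could, in principle, feed them into a procedure like this instead of haggling.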

Can you give some examples of where human cooperation is mainly being stopped by difficulty with bargaining?

Two kids fighting over a toy; a married couple arguing about who should do the dishes; war.

But now I think I can answer my own question. War only happens if two agents don't have common knowledge about who would win (otherwise they'd agree to skip the costs of war). So if AIs are better than humans at establishing that kind of common knowledge, that makes bargaining failure less likely.

War only happens if two agents don’t have common knowledge about who would win (otherwise they’d agree to skip the costs of war).

But that assumes a strong ability to enforce agreements (which humans typically don't have). For example, suppose it's common knowledge that if countries A and B went to war, A would conquer B with probability .9 and it would cost each side $1 trillion. If they could enforce agreements, then they could agree to roll a 10-sided die in place of the war and save $1 trillion each, but if they couldn't, then A would go to war with B anyway if it lost the roll, so now B has a .99 probability of being taken over. Alternatively, maybe B agrees to be taken over by A with certainty but gets some compensation to cover the .1 chance that it doesn't lose the war. But after taking over B, A could just expropriate all of B's property, including the compensation that it paid.
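The arithmetic in the example above can be checked directly (costs in trillions of dollars):

```python
# A beats B with probability 0.9; fighting costs each side 1.0 (trillion).
p_a_wins = 0.9
war_cost = 1.0

# With enforceable agreements: substitute a weighted die roll for the war.
# Each side keeps the same win probability and saves the war cost.
savings_per_side = war_cost

# Without enforcement: if A loses the roll (probability 0.1), it simply
# fights anyway and still wins with probability 0.9.
p_b_conquered = p_a_wins + (1 - p_a_wins) * p_a_wins
```

So without enforcement, B's probability of being taken over rises from 0.9 to 0.99, which is why the die-roll substitute only works if the loser can be held to the outcome.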

This paper by Critch is relevant: it argues that agents with different beliefs will bet their future share of a merged utility function, such that it skews towards whoever's predictions are more correct.

I had read that paper recently, but I think we can abstract away from the issue by saying that (if merging is a thing for them) AIs will use decision procedures that are "closed under" merging, just like we currently focus attention on decision procedures that are "closed under" self-modification. (I suspect that modulo logical uncertainty, which Critch's paper also ignores, UDT might already be such a decision procedure, in other words Critch's argument doesn't apply to UDT, but I haven't spent much time thinking about it.)

This post is excellent, in that it has a very high importance-to-word-count ratio. It'll take up only a page or so, but convey a very useful and relevant idea, and moreover ask an important question that will hopefully stimulate further thought.

Nominating along with its sequel. Makes an important point about competitiveness and coordination.

While I think this post isn't the best writeup of this topic I can imagine, I think it makes a really important point quite succinctly, and is one that I have brought up many times in arguments around takeoff speeds and risk scenarios since this post came out.

Summary: There are a number of differences between how humans cooperate and how hypothetical AI agents could cooperate, and these differences have important strategic implications for AI forecasting and safety. The first big implication is that AIs with explicit utility functions will be able to merge their values. This merging may have the effect of rendering laws and norms obsolete, since large conflicts would no longer occur. The second big implication is that our approaches to AI safety should preserve the ability for AIs to cooperate. This is because if AIs *don't* have the ability to cooperate, they might not be as effective, as they will be outcompeted by factions who can cooperate better.

Opinion: My usual starting point for future forecasting is to assume that AI won't alter any long term trends, and then update from there on the evidence. Most technologies haven't disrupted centuries-long trends in conflict resolution, which makes me hesitant to accept the first implication. Here, I think the biggest weakness in the argument is the assumption that powerful AIs should be described as having explicit utility functions. I still think that cooperation will be easier in the future, but it probably won't follow a radical departure from past trends.

Here, I think the biggest weakness in the argument is the assumption that powerful AIs should be described as having explicit utility functions.

I gave a more general argument in response to Robin Hanson, which doesn't depend on this assumption. Curious if you find that more convincing.

Thanks for the elaboration. You quoted Robin Hanson as saying

There are many other factors that influence coordination, after all; even perfect value matching is consistent with quite poor coordination.

My model says that this is about right. It generally takes a few more things for people to cooperate, such as common knowledge of perfect value matching, common knowledge of willingness to cooperate, and an understanding of the benefits of cooperation.

By assumption, AIs will become smarter than humans, which makes me think they will understand the benefits of cooperation better than we do. But this understanding won't be gained "all at once" but will instead be continuous with the past. This is essentially why I think cooperation will be easier in the future, but that it will more-or-less follow a gradual transition from our current trends (I think cooperation has been increasing globally in the last few centuries anyway, for similar reasons).

Abstracting away from the specific mechanism, as a more general argument, AI designers or evolution will (sooner or later) be able to explore a much larger region of mind design space than biological evolution could. Within this region there are bound to be minds much better at coordination than humans, and we should certainly expect coordination ability to be one objective that AI designers or evolution will optimize for since it offers a significant competitive advantage.

I agree that we will be able to search over a larger space of mind-design, and I also agree that this implies that it will be easier to find minds that cooperate.

I don't agree that cooperation necessarily allows you to have a greater competitive advantage. It's worth seeing why this is true in the case of evolution, as I think it carries over to the AI case. Naively, organisms that cooperate would always enjoy some advantages, since they would never have to fight for resources. However, this naive model ignores the fact that genes are selfish: if there is a way to reap the benefits of cooperation without having to pay the price of giving up resources, then organisms will pursue this strategy instead.

This is essentially the same argument that evolutionary game theorists have used to explain the evolution of aggression, as I understand it. Of course, there are some simplifying assumptions which could be worth disputing.

I don’t agree that cooperation necessarily allows you to have a greater competitive advantage. It’s worth seeing why this is true in the case of evolution, as I think it carries over to the AI case. Naively, organisms that cooperate would always enjoy some advantages, since they would never have to fight for resources. However, this naive model ignores the fact that genes are selfish: if there is a way to reap the benefits of cooperation without having to pay the price of giving up resources, then organisms will pursue this strategy instead.

I'm definitely not using the naive model which expects unilateral cooperation to become widespread. Instead when I say "cooperation" I typically have in mind some sort of cooperation that comes with a mechanism for ensuring that it's mutual or reciprocal. You can see this in the concrete example I gave in this linked post.

Have you thought at all about what merged utility function two AIs would agree on? I doubt it would be of the form $aU_1 + bU_2$.

Stuart Armstrong wrote a post that argued for merged utility functions of this form (plus a tie-breaker), but there are definitely things, like different priors and logical uncertainty, which the argument doesn't take into account, that make it unclear what the actual form of the utility function would be (or if the merged AI would even be doing expected utility maximization). I'm curious what your own reason for doubting it is.

One utility function might turn out much easier to optimize than the other, in which case the harder-to-optimize one will be ignored completely. Random events might influence which utility function is harder to optimize, so one can't necessarily tune the weights $a$ and $b$ in advance to take this into account.
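The point about the easier-to-optimize utility function dominating can be illustrated with a toy example (all numbers hypothetical): if $U_1$'s attainable values are far larger than $U_2$'s, a fixed-weight sum ends up ignoring $U_2$ entirely.

```python
# Equal weights agreed on in advance.
a, b = 0.5, 0.5

# Hypothetical feasible outcomes as (U1, U2) pairs. U1 turns out to be
# "easy" (huge attainable values); U2 is "hard" (small ceiling).
outcomes = [
    (1000, 0),  # pour everything into U1
    (0, 10),    # pour everything into U2
    (500, 5),   # split effort
]
best = max(outcomes, key=lambda o: a * o[0] + b * o[1])
# best is (1000, 0): U2 has no influence on the chosen outcome,
# even though it carries half the nominal weight.
```

No pre-agreed choice of `a` and `b` fixes this if which utility function is easier to optimize is only revealed by later random events.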

One of the reasons was the problem of positive affine scaling preserving behavior, but I see Stuart addresses that.

And actually, some of the reasons for thinking there would be more complicated mixing are going away as I think about it more.

EDIT: yeah, if they had the same priors and did unbounded reasoning, I wouldn't be surprised anymore if there exists an $aU_1 + bU_2$ that they would agree to.