All of JesseClifton's Comments + Replies

A few thoughts on this part:

I guess [coordination failures between AIs] feels like mainly the type of thing that we can outsource to AIs, once they’re sufficiently capable. I don’t see a particularly strong reason to think that systems that are comparably powerful as humans, or more powerful than humans, are going to make obvious mistakes in how they coordinate. You have this framing of AI coordination. We could also just say politics, right? Like we think that geopolitics is going to be hard in a world where AIs exist. And when you have that framing, y

... (read more)

We are now using a new definition of s-risks. I've edited this post to reflect the change.

New definition:

S-risks are risks of events that bring about suffering in cosmically significant amounts. By “significant”, we mean significant relative to expected future suffering.

Note that it may turn out that the amount of suffering that we can influence is dwarfed by suffering that we can’t influence. By “expectation of suffering in the future” we mean “expectation of action-relevant suffering in the future”.

Ok, thanks for that. I’d guess then that I’m more uncertain than you about whether human leadership would delegate to systems who would fail to accurately forecast catastrophe.

It’s possible that human leadership just reasons poorly about whether their systems are competent in this domain. For instance, they may observe that their systems perform well in lots of other domains, and incorrectly reason that “well, these systems are better than us in many domains, so they must be better in this one, too”. Eagerness to deploy before a more thorough investigation... (read more)

The US and China might well wreck the world by knowingly taking gargantuan risks even if both had aligned AI advisors, although I think they likely wouldn't.

But what I'm saying is really hard to do is to make the scenarios in the OP (with competition among individual corporate boards and the like) occur without extreme failure of 1-to-1 alignment

I'm not sure I understand yet. For example, here’s a version of Flash War that happens seemingly without either the principals knowingly taking gargantuan risks or extreme intent-alignment failure.

  1. The pri

... (read more)
Mainly such complete (and irreversible!) delegation to such incompetent systems being necessary or executed. If AI is so powerful that the nuclear weapons are launched on hair-trigger without direction from human leadership I expect it to not be awful at forecasting that risk. You could tell a story where bargaining problems lead to mutual destruction, but the outcome shouldn't be very surprising on average, i.e. the AI should be telling you about it happening with calibrated forecasts.

Yeah I agree the details aren’t clear. Hopefully your conditional commitment can be made flexible enough that it leaves you open to being convinced by agents who have good reasons for refusing to do this world-model agreement thing. It’s certainly not clear to me how one could do this. If you had some trusted “deliberation module”, which engages in open-ended generation and scrutiny of arguments, then maybe you could make a commitment of the form “use this protocol, unless my counterpart provides reasons which cause my deliberation module to be convinced o... (read more)

1Daniel Kokotajlo3y
I think we are on the same page then. I like the idea of a deliberation module; it seems similar to the "moral reasoning module" I suggested a while back. The key is to make it not itself a coward or bully, reasoning about schelling points and universal principles and the like instead of about what-will-lead-to-the-best-expected-outcomes-given-my-current-credences.

It seems like we can kind of separate the problem of equilibrium selection from the problem of “thinking more”, if “thinking more” just means refining one’s world models and credences over them. One can make conditional commitments of the form: “When I encounter future bargaining partners, we will (based on our models at that time) agree on a world-model according to some protocol and apply some solution concept (e.g. Nash or Kalai-Smorodinsky) to it in order to arrive at an agreement.”

The set of solution concepts you commit to regarding as acceptable stil... (read more)

2Daniel Kokotajlo3y
If I read you correctly, you are suggesting that some portion of the problem can be solved, basically -- that it's in some sense obviously a good idea to make a certain sort of commitment, e.g. "When I encounter future bargaining partners, we will (based on our models at that time) agree on a world-model according to some protocol and apply some solution concept (e.g. Nash or Kalai-Smorodinsky) to it in order to arrive at an agreement.” So the commitment races problem may still exist, but it's about what other commitments to make besides this one, and when. Is this a fair summary? I guess my response would be "On the object level, this seems like maybe a reasonable commitment to me, though I'd have lots of questions about the details. We want it to be vague/general/flexible enough that we can get along nicely with various future agents with somewhat different protocols, and what about agents that are otherwise reasonable and cooperative but for some reason don't want to agree on a world-model with us? On the meta level though, I'm still feeling burned from the various things that seemed like good commitments to me and turned out to be dangerous, so I'd like to have some sort of stronger reason to think this is safe."

Nice post! I’m excited to see more attention being paid to multi-agent stuff recently.

A few miscellaneous points:

  • I get the impression that the added complexity of multi- relative to single-agent systems has not been adequately factored into folks’ thinking about timelines / the difficulty of making AGI that is competent in a multipolar world. But I’m not confident in that.

  • I think it’s possible that conflict / bargaining failure is a considerable source of existential risk, in addition to suffering risk. I don’t really have a view on how it compares t

... (read more)

Neat post, I think this is an important distinction. It seems right that more homogeneity means less risk of bargaining failure, though I’m not sure yet how much.

Cooperation and coordination between different AIs is likely to be very easy as they are likely to be very structurally similar to each other if not share basically all of the same weights

In what ways does having similar architectures or weights help with cooperation between agents with different goals? A few things that come to mind:

  • Having similar architectures might make it easier for agen
... (read more)
3Evan Hubinger3y
Glad you liked the post! Importantly, I think this moves you from a human-misaligned AI bargaining situation into more of a human-human (with AI assistants) bargaining situation, which I expect to work out much better, as I don't expect humans to carry out crazy threats to the same extent as a misaligned AI might. I find the prospect of multiple independent mesa-optimizers inside of the same system relatively unlikely. I think this could basically only happen if you were building a model that was built of independently-trained pieces rather than a single system trained end-to-end, which seems to be not the direction that machine learning is headed in—and for good reason, as end-to-end training means you don't have to learn the same thing (such as optimization) multiple times.

Makes sense. Though you could have deliberate coordinated training even after deployment. For instance, I'm particularly interested in the question of "how will agents learn to interact in high stakes circumstances which they will rarely encounter?" One could imagine the overseers of AI systems coordinating to fine-tune their systems in simulations of such encounters even after deployment. Not sure how plausible that is though.

1Daniel Kokotajlo3y
I totally agree it could be done, I'm just saying I think it won't happen without special effort on our part, probably. Rivals are suspicious of each other, and would probably be suspicious of a proposal like this coming from their rival. If they are even concerned about the problem it is trying to fix at all.

The new summary looks good =) Although I second Michael Dennis' comment below, that the infinite regress of priors is avoided in standard game theory by specifying a common prior. Indeed the specification of this prior leads to a prior selection problem.

The formality of "priors / equilibria" doesn't have any benefit in this case (there aren't any theorems to be proven)

I’m not sure if you mean “there aren’t any theorems to be proven” or “any theorem that’s proven in this framework would be useless”. The former is false, e.g. there are things to prove ab

... (read more)
1Michael Dennis4y
Just to make sure that I was understood, I was also pointing out that "you can have a well-specified Bayesian belief over your partner" even without agreeing on a common prior, as long as you agree on a common set of possibilities or something effectively similar. This means that talking about "Bayesian agents without a common prior" is well-defined. When there is not a common prior, this lead to an arbitrarily deep nesting of beliefs, but they are all well-defined. I can refer to "what A believes that B believes about A" without running into Russell's Paradox. When the priors mis-match then the entire hierarchy of these beliefs might be useful to reason about, but when there is a common prior, it allows much of the hierarchy to collapse.
3Rohin Shah4y
I think once you settle on a "simple" welfare function, it is possible that there are _no_ Nash equilibria such that the agents are optimizing that welfare function (I don't even really know what it means to optimize the welfare function, given that you have to also punish the opponent, which isn't an action that is useful for the welfare function). Hmm, I meant one thing and wrote another. I meant to say "there aren't any theorems proven in this post".

both players want to optimize the welfare function (making it a collaborative game)

The game is collaborative in the sense that a welfare function is optimized in equilibrium, but the principals will in general have different terminal goals (reward functions) and the equilibrium will be enforced with punishments (cf. tit-for-tat).

the issue is primarily that in a collaborative game, the optimal thing for you to do depends strongly on who your partner is, but you may not have a good understanding of who your partner is, and if you're wrong you can do arb

... (read more)
2Rohin Shah4y
Ah, I misunderstood your post. I thought you were arguing for problems conditional on the principals agreeing on the welfare function to be optimized, and having common knowledge that they were designing agents that optimize that welfare function. I mean, in this case you just deploy one agent instead of two. Even under the constraint that you must deploy two agents, you exactly coordinate their priors / which equilibria they fall into. To get prior / equilibrium selection problems, you necessarily need to have agents that don't know who their partner is. (Even if just one agent knows who the partner is, outcomes should be expected to be relatively good, though not optimal, e.g. if everything is deterministic, then threats are never executed.) ---- Looking at these objections, I think probably what you were imagining is a game where the principals have different terminal goals, but they coordinate by doing the following: * Agreeing upon a joint welfare function that is "fair" to the principals. In particular, this means that they agree that they are "licensed" to punish actions that deviate from this welfare function. * Going off and building their own agents that optimize the welfare function, but make sure to punish deviations (to ensure that the other principal doesn't build an agent that pursues the principal's goals instead of the welfare function) New planned summary: New opinion: Btw, some reasons I prefer not using priors / equilibria and instead prefer just saying "you don't know who your partner is": * It encourages solutions that take advantage of optimality and won't actually work in the situations we actually face. * The formality of "priors / equilibria" doesn't have any benefit in this case (there aren't any theorems to be proven). The one benefit I see is that it signals that "no, even if we formalize it, the problem doesn't go away", to those people who think that once formalized sufficiently all problems go away via the magic of Bayesia

Chimpanzees, crows, and dolphins are capable of impressive feats of higher intelligence, and I don’t think there’s any particular reason to think that Neanderthals are capable of doing anything qualitatively more impressive

This seems like a pretty cursory treatment of what seems like quite a complicated and contentious subject. A few possible counterexamples jump to mind. These are just things I remember coming across when browsing cognitive science sources over the years.

... (read more)

In model-free RL, policy-based methods choose policies by optimizing a noisy estimate of the policy's value. This is analogous to optimizing a noisy estimate of prediction accuracy (i.e., accuracy on the training data) to choose a predictive model. So we often need to trade variance for bias in the policy-learning case (i.e., shrink towards simpler policies) just as in the predictive modeling case.

Maybe pedantic but, couldn't we just look at the decision process as a sequence of episodes from the POMDP, and formulate the problem in terms of the regret incurred by our learning algorithm in this decision process? In particular, if catastrophic outcomes (i.e., ones which dominate the total regret) are possible, then a low-regret learning algorithm will have to be safe while still gathering some information that helps in future episodes. (On this view, the goal of safe exploration research is the same as the goal of learning generally: design low-regret

... (read more)

Do you think focusing on s-risks leads to meaningfully different technical goals than focusing on other considerations?

I think it definitely leads to a difference in prioritization among the things one could study under the broad heading of AI safety. Hopefully this will be clear in the body of the agenda. And, some considerations around possible downsides of certain alignment work might be more salient to those focused on s-risk; the possibility that attempts at alignment with human values could lead to very bad “near misses” is an example. (I think so

... (read more)
2Matthew "Vaniver" Gray4y
Thanks, that helps! Cool; if your deliberations include examples, it might be useful to include them if you end up writing an explanation somewhere.