Note: after putting this online, I noticed several problems with my original framing of the arguments. While I don't think they invalidated the overall conclusion, they did (ironically enough) make the post much less coherent. The version below has been significantly edited in an attempt to alleviate these issues.
Rohin Shah has recently criticised Eliezer’s argument that “sufficiently optimised agents appear coherent”, on the grounds that any behaviour can be rationalised as maximisation of the expectation of some utility function. In this post I dig deeper into this disagreement, concluding that Rohin is broadly correct, although the issue is more complex than he makes it out to be. Here’s Eliezer’s summary of his original argument:
Violations of coherence constraints in probability theory and decision theory correspond to qualitatively destructive or dominated behaviors. Coherence violations so easily computed as to be humanly predictable should be eliminated by optimization strong enough and general enough to reliably eliminate behaviors that are qualitatively dominated by cheaply computable alternatives. From our perspective this should produce agents such that, ceteris paribus, we do not think we can predict, in advance, any coherence violation in their behavior.
First we need to clarify what Eliezer means by coherence. He notes that there are many formulations of coherence constraints: restrictions on preferences which imply that an agent which obeys them is maximising the expectation of some utility function. I’ll take the standard axioms of VNM utility as one representative set of constraints. In this framework, we consider a set O of disjoint outcomes. A lottery is some assignment of probabilities to the elements of O such that they sum to 1. For any pair of lotteries, an agent can either prefer one to the other, or be indifferent between them; let P be the function (from pairs of lotteries to a choice between them) defined by these preferences. The agent is incoherent if P violates any of the following axioms: completeness, transitivity, continuity, and independence. Eliezer gives several examples of how an agent which violates these axioms can be money-pumped, which is an example of the “destructive or dominated” behaviour he mentions in the quote above. And by the VNM theorem, any agent which doesn’t violate these axioms has preferences which are equivalent to maximising the expectation of some utility function over O (a function mapping the outcomes in O to real numbers).
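To make this setup concrete, here is a minimal Python sketch of lotteries over a set O and of the preference function P induced by maximising expected utility. The outcome names, the utility values and all function names are made up for illustration; none of them come from Eliezer's or Rohin's posts.

```python
from itertools import permutations

# Hypothetical outcome set O with an illustrative utility function over it.
UTILITY = {"apple": 1.0, "banana": 2.0, "cherry": 4.0}


def expected_utility(lottery, utility=UTILITY):
    """A lottery maps outcomes to probabilities, which must sum to 1."""
    assert abs(sum(lottery.values()) - 1.0) < 1e-9
    return sum(p * utility[outcome] for outcome, p in lottery.items())


def prefers(a, b):
    """P(a, b): +1 if a is preferred, -1 if b is preferred, 0 if indifferent."""
    ua, ub = expected_utility(a), expected_utility(b)
    return (ua > ub) - (ua < ub)


def is_transitive(lotteries):
    """Check weak-preference transitivity over every ordered triple."""
    return all(
        not (prefers(a, b) >= 0 and prefers(b, c) >= 0) or prefers(a, c) >= 0
        for a, b, c in permutations(lotteries, 3)
    )


lotteries = [
    {"apple": 1.0},
    {"banana": 1.0},
    {"apple": 0.5, "cherry": 0.5},
]
```

Preferences generated this way cannot fail transitivity, because they just compare real numbers. The VNM theorem is the converse claim: any P satisfying all four axioms can be represented by some such utility function.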
It’s crucial to note that, in this setup, coherence is a property of an agent’s preferences at a single point in time. The outcomes that we are considering are all mutually exclusive, so an agent’s preferences over other outcomes are irrelevant after one outcome has already occurred. In addition, preferences are not observed but rather hypothetical: since outcomes are disjoint, we can’t actually observe the agent choosing a lottery and receiving a corresponding outcome (more than once).¹ And those hypothetical choices are always between known lotteries with fixed probabilities, rather than being based on our subjective probability estimates as they are in the real world. But Eliezer’s argument above makes use of a version of coherence which doesn't possess any of these traits: it is a property of the observed behaviour of agents with imperfect information, over time. VNM coherence is not well-defined in this setup, so if we want to formulate a rigorous version of this argument, we’ll need to specify a new definition of coherence which extends the standard instantaneous-hypothetical one.
A first step is to introduce the element of time, by changing the one-off choice between lotteries to repeated choices. A natural tool to use here is the Markov Decision Process (MDP) formalism: at each timestep, an agent chooses one of the actions available in its current state, which leads it to a new state according to a (possibly nondeterministic) transition function, resulting in a corresponding reward. We can think of our own world as an MDP (without rewards), in which a state is a snapshot of the entire universe at a given instant. We can then define a trajectory as a sequence of states and actions which goes from the starting state of an MDP to a terminal state. In the real world, this corresponds to a complete description of one way in which the universe could play out from beginning to end.
Here are two ways in which we could define an agent's preferences in the context of an MDP:
- Definition 1: the agent has preferences over states, and wants to spend its time in its preferred states, regardless of which order it visits them or what its past trajectory looked like. This is equivalent to the agent wanting to maximise the rewards it receives from some reward function defined over states.
- Definition 2: the agent's preferences are choices between lotteries over entire state-action trajectories it could take through the MDP. (In this case, we can ignore the rewards.)
Under both of these definitions, we can characterise incoherence in much the same way as in the classic VNM rationality setup, by evaluating the agent's preferences over outcomes. To be clear on the difference between them: under definition 1 an outcome is a state, one of which occurs every timestep, and a coherent agent's preferences over them are defined without reference to any past events. Under definition 2, by contrast, an outcome is an entire trajectory (composed of a sequence of states and actions), only one of which ever occurs, and a coherent agent’s preferences about the future may depend on what happened in the past in arbitrary ways. To see how this difference plays out in practice, consider the following example of non-transitive travel preferences: an agent which pays $50 to go from San Francisco to San Jose, then $50 to go from San Jose to Berkeley, then $50 to go from Berkeley to San Francisco (note that the money in this example is just a placeholder for anything the agent values). Under definition 1, the agent violates transitivity, and is incoherent. Under definition 2, it could just be that the agent prefers trajectories in which it travels round in a circle, compared with other available trajectories. Since Eliezer uses this situation as an example of incoherence, it seems like he doesn't intend preferences to be defined over trajectories. So let’s examine definition 1 in more detail.
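The contrast in the travel example can be sketched in code. This is only an illustration under simplifying assumptions: I treat each paid trip as revealing a strict preference for the destination over the origin, and the helper name find_preference_cycle is hypothetical. Under definition 1 the outcomes are cities, so the revealed preferences form a cycle. Under definition 2 each option is labelled by the whole history leading up to it, so no outcome ever recurs and no cycle can form.

```python
def find_preference_cycle(strict_prefs):
    """Return a list of outcomes forming a strict-preference cycle, or None.

    strict_prefs is an iterable of (better, worse) pairs.
    """
    graph = {}
    for better, worse in strict_prefs:
        graph.setdefault(better, []).append(worse)
        graph.setdefault(worse, [])

    done = set()

    def dfs(node, path, on_path):
        on_path.add(node)
        path.append(node)
        for nxt in graph[node]:
            if nxt in on_path:
                return path[path.index(nxt):]
            if nxt not in done:
                cycle = dfs(nxt, path, on_path)
                if cycle:
                    return cycle
        on_path.discard(node)
        path.pop()
        done.add(node)
        return None

    for start in graph:
        if start not in done:
            cycle = dfs(start, [], set())
            if cycle:
                return cycle
    return None


# Definition 1: outcomes are states (cities), regardless of history.
state_prefs = [
    ("San Jose", "San Francisco"),
    ("Berkeley", "San Jose"),
    ("San Francisco", "Berkeley"),
]

# Definition 2 (simplified): each option is labelled by the path that led to it.
history_prefs = [
    (("SF", "SJ"), ("SF", "SF")),
    (("SF", "SJ", "Berkeley"), ("SF", "SJ", "SJ")),
    (("SF", "SJ", "Berkeley", "SF"), ("SF", "SJ", "Berkeley", "Berkeley")),
]

state_cycle = find_preference_cycle(state_prefs)
history_cycle = find_preference_cycle(history_prefs)
```

Here state_cycle contains all three cities, which is the money pump, while history_cycle is None.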
When we do so, we find that it has several shortcomings - in particular, it rules out some preferences which seem to be reasonable and natural ones. For example, suppose you want to write a book which is so timeless that at least one person reads it every year for the next thousand years. There is no single point at which the state of the world contains enough information to determine whether you’ve succeeded or failed in this goal: in any given year there may be no remaining record of whether somebody read it in a previous year (or the records could have been falsified, etc). This goal is fundamentally a preference over trajectories.² In correspondence, Rohin gave me another example: someone whose goal is to play a great song in its entirety, and who isn’t satisfied with the prospect of playing the final note while falsely believing that they’ve already played the rest of the piece. More generally, I think that virtue-ethicists and deontologists are more accurately described as caring about world-trajectories than world-states - and almost all humans use these theories to some extent when choosing their actions. Meanwhile Eric Drexler’s CAIS framework relies on services which are bounded in time taken and resources used - another constraint which can’t be expressed just in terms of individual world-states.
At this point it may seem that definition 2 is superior, but unfortunately it fails badly once we introduce the distinction between hypothetical and observed preferences, by specifying that we only get to observe the agent's behaviour in the MDP over N timesteps. Previously we'd still been assuming that we could elicit the agent's hypothetical preferences about every possible pair of lotteries, and judge its coherence based on those. What would it instead mean for its behaviour to be incoherent?
- Under definition 1, given some reward function R, the value of an action can be defined using Bellman equations as the expected reward from the resulting transition, plus the expected value of the best action available at the next timestep. Then we can define an agent to be coherent iff there is some R such that the agent is only ever observed to take the highest-value action available to it.³
- Under definition 2, let P be the agent's policy. Then each action gives rise to a distribution over trajectories, and so we can interpret each choice of action taken as a choice between lotteries over trajectories (in a way which depends on P, since the agent needs to predict how its future self will behave). Now we define an agent to be coherent iff there is some policy P and some coherent preference function Q such that all observed choices are consistent with Q given the assumption that the agent will continue following P.
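The two behavioural definitions above can be written out roughly as follows. The symbols T, Q_R, h_t and \ell_P are my own shorthand rather than standard notation from either post, and I'm assuming that the undiscounted sums are finite.

```latex
% Definition 1: value of action a in state s under a state-reward function R,
% where T(s' | s, a) is the MDP's transition function.
Q_R(s, a) = \mathbb{E}_{s' \sim T(\cdot \mid s, a)}
            \left[ R(s') + \max_{a'} Q_R(s', a') \right]

% Observed behaviour (s_1, a_1), \dots, (s_N, a_N) is coherent iff
\exists R \;\; \forall t \in \{1, \dots, N\} : \quad
    a_t \in \arg\max_{a} Q_R(s_t, a)

% Definition 2: \ell_P(h, a) is the lottery over trajectories induced by
% taking action a after history h and following policy P thereafter.
% Observed behaviour is coherent iff
\exists P, Q \;\; \forall t, \, \forall a : \quad
    \ell_P(h_t, a_t) \succeq_Q \ell_P(h_t, a)
```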
It turns out that under definition 2, any sequence of actions is coherent, since there's always a preference function under which the trajectory that actually occurred was the best one possible (as Rohin pointed out here). I think this is a decisive objection to making claims about agents appearing coherent using definition 2, and so we're left with definition 1. But note that there is no coherence theorem which says that an agent’s preferences need to be defined over states instead of trajectories, and in fact I've argued above that the latter is a more plausible model of humans. So even if definition 1 turns out to be a useful one, it would take additional arguments to show that we should expect that sort of coherence from advanced AIs, rather than (trivial) coherence with respect to trajectories. I'm not aware of any compelling arguments along those lines.
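A minimal sketch of why this works, assuming a deterministic setting in which the agent's choices pin down a single trajectory: given any observed behaviour at all, we can write down a utility function over whole trajectories which that behaviour maximises. The function name and the example trajectories are hypothetical.

```python
def rationalising_utility(observed):
    """Utility over whole trajectories: 1 for the observed one, 0 otherwise."""
    observed = tuple(observed)

    def utility(trajectory):
        return 1.0 if tuple(trajectory) == observed else 0.0

    return utility


# Hypothetical behaviour as (state, action) pairs, however odd it looks.
observed = [("SF", "pay $50"), ("San Jose", "pay $50"), ("Berkeley", "pay $50")]
alternatives = [
    [("SF", "stay"), ("SF", "stay"), ("SF", "stay")],
    [("SF", "pay $50"), ("San Jose", "stay"), ("San Jose", "stay")],
]

u = rationalising_utility(observed)
best = max([observed] + alternatives, key=u)
```

Whatever observed is, best comes out equal to it, so the test for coherence never fails.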
And in fact, definition 1 turns out to have further problems. For example: I haven't yet defined how a coherent agent is meant to choose between equally good options. One natural approach is to simply allow it to make any choice in those situations - it can hardly be considered irrational for doing so, since by assumption whatever it chooses is just as good as any other option. However, in that case any behaviour is consistent with the indifferent preference function (which rates all outcomes as equal). So even under definition 1, any sequence of actions is coherent. Now, I don't think it's very realistic that superintelligent AGIs will actually be indifferent about the effects of most of their actions, so perhaps we can just rule out preferences which feature indifference too often. But note that this adds an undesirable element of subjectivity to our definition.
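Here is a toy illustration of the indifference problem. Two assumptions are mine: the MDP is deterministic with two states, and I use a finite horizon to keep the code short, even though the post assumes an infinite one. The check asks whether every observed action is a highest-value action under a given state-reward function. The all-zero reward passes for any behaviour whatsoever, while a reward that actually discriminates between states can be contradicted.

```python
# Hypothetical deterministic toy MDP: TRANSITIONS[state][action] -> next state.
TRANSITIONS = {
    "A": {"stay": "A", "go": "B"},
    "B": {"stay": "B", "go": "A"},
}


def action_values(reward, state, steps_left):
    """Value of each action: reward of the next state plus the best continuation."""
    if steps_left == 0:
        return {action: 0.0 for action in TRANSITIONS[state]}
    return {
        action: reward[nxt] + max(action_values(reward, nxt, steps_left - 1).values())
        for action, nxt in TRANSITIONS[state].items()
    }


def consistent_with(reward, observed, horizon):
    """True iff every observed (state, action) pair picks a highest-value action."""
    for t, (state, action) in enumerate(observed):
        values = action_values(reward, state, horizon - t)
        if values[action] < max(values.values()):
            return False
    return True


observed = [("A", "go"), ("B", "go"), ("A", "stay")]

indifferent = {"A": 0.0, "B": 0.0}
likes_b = {"A": 0.0, "B": 1.0}

ok_indifferent = consistent_with(indifferent, observed, horizon=3)
ok_likes_b = consistent_with(likes_b, observed, horizon=3)
```

Here ok_indifferent is True and ok_likes_b is False: an agent that valued being in B would not have left it at the second step.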
That subjectivity is exacerbated when we try to model the fact that decisions in the real world are made under conditions of imperfect information. I won't cover this in detail, but the basic idea is that we change the setting from an MDP to a partially-observable MDP (aka POMDP), and instead of requiring coherent agents to take the actions which are actually best according to their preferences, they simply need to take the actions which are best according to their beliefs. How do we know what their beliefs are? We can't deduce them from agents' behaviour, and we can't just read them off from internal representations (at least, not in general). I think the closest we can get is to say that an agent is coherent if there is any prior belief state and any coherent preference function such that, if we assume that it updates its beliefs via Bayesian conditionalisation, the agent always takes the action which it believes to be best. Unfortunately (but unsurprisingly), we've yet again defined incoherence out of existence. In this case, given that we can only observe a bounded number of the agent's actions, there's always some pathological prior which justifies its behaviour. We could address this problem by adding the constraint that the prior needs to be a "reasonable" one, but this is a very vague term, and there's no consensus on what it actually means.
There’s a final issue with the whole setup of an agent traversing states: in the real world, and in examples like non-transitive travel, we never actually end up in quite the same state we started in. Perhaps we’ve gotten sunburned along the journey. Perhaps we spent a few minutes editing our next blog post. At the very least, we’re now slightly older, and we have new memories, and the sun’s position has changed a little. And so, just like with definition 2, no series of choices can ever demonstrate incoherent revealed preferences in the sense of definition 1, since every choice actually made is between a different set of possible states. (At the very least, they differ in the agent’s memories of which path it took to get there.⁴ And note that outcomes which are identical except for slight differences in memories should sometimes be treated in very different ways, since having even a few bits of additional information from exploration can be incredibly advantageous.)
Now, this isn’t so relevant in the human context because we usually abstract away from the small details. For example, if I offer to sell you an ice-cream and you refuse it, and then I offer it again a second later and you accept, I’d take that as evidence that your preferences are incoherent - even though technically the two offers are different because accepting the first just leads you to a state where you have an ice-cream, while accepting the second leads you to a state where you both have an ice-cream and remember refusing the first offer. Similarly, I expect that you don’t consider two outcomes to be different if they only differ in the precise pattern of TV static or the exact timing of leaves rustling. But again, there are no coherence constraints saying that an agent can’t consider such factors to be immensely significant, enough to totally change their preferences over lotteries when you substitute in one such outcome for the other.
So for the claim that sufficiently optimised agents appear coherent to be non-trivially true under definition 1, we’d need to clarify that such coherence is only with respect to outcomes when they’re categorised according to the features which humans consider important, except for the ones which are intrinsically temporally extended, conditional on the agent having a reasonable prior and not being indifferent over too many options. But then the standard arguments from coherence constraints no longer apply, because they're based on maths, not the ill-defined concepts used in the previous sentence. At this point I think it’s better to abandon the whole idea of formal coherence as a predictor of real-world behaviour, and replace it with Rohin’s notion of “goal-directedness”, which is more upfront about being inherently subjective, and doesn’t rule out any of the goals that humans actually have.
Thanks to Tim Genewein, Ramana Kumar, Victoria Krakovna, Rohin Shah, Toby Ord and Stuart Armstrong for discussions which led to this post, and helpful comments.
¹ Disjointedness of outcomes makes this argument more succinct, but it’s not actually a necessary component, because once you’ve received one outcome, your preferences over all other outcomes are allowed to change. For example, having won $1,000,000, the value you place on other financial prizes will very likely go down. This is related to my later argument that you never actually have multiple paths to ending up in the “same” state.
² At this point you could object on a technicality: from the unitarity of quantum mechanics, it seems as if the laws of physics are in fact reversible, and so the current state of the universe (or multiverse, rather) actually does contain all the information you theoretically need to deduce whether or not any previous goal has been satisfied. But I’m limiting this claim to macroscopic-level phenomena, for two reasons. Firstly, I don’t think our expectations about the behaviour of advanced AI should depend on very low-level features of physics in this way; and secondly, if the objection holds, then preferences over states have all the same problems as preferences over trajectories.
³ Technical note: I’m assuming an infinite time horizon and no discounting, because removing either of those conditions leads to weird behaviour which I don’t want to dig into in this post. In theory this leaves open the possibility of infinite expected reward, or of lotteries over infinitely many outcomes, but I think that we can just ignore these cases without changing the core idea behind my argument. The underlying assumption here is something like: whether we model the universe as finite or infinite shouldn’t significantly affect whether we expect AI behaviour to be coherent over the next few centuries, for any useful definition of coherent.
⁴ Perhaps you can construct a counterexample involving memory loss, but this doesn’t change the overall point, and if you’re concerned with such technicalities you’ll also have to deal with the problems I laid out in footnote 2.
I wonder if we can rescue Eliezer's argument. Informally (as far as I understand it) Eliezer's argument is that if an agent is the result of some optimization process, that optimization process will tend to notice and fix any incoherent behavior in the agent because that behavior will likely cause the agent to do something that counts as a clear loss from the optimization process's perspective.
So instead of letting O be either world states or world trajectories, make it the set of all possible combinations of properties of world trajectories that optimization processes in our world might care about. Formally we can define this as a partition of all possible world trajectories into mutually exclusive subsets where two trajectories are in the same subset iff no optimization process in our light-cone is likely to distinguish between them in any way. (BTW I believe it's standard or at least not unusual in decision theory to think of O as coarse-grained outcomes that people might care about, rather than micro states or micro trajectories.)
Now Rohin's objection no longer applies because we can't always find "a utility function which assigns maximal utility to all and only the world-trajectories in which those choices were made". Consider an agent that twitches according to some random sequence R. Since no optimization process in our world is likely to care that an agent twitches exactly according to R, any element of O that contains a trajectory where the agent twitches according to R would also contain a trajectory where the agent twitches according to some other sequence R', so there is no utility function which assigns maximal utility to all and only the world-trajectories in which the agent twitches according to R.
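A rough sketch of the twitching example, with made-up feature names standing in for "properties that optimization processes might care about". Once utility is defined on the cells of the partition rather than on raw trajectories, two trajectories that differ only in the twitch sequence necessarily get the same utility, whatever the utility function is.

```python
def outcome_cell(trajectory):
    """Hypothetical coarse-graining: drop the twitches, keep the cared-about features."""
    return (trajectory["humans_alive"], trajectory["paperclips"])


def lift(cell_utility):
    """Turn a utility over cells of the partition into a utility over trajectories."""
    def utility(trajectory):
        return cell_utility(outcome_cell(trajectory))
    return utility


# Two trajectories identical except for the twitch sequence (R versus R').
traj_r = {"twitches": (1, 0, 1, 1), "humans_alive": True, "paperclips": 3}
traj_r_prime = {"twitches": (0, 0, 1, 0), "humans_alive": True, "paperclips": 3}

# An arbitrary utility over cells; any other choice gives the same conclusion.
some_utility = lift(lambda cell: 10.0 * cell[0] + cell[1])

same_cell = outcome_cell(traj_r) == outcome_cell(traj_r_prime)
same_utility = some_utility(traj_r) == some_utility(traj_r_prime)
```

Both same_cell and same_utility come out True, so no utility function of this form can assign maximal utility to "all and only" the trajectories in which the agent twitches according to R.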
Having (hopefully) formalized the argument in a way that is no longer vacuous, I have to say I'm not entirely sure what the larger point of it is. Rohin seems to think the point is "Simply knowing that an agent is intelligent lets us infer that it is goal-directed" but Eliezer doesn't seem to think that corrigible (hence not goal-directed) agents are impossible to build. (That's actually one of MIRI's research objectives even though they take a different approach from Paul's.) Can anyone link to places where Eliezer uses this argument as part of some larger argument?
Here's an example of Eliezer using the argument: AI Alignment: Why It’s Hard, and Where to Start
From Rohin's post, a quote which I also endorse:
And if you're going to argue based on particular empirical facts about what goals we expect, then I don't think that doing so via coherence arguments helps very much.
I note that the first sentence of your post is "Rohin Shah has recently criticised Eliezer’s argument that “sufficiently optimised agents appear coherent”, on the grounds that any behaviour can be rationalised as maximisation of the expectation of some utility function." so it seems worth pointing out that there's a reasonable way to interpret “sufficiently optimised agents appear coherent” which isn't subject to that criticism.
Beyond that, as I mentioned, it's not clear to me what Eliezer was arguing for. (It seems plausible that he considered “sufficiently optimised agents appear coherent”, or the immediate corollary that such agents can be viewed as approximate EU maximizers with utility functions over the O that I defined, interesting in itself as a possibly surprising prediction that we can make about such agents.) What larger conclusion do you think he was arguing for, and why (preferably with citations)? Once we settle that, maybe then we can discuss whether his argumentative strategy was a good one?
I think the point (from Eliezer's perspective) is "Simply knowing that an agent is intelligent lets us infer that it is an expected utility maximizer". The main implication is that there is no way to affect the details of a superintelligent AI except by affecting its utility function, since everything else is fixed by math (specifically the VNM theorem). Note that this is (or rather, appears to be) a very strong condition on what alignment approaches could possibly work -- you can throw out any approach that isn't going to affect the AI's utility function. I think this is the primary reason for Eliezer making this argument. Let's call this the "intelligence implies EU maximization" claim.
Separately, there is another claim that says "EU maximization by default implies goal-directedness" (or the presence of convergent instrumental subgoals, if you prefer that instead of goal-directedness). However, this is not required by math, so it is possible to avoid this implication, by designing your utility function in just the right way.
Corrigibility is possible under this framework by working against the second claim, i.e. designing the utility function in just the right way that you get corrigible behavior out. And in fact this is the approach to corrigibility that MIRI looked into.
I am primarily taking issue with the "intelligence implies EU maximization" argument. The problem is, "intelligence implies EU maximization" is true, it just happens to be vacuous. So I can't say that that's what I'm arguing against. This is why I rounded it off to arguing against "intelligence implies goal-directedness", though this is clearly a bad enough summary that I shouldn't be saying that any more.
Eliezer explicitly disclaimed this:
In Relevant powerful agents will be highly optimized he went into even more detail about how one might create an intelligent agent that is not "highly optimized" and hence not an expected utility maximizer.
In summary it seems like you misunderstood Eliezer due to not noticing a distinction that he draws between "intelligent" (or "cognitively powerful") and "highly optimized".
That's true, I'm not sure what this distinction is meant to capture. I'm updating that the thing I said is less likely to be true, but I'm still somewhat confident that it captures the general gist of what Eliezer meant. I would bet on this at even odds if there were some way to evaluate it.
This is a tiny bit of his writing, and his tone makes it clear that this is unlikely. This is different from what I expected (when something has the force of a theorem you don't usually call its negation just "unlikely" and have a story for how it could be true), but it still seems consistent with the general story I said above.
In any case, I don't want to spend any more time figuring out what Eliezer believes, he can say something himself if he wants. I mostly replied to this comment to clarify the particular argument I'm arguing against, which I thought Eliezer believed, but even if he doesn't it seems like a common implicit belief in the rationalist AI safety crowd and should be debunked anyway.
It seems fine to debunk what you think is a common implicit belief in the rationalist AI safety crowd, but I think it's important to be fair to other researchers and not attribute errors to them when you don't know or aren't sure that they actually committed such errors. For people who aren't domain experts (which is most people), reputation is highly important for them to evaluate claims in a technical field like AI safety, so we should take care not to misinform them about, for example, how often someone makes technical errors.
I'm pretty sure I have never mentioned Eliezer in the Value Learning sequence. I linked to his writings because they're the best explanation of the perspective I'm arguing against. (Note that this is different from claiming that Eliezer believes that perspective.) This post and comment thread attributed the argument and belief to Eliezer, not me. I responded because it was specifically about what I was arguing against in my post, and I didn't say "I am clarifying the particular argument I am arguing against and am unsure what Eliezer's actual position is" because a) I did think that it was Eliezer's actual position, b) this is a ridiculous amount of boilerplate and c) I try not to spend too much time on comments.
I'm not feeling particularly open to feedback currently, because honestly I think I take far more care about this sort of issue than the typical researcher, but if you want to list a specific thing I could have done differently, I might try to consider how to do that sort of thing in the future.
Just a note that in the link that Wei Dai provides for "Relevant powerful agents will be highly optimized", Eliezer explicitly assigns '75%' to 'The probability that an agent that is cognitively powerful enough to be relevant to existential outcomes, will have been subject to strong, general optimization pressures.'
Yeah, it's worth noting that I don't understand what this means. By my intuitive read of the statement, I'd have given it 95+% of being true, in the sense that you aren't going to randomly stumble upon a powerful agent. But also by my intuitive read, the negative example given on that page would be a positive example:
On my view, known algorithms are already very optimized? E.g. Dijkstra's algorithm is highly optimized for efficient computation of shortest paths.
So TL;DR idk what optimized is supposed to mean here.
This seems pretty false to me. If you can predict in advance that some future you will be optimizing for something else, you could trade with future "you" and merge utility functions, which seems strictly better than not. (Side note: I'm pretty annoyed with all the use of "there's no coherence theorem for X" in this post.)
As a separate note, the "further out" your goal is and the more that your actions are for instrumental value, the more it should look like world 1 in which agents are valuing abstract properties of world states, and the less we should observe preferences over trajectories to reach said states.
(This is a reason in my mind to prefer the approval-directed-agent frame, in which humans get to inject preferences that are more about trajectories.)
I agree that this problem is not a particularly important one, and explicitly discard it a few sentences later. I hadn't considered your objection though, and will need to think more about it.
Mind explaining why? Is this more a stylistic preference, or do you think most of them are wrong/irrelevant?
Also true if you make world states temporally extended.
For what it's worth, under any continuous distribution over reward functions, only a measure zero subset of reward functions has more than one optimal trajectory from any state. So, it's a little less subjective to rule out indifference (assume continuity and ignore measure zero events), but it is still subjective and doesn't deal with the other problems with defn 1.