[Epistemic status: my current view, but I haven’t read all the stuff on this topic even in the LessWrong community, let alone more broadly.]
There is a line of thought that says that advanced AI will tend to be ‘goal-directed’—that is, consistently doing whatever makes certain favored outcomes more likely—and that this is to do with the ‘coherence arguments’. Rohin Shah, and probably others1, have argued against this. I want to argue against them.
I’d reconstruct the original argument that Rohin is arguing against as something like this (making no claim about my own beliefs here):
And since the point of all this is to argue that advanced AI might be hard to deal with, note that we can get to that conclusion with:
Rohin’s counterargument begins with an observation made by others before: any behavior is consistent with maximizing expected utility, given some utility function. For instance, a creature just twitching around on the ground may have the utility function that returns 1 if the agent does whatever it in fact does in each situation (where ‘situation’ means, ‘entire history of the world so far’), and 0 otherwise. This is a creature that just wants to make the right twitch in each detailed, history-indexed situation, with no regard for further consequences. Alternately the twitching agent might care about outcomes, but just happen to want the particular holistic unfolding of the universe that is occurring, including this particular series of twitches. Or it could be indifferent between all outcomes.
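The construction in that observation can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions, not anything from Rohin's post: the world is deterministic (so expected utility is just utility), a "history" is a tuple of past actions, and the names `twitcher`, `rationalizing_utility`, and `is_optimal` are made up for the example.

```python
from itertools import product

ACTIONS = ("twitch_left", "twitch_right", "lie_still")


def twitcher(history):
    """An arbitrary deterministic policy: the action depends on the whole history."""
    return ACTIONS[(7 * len(history) + sum(len(a) for a in history)) % len(ACTIONS)]


def lie_still(history):
    """A different policy, used only for contrast."""
    return "lie_still"


def rationalizing_utility(policy):
    """Build u(history, action) = 1 if `action` is what `policy` does there, else 0."""
    def utility(history, action):
        return 1.0 if action == policy(history) else 0.0
    return utility


def is_optimal(policy, utility, histories):
    """True if `policy` picks a utility-maximizing action in every given history."""
    return all(
        utility(h, policy(h)) == max(utility(h, a) for a in ACTIONS)
        for h in histories
    )


# Every history of length 0 to 3 (histories are tuples of past actions).
HISTORIES = [h for n in range(4) for h in product(ACTIONS, repeat=n)]

u_twitch = rationalizing_utility(twitcher)
print(is_optimal(twitcher, u_twitch, HISTORIES))   # the twitcher maximizes its own rationalizing utility
print(is_optimal(lie_still, u_twitch, HISTORIES))  # a different policy does not maximize that same utility
```

The same recipe works for any policy at all, which is the point: the existence of some utility function that a behavior maximizes places no constraint on the behavior.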
The basic point is that rationality doesn’t say what ‘things’ you can want. And in particular, it doesn’t say that you have to care about particular atomic units that larger situations can be broken down into. If I try to call you out for first spending money to get to Paris, then spending money to get back from Paris, there is nothing to say you can’t just have wanted to go to Paris for a bit and then to come home. In fact, this is a common human situation. ‘Aha, I money pumped you!’ says the airline, but you aren’t worried. The twitching agent might always be like this—a creature of more refined tastes, who cares about whole delicate histories and relationships, rather than just summing up modular momentarily-defined successes. And given this freedom, any behavior might conceivably be what a creature wants.
Then I would put the full argument, as I understand it, like this:
Is this just some disagreement about the meaning of the word ‘goal-directed’? No, because we can get back to a major difference in physical expectations by adding:
So where the original argument says that the coherence arguments plus some other assumptions imply danger from AI, this counterargument says that they do not.
(There is also at least some variety in the meaning of ‘goal-directed’. I’ll use goal-directedRohin to refer to what I think is Rohin’s preferred usage: roughly, that which seems intuitively goal directed to us, e.g. behaving similarly across situations, and accruing resources, and not flopping around in possible pursuit of some exact history of personal floppage, or peaceably preferring to always take the option labeled ‘A’.6)
What’s wrong with Rohin’s counterargument? It sounded tight.
In brief, I see two problems:
You might then think that a probabilistic version still applies: since every entity appears to be in good standing with the coherence arguments, the arguments don’t exert any force, probabilistically, on what entities we might see. But:
Perhaps Rohin only meant to argue about whether it is logically possible to be coherent and not goal-directed-seeming, for the purpose of arguing that humanity can construct creatures in that perhaps-unlikely-in-nature corner of mindspace, if we try hard. In which case, I agree that it is logically possible. But I think his argument is often taken to be relevant more broadly, to questions of whether advanced AI will tend to be goal-directed, or to be goal-directed in places where they were not intended to be.
I take 1) to be fairly clear. I’ll lay out 2) in more detail.
Let us step back.
How would coherence arguments affect an AI system—or anyone—anyway? They’re not going to fly in from the platonic realm and reshape irrational creatures.
The main routes, as I see it, are via implying:
To be clear, the agent, the makers, or the world are not necessarily thinking about the arguments here—the arguments correspond to incentives in the world, which these parties are responding to. So I’ll often talk about ‘incentives for coherence’ or ‘forces for coherence’ rather than ‘coherence arguments’.
I’ll talk more about 1 for simplicity, expecting 2 and 3 to be similar, though I haven’t thought them through.
If self-adjustment is the mechanism for the coherence, this doesn’t depend on what a sequence of actions looks like from the outside, but on what it looks like from the inside.
Consider the aforementioned creature just twitching sporadically on the ground. Let’s call it Alex.
As noted earlier, there is a utility function under which Alex is maximizing expected utility: the one that assigns utility 1 to however Alex in fact acts in every specific history, and utility 0 to anything else.
But from the inside, this creature you excuse as ‘maybe just wanting that series of twitches’ has—let us suppose—actual preferences and beliefs. And if its preferences do not in fact prioritize this elaborate sequence of twitching in an unconflicted way, and it has the self-awareness and means to make corrections, then it will make corrections8. And having done so, its behavior will change.
Thus excusable-as-coherent Alex is still moved by coherence arguments, even while the arguments have no complaints about its behavior per se.
For a more realistic example: suppose Assistant-Bot is observed making this sequence of actions:
This is consistent with coherence: Assistant-Bot might prefer that exact sequence of actions over all others, or might prefer incurring gym costs with a larger sum of prime factors, or might prefer talking to Gym-sales-bot over ending the conversation, or prefer agreeing to things.
But suppose that in fact, in terms of the structure of the internal motivations producing this behavior, Assistant-Bot just prefers you to have a gym membership, and prefers you to have a better membership, and prefers you to have money, but is treating these preferences with inconsistent levels of strength in the different comparisons. Then there appears to be a coherence-related force for Assistant-Bot to change. One way that that could look is that since Assistant-Bot’s overall behavioral policy currently entails giving away money for nothing, and also Assistant-Bot prefers money over nothing, that preference gives Assistant-Bot reason to alter its current overall policy, to avert the ongoing exchange of money for nothing.9 And if its behavioral policy is arising from something like preferences, then the natural way to alter it is via altering those preferences, and in particular, altering them in the direction of coherence.
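The ‘money for nothing’ dynamic can be made concrete with a small Python sketch. Everything specific here is a hypothetical stand-in rather than the Assistant-Bot sequence above: the two membership tiers, the prices, and the helper names `make_agent`, `run_trades`, and `is_money_pumped` are all assumptions for illustration. The agent values the upgrade at one strength when buying and at a different strength when selling, which is one simple way of treating the same preference with inconsistent strength in different comparisons.

```python
def make_agent(upgrade_value_when_buying, upgrade_value_when_selling):
    """Agent that decides each trade with a pairwise comparison.

    The two arguments are how much the agent treats 'premium over basic'
    as being worth in each comparison. A coherent agent uses one value for both.
    """
    def accepts(give, get, price):
        # price > 0 means the agent pays; price < 0 means the agent receives a rebate.
        if (give, get) == ("basic", "premium"):
            return upgrade_value_when_buying > price
        if (give, get) == ("premium", "basic"):
            return -price > upgrade_value_when_selling
        return False
    return accepts


def run_trades(accepts, item, money, offers):
    """Apply each offered trade in order if it is applicable and the agent accepts it."""
    for give, get, price in offers:
        if item == give and accepts(give, get, price):
            item, money = get, money - price
    return item, money


def is_money_pumped(start_item, start_money, end_item, end_money):
    """Money for nothing: same holdings as at the start, but strictly less money."""
    return end_item == start_item and end_money < start_money


# Hypothetical offers: pay 2 to upgrade, then take a rebate of 1 to downgrade again.
OFFERS = [("basic", "premium", 2.0), ("premium", "basic", -1.0)]

inconsistent = make_agent(upgrade_value_when_buying=3.0, upgrade_value_when_selling=0.5)
coherent = make_agent(upgrade_value_when_buying=3.0, upgrade_value_when_selling=3.0)

for name, agent in [("inconsistent", inconsistent), ("coherent", coherent)]:
    end_item, end_money = run_trades(agent, "basic", 100.0, OFFERS)
    print(name, end_item, end_money, is_money_pumped("basic", 100.0, end_item, end_money))
```

In this toy setup the inconsistent agent ends where it started with less money, while an agent using a single valuation either keeps the upgrade or never buys it. Note that the pump is only visible because the sketch stipulates what the agent internally values; from behavior alone, the same trades could be rationalized as preferred.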
One issue with this line of thought is that it’s not obvious in what sense there is anything inside a creature that corresponds to ‘preferences’. Often when people posit preferences, the preferences are defined in terms of behavior. Does it make sense to discuss different possible ‘internal’ preferences, distinct from behavior? I find it helpful to consider the behavior and ‘preferences’ of groups:
Suppose two cars are parked in driveways, each containing a couple. One couple are just enjoying hanging out in the car. The other couple are dealing with a conflict: one wants to climb a mountain together, and the other wants to swim in the sea together, and they aren’t moving because neither is willing to let the outing proceed as the other wants. ‘Behaviorally’, both cars are the same: stopped. But their internal parts (the partners) are importantly different. And in the long run, we expect different behavior: the car with the unconflicted couple will probably stay where it is, and the conflicted car will (hopefully) eventually resolve the conflict and drive off.
I think here it makes sense to talk about internal parts, separate from behavior, and real. And similarly in the single agent case: there are physical mechanisms producing the behavior, which can have different characteristics, and which in particular can be ‘in conflict’ (in a way that motivates change) or not. I think it is also worth observing that humans find their preferences ‘in conflict’ and try to resolve them, which suggests that they at least are better understood in terms of both behavior and underlying preferences that are separate from it.
So we have: even if you can excuse any seizuring as consistent with coherence, coherence incentives still exert a force on creatures that are in fact incoherent, given their real internal state (or would be incoherent if created). At least if they or their creator have machinery for noticing their incoherence, caring about it, and making changes.
Or put another way, coherence doesn’t exclude overt behaviors alone, but does exclude combinations of preferences, and preferences beget behaviors. This changes how specific creatures behave, even if it doesn’t entirely rule out any behavior ever being correct for some creature, somewhere.
That is, the coherence theorems may change what behavior is likely to appear amongst creatures with preferences.
Ok, but moving toward coherence might sound totally innocuous, since, per Rohin’s argument, coherence includes all sorts of things, such as absolutely any sequence of behavior.
But the relevant question is again whether a coherence-increasing reform process is likely to result in some kinds of behavior over others, probabilistically.
This is partly a practical question—what kind of reform process is it? Where a creature ends up depends not just on what it incoherently ‘prefers’, but on what kinds of things its so-called ‘preferences’ are at all10, and what mechanisms detect problems, and how problems are resolved.
My guess is that there are also things we can say in general. It’s too big a topic to investigate properly here, but some initially plausible hypotheses about a wide range of coherence-reform processes:
These hypotheses suggest to me that the changes in behavior brought about by coherence forces favor moving toward goal-directednessRohin, and therefore at least weakly toward risk.
Together, this does not imply that advanced AI will tend to be goal-directedRohin. We don’t know how strong such forces are. Evidently not so strong that humans11, or our other artifacts, are whipped into coherence in mere hundreds of thousands of years12. If a creature doesn’t have anything like preferences (beyond a tendency to behave certain ways), then coherence arguments don’t obviously even apply to it (though discrepancies between the creature’s behavior and its makers’ preferences probably produce an analogous force13 and competitive pressures probably produce a similar force for coherence in valuing resources instrumental to survival). Coherence arguments mark out an aspect of the incentive landscape, but to say that there is an incentive for something, all things equal, is not to say that it will happen.
1) Even though any behavior could be coherent in principle, if it is not coherent in combination with an entity’s internal state, then coherence arguments point to a real force for different (more coherent) behavior.
2) My guess is that this force for coherent behavior is also a force for goal-directed behavior. This isn’t clear, but seems likely, and also isn’t undermined by Rohin’s argument, as seems commonly believed.
Two dogs attached to the same leash are pulling in different directions. Etching by J. Fyt, 1642
Thanks for writing this post, Katja; I'm very glad to see more engagement with these arguments. However, I don't think the post addresses my main concern about the original coherence arguments for goal-directedness, which I'd frame as follows:
There's some intuitive conception of goal-directedness, which is worrying in the context of AI. The old coherence arguments implicitly used the concept of EU-maximisation as a way of understanding goal-directedness. But Rohin demonstrated that the most straightforward conception of EU-maximisation (which I'll call behavioural EU-maximisation) is inadequate as a theory of goal-directedness, because it applies to any agent. In order to fix this problem, the main missing link is not a stronger (probabilistic) argument for why AGIs will be coherent EU-maximisers, but rather an explanation of what it even means for a real-world agent to be a coherent EU-maximiser, which we don't currently have.
By "behavioural EU-maximisation", I mean thinking of a utility function as something that we define purely in terms of an agent's behaviour. In response to this, you identify an alternative definition of expected utility maximisation which isn't purely behavioural, but also refers to an agent's internal features:
An outside observer being able to rationalize a sequence of observed behavior as coherent doesn’t mean that the behavior is actually coherent. Coherence arguments constrain combinations of external behavior and internal features—‘preferences’ and beliefs. So whether an actor is coherent depends on what preferences and beliefs it actually has.
But you don't characterise those internal features in a satisfactory way, or point to anyone else who does. The closest you get is in your footnote, where you fall back on a behavioural definition of preferences:
When exactly an aspect of these should be considered a ‘preference’ for the sake of this argument isn’t entirely clear to me, but would seem to depend on something like whether it tends to produce actions favoring certain outcomes over other outcomes across a range of circumstances
I'm sympathetic to this, because it's hard to define preferences without reference to behaviour. We just don't know enough about cognitive science yet to do so. But it means that your conception of EU-maximisation is still vulnerable to Rohin's criticisms of behavioural EU-maximisation, because you still have to extract preferences from behaviour.
From my perspective, then, claims like "Anything that weakly has goals has reason to reform to become an EU maximizer" (as made in this comment) miss the crux of the disagreement. It's not that I believe the claim is false; I just don't know what it means, and I don't think anyone else does either. Unfortunately the fact that there are theorems about EU maximisation in some restricted formalisms makes people think that it's a concept which is well-defined in real-world agents to a much greater extent than it actually is.
Here's an exaggerated analogy to help convey what I mean by "well-defined concept". Characters in games often have an attribute called health points (HP), and die when their health points drop to 0. Conceivably you could prove a bunch of theorems about health points in a certain class of games, e.g. that having more is always good. Okay, so is having more health points always good for real-world humans (or AIs)? I mean, we must have something like the health point formalism used in games, because if we take too much damage, we die! Sure, some critics say that defining health points in terms of external behaviour (like dying) is vacuous - but health points aren't just about behaviour, we can also define them in terms of an agent's internal features (like the tendency to die in a range of circumstances).
I would say that EU is like "health points": a concept which is interesting to reason about in some formalisms, and which is clearly related to an important real-world concept, but whose relationship to that non-formal real-world concept we don't yet understand well. Perhaps continued investigation can fix this; I certainly hope so! But in the meantime, using "EU-maximisation" instead of "goal-directedness" feels similar to using "health points" as a substitute for "health" - its main effect is to obscure our conceptual confusion under a misleading layer of formalism, thereby making the associated arguments seem stronger than they actually are.
I love your health points analogy. Extending it, imagine that someone came up with "coherence arguments" that showed that for a rational doctor doing triage on patients, and/or for a group deciding who should do a risky thing that might result in damage, the optimal strategy involves a construct called "health points" such that:
--Each person at any given time has some number of health points
--Whenever someone reaches 0 health points, they (very probably) die
--Similar afflictions/disasters tend to cause similar amounts of decrease in health points, e.g. a bullet in the thigh causes me to lose 5 hp and you to lose 5 hp and Katja to lose 5hp.
Wouldn't these coherence arguments be pretty awesome? Wouldn't this be a massive step forward in our understanding (both theoretical and practical) of health, damage, triage, and risk allocation?
This is so despite the fact that someone could come along and say "Well, these coherence arguments assume a concept (our intuitive concept) of 'damage'; they don't tell us what 'damage' means. (Ditto for concepts like 'die' and 'person' and 'similar'.)" That would be true, and it would still be a good idea to do further deconfusion research along those lines, but it wouldn't detract much from the epistemic victory the coherence arguments won.
Insofar as such a system could practically help doctors prioritise, then that would be great. (This seems analogous to how utilities are used in economics.)
But if doctors use this concept to figure out how to treat patients, or use it when designing prostheses for their patients, then I expect things to go badly. If you take HP as a guiding principle - for example, you say "our aim is to build an artificial liver with the most HP possible" - then I'm worried that this would harm your ability to understand what a healthy liver looks like on the level of cells, or tissues, or metabolic pathways, or roles within the digestive system. Because HP is just not a well-defined concept at that level of resolution.
Analogously, it seems very hard to have a good understanding of goals without talking about concepts, instincts, desires, etc, and the roles that all of these play within cognition as a whole - concepts which people just don't talk about much around here. I hypothesise that this is partly because they think they can talk about utilities instead. But when people reason about how to design AGIs in terms of utilities, on the basis of coherence theorems, then I think they're making a very similar mistake as a doctor who tries to design artificial livers based on the theoretical triage virtues of HP.
I agree more and more with you that the big mistake with using utility functions/reward for thinking about goal-directedness is not so much that they are a bad abstraction, but that they are often used as if every utility function is as meaningful as any other. Here, the meaningfulness comes from thinking about cognition and what following such a utility function would entail. There's a pretty intuitive sense in which a utility function that encodes exactly a trajectory and nothing else, for a complex enough setting, doesn't look like a goal.
A difference between us, I think, is that I expect that we can add structure that restricts the set of utility functions we consider (structure that comes from thinking, among other things, about cognition) such that maximizing the expected utility for such a constrained utility function would actually capture most if not all of the aspects of goal-directedness that matter to us.
My internal model of you is that you believe this approach would not be enough because the utility would not be defined on the internal concepts of the agent. Yet I think it doesn't so much have to be defined on these internal concepts itself as to rely on some assumption about these internal concepts. That could mean either adapting the state space and action space, or keeping fixed spaces but adding mappings, equivalence classes, or metrics on them that encode the relevant assumptions about cognition.
My internal model of you is that you believe this approach would not be enough because the utility would not be defined on the internal concepts of the agent. Yet I think it doesn't so much have to be defined on these internal concepts itself as to rely on some assumption about these internal concepts.
Yeah, this is an accurate portrayal of my views. I'd also note that the project of mapping internal concepts to mathematical formalisms was the main goal of the whole era of symbolic AI, and failed badly. (Although the analogy is a little loose, so I wouldn't take it as a decisive objection, but rather a nudge to formulate a good explanation of what they were doing wrong that you will do right.)
I agree more and more with you that the big mistake with using utility functions/reward for thinking about goal-directedness is not so much that they are a bad abstraction, but that they are often used as if every utility function is as meaningful as any other.
I don't think this is an accurate portrayal of my views. I am trying to say that utility functions are a bad abstraction for reasoning about AGI, for similar reasons to why health points are a bad abstraction for reasoning about livers. (I think I agree with the rest of the paragraph though.)
My first intuition is that I expect mapping internal concepts to mathematical formalisms to be easier when the end goal is deconfusion and making sense of behaviors, compared to actually improving capabilities. But I'd have to think about it some more. Thanks at least for an interesting test to try to apply to my attempt.
Okay, do you mean that you agree with my paragraph but what you are really arguing about is that utility functions don't care about the low-level/internals of the system, and that's why they're bad abstractions? (That's how I understand your liver and health points example).
You're mistaken about the view I'm arguing against. (Though perhaps in practice most people think I'm arguing against the view you point out, in which case I hope this post helps them realize their error.) In particular:
Whatever things you care about, you are best off assigning consistent numerical values to them and maximizing the expected sum of those values
If you start by assuming that the agent cares about things, and your prior is that the things it cares about are "simple" (e.g. it is very unlikely to be optimizing the-utility-function-that-makes-the-twitching-robot-optimal), then I think the argument goes through fine. According to me, this means you have assumed goal-directedness in from the start, and are now seeing what the implications of goal-directedness are.
My claim is that if you don't assume that the agent cares about things, coherence arguments don't let you say "actually, principles of rationality tell me that since this agent is superintelligent it must care about things".
Stated this way it sounds almost obvious that the argument doesn't work, but I used to hear things that effectively meant this pretty frequently in the past. Those arguments usually go something like this:
This talk for example gives the impression that this sort of argument works. (If you look carefully, you can see that it does state that the AI is programmed to have "objects of concern", which is where the goal-directedness assumption comes in, but you can see why people might not notice that as an assumption.)
You might think "well, obviously the superintelligent AI system is going to care about things, maybe it's technically an assumption but surely that's a fine assumption". I think on balance I agree, but it doesn't seem nearly so obvious to me, and seems to depend on how exactly the agent is built. For example, it's plausible to me that superintelligent expert systems would not be accurately described as "caring about things", and I don't think it was a priori obvious that expert systems wouldn't lead to AGI. Similarly, it seems at best questionable whether GPT-3 can be accurately described as "caring about things".
As to whether this argument is relevant for whether we will build goal-directed systems: I don't think that in isolation my argument should strongly change your view on the probability you assign to that claim. I see it more as a constraint on what arguments you can supply in support of that view. If you really were just saying "VNM theorem, therefore 99%", then probably you should become less confident, but I expect in practice people were not doing that and so it's not obvious how exactly their probabilities should change.
I'd appreciate advice on how to change the post to make this clearer -- I feel like your response is quite common, and I haven't yet figured out how to reliably convey the thing I actually mean.
Thanks. Let me check if I understand you correctly:
You think I take the original argument to be arguing from ‘has goals' to ‘has goals’, essentially, and agree that that holds, but don’t find it very interesting/relevant.
What you disagree with is an argument from ‘anything smart’ to ‘has goals’, which seems to be what is needed for the AI risk argument to apply to any superintelligent agent.
Is that right?
If so, I think it’s helpful to distinguish between ‘weakly has goals’ and ‘strongly has goals’:
So that the full argument I currently take you to be responding to is closer to:
In that case, my current understanding is that you are disagreeing with 2, and that you agree that if 2 holds in some case, then the argument goes through. That is, creatures that are weakly goal directed are liable to become strongly goal directed. (e.g. an agent that twitches because it has various flickering and potentially conflicting urges toward different outcomes is liable to become an agent that more systematically seeks to bring about some such outcomes) Does that sound right?
If so, I think we agree. (In my intuition I characterize the situation as ‘there is roughly a gradient of goal directedness, and a force pulling less goal directed things into being more goal directed’. This force probably doesn’t exist out at the zero goal directedness edges, but it is unclear how strong it is in the rest of the space, i.e. whether it becomes substantial as soon as you move out from zero goal directedness, or is weak until you are in a few specific places right next to ‘maximally goal directed’.)
Yes, that's basically right.
Well, I do think it is an interesting/relevant argument (because as you say it explains how you get from "weakly has goals" to "strongly has goals"). I just wanted to correct the misconception about what I was arguing against, and I wanted to highlight the "intelligent" --> "weakly has goals" step as a relatively weak step in our current arguments. (In my original post, my main point was that that step doesn't follow from pure math / logic.)
In that case, my current understanding is that you are disagreeing with 2, and that you agree that if 2 holds in some case, then the argument goes through.
At least, the argument makes sense. I don't know how strong its effect is -- basically I agree with your phrasing here:
This force probably doesn’t exist out at the zero goal directedness edges, but it is unclear how strong it is in the rest of the space, i.e. whether it becomes substantial as soon as you move out from zero goal directedness, or is weak until you are in a few specific places right next to ‘maximally goal directed’.
I wrote an AI Impacts page summary of the situation as I understand it. If anyone feels like looking, I'm interested in corrections/suggestions (either here or in the AI Impacts feedback box).
Looks good to me :)
A few quick thoughts on reasons for confusion:
I think maybe one thing going on is that I already took the coherence arguments to apply only in getting you from weakly having goals to strongly having goals, so since you were arguing against their applicability, I thought you were talking about the step from weaker to stronger goal direction. (I’m not sure what arguments people use to get from 1 to 2 though, so maybe you are right that it is also something to do with coherence, at least implicitly.)
It also seems natural to think of ‘weakly has goals’ as something other than ‘goal directed’, and ‘goal directed’ as referring only to ‘strongly has goals’, so that ‘coherence arguments do not imply goal directed behavior’ (in combination with expecting coherence arguments to be in the weak->strong part of the argument) sounds like ‘coherence arguments do not get you from “weakly has goals” to “strongly has goals”’.
I also think separating out the step from no goal direction to weak, and weak to strong might be helpful in clarity. It sounded to me like you were considering an argument from 'any kind of agent' to 'strong goal directed' and finding it lacking, and I was like 'but any kind of agent includes a mix of those that this force will work on, and those it won't, so shouldn't it be a partial/probabilistic move toward goal direction?' Whereas you were just meaning to talk about what fraction of existing things are weakly goal directed.
Thanks, that's helpful. I'll think about how to clarify this in the original post.
Maybe changing the title would prime people less to have the wrong interpretation? E.g., to 'Coherence arguments require that the system care about something'.
Even just 'Coherence arguments do not entail goal-directed behavior' might help, since colloquial "imply" tends to be probabilistic, but you mean math/logic "imply" instead. Or 'Coherence theorems do not entail goal-directed behavior on their own'.
Curated. Felt to me like a valuable step in this conversation, and analyzed some details helpfully to me. Thanks for writing it.
This seems consistent with coherence being not a constraint but a dimension of optimization pressure among several/many? Like environments that money pump more reliably will have stronger coherence pressure, but also the creature might just install a cheap hack for avoiding that particular pump (if narrow) which then loosens the coherence pressure (coherence pressure sounds expensive, so workarounds are good deals).
I feel like this post is the best current thing to link to for understanding the point of coherence arguments in AI Alignment, which I think are really crucial, and even in 2023 I still see lots of people make bad arguments either overextending the validity of coherence arguments, or dismissing coherence arguments completely in an unproductive way.
I think this is worth highlighting as something we too often ignore to our peril. Paying attention to internal parts is sometimes "annoying" in the sense that we can build models that are much easier to reason about by ignoring mechanisms and simply treating things, like AIs, as black boxes (or as made up of a small number of black boxes) with some behavior we can observe from the outside. But doing so will result in us consistently being surprised in ways we needn't have been.
For example, you treat two AIs as if they are EU maximizers and you model the utility function they are maximizing. But they actually behave in different ways in some situation even though the modeled utility function predicted the same behavior. And I don't think this is just a failure to make a good enough model of the utility function; I think it's fundamental, in the way Goodhart's law is: when we model something we are necessarily measuring it and so not getting the real thing, and we will necessarily risk surprise when the model and the real thing do different things.
So the less detailed our models, and the more they ignore internals, the more we put ourselves at risk. Anyway, this is kind of a tangent from the post, but I feel like I constantly see simple models used to explain AI that push out important internals for the sake of simplicity, which creates real risks of confusing ourselves in attempts to build aligned AI.