One broad argument for AI risk is the Misspecified Goal argument:
The Misspecified Goal Argument for AI Risk: Very intelligent AI systems will be able to make long-term plans in order to achieve their goals, and if their goals are even slightly misspecified then the AI system will become adversarial and work against us.
My main goal in this post is to make conceptual clarifications and suggest how they affect the Misspecified Goal argument, without making any recommendations about what we should actually do. Future posts will argue more directly for a particular position. As a result, I will not be considering other arguments for focusing on AI risk even though I find some of them more compelling.
I think of this as a concern about long-term goal-directed behavior. Unfortunately, it’s not clear how to categorize behavior as goal-directed vs. not. Intuitively, any agent that searches over actions and chooses the one that best achieves some measure of “goodness” is goal-directed (though there are exceptions, such as the agent that selects actions that begin with the letter “A”). (ETA: I also think that agents that show goal-directed behavior because they are looking at some other agent are not goal-directed themselves -- see this comment.) However, this is not a necessary condition: many humans are goal-directed, but there is no goal baked into the brain that they are using to choose actions.
This is related to the concept of optimization, though with intuitions around optimization we typically assume that we know the agent’s preference ordering, which I don’t want to assume here. (In fact, I don’t want to assume that the agent even has a preference ordering.)
One potential formalization is to say that goal-directed behavior is any behavior that can be modelled as maximizing expected utility for some utility function; in the next post I will argue that this does not properly capture the behaviors we are worried about. In this post I’ll give some intuitions about what “goal-directed behavior” means, and how these intuitions relate to the Misspecified Goal argument.
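As a rough sketch of what this candidate formalization says, here it is in symbols. The notation is assumed purely for illustration and does not come from any particular source: $\pi$ is the agent's policy, $\tau$ is a trajectory (a full history of states and actions) generated by following a policy, and $U$ is a utility function over trajectories.

```latex
\pi \text{ counts as goal-directed}
\quad\Longleftrightarrow\quad
\exists\, U \;:\;
\pi \in \arg\max_{\pi'} \; \mathbb{E}_{\tau \sim \pi'}\!\left[\, U(\tau) \,\right]
```

Note that the quantifier ranges over all utility functions $U$, with no restriction on how simple or natural $U$ has to be.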
Generalization to novel circumstances
Consider two possible agents for playing some game, let’s say TicTacToe. The first agent looks at the state and the rules of the game, and uses the minimax algorithm to find the optimal move to play. The second agent has a giant lookup table that tells it what move to play given any state. Intuitively, the first one is more “agentic” or “goal-driven”, while the second one is not. But both of these agents play the game in exactly the same way!
The difference is in how the two agents generalize to new situations. Let’s suppose that we suddenly change the rules of TicTacToe -- perhaps now the win condition is reversed, so that anyone who gets three in a row loses. The minimax agent is still going to be optimal at this game, whereas the lookup-table agent will lose against any opponent with half a brain. The minimax agent looks like it is “trying to win”, while the lookup-table agent does not. (You could say that the lookup-table agent is “trying to take actions according to <policy>”, but this is a weird complicated goal so maybe it doesn’t count.)
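To make the contrast concrete, here is a minimal Python sketch of the two agents. Everything in it is an illustrative assumption rather than a canonical implementation: the board is a 9-character string, the function names (`minimax_move`, `build_table`, `play`) are hypothetical, and the rule change is modelled by a single `misere` flag meaning "whoever completes three in a row loses". The lookup table is filled in by running minimax under the normal rules, so the two agents are identical until the rules change.

```python
from functools import lru_cache

# The eight three-in-a-row lines on a board indexed 0..8, row by row.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
         (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]

def other(player):
    return 'O' if player == 'X' else 'X'

def legal_moves(board):
    return [i for i, cell in enumerate(board) if cell == ' ']

def place(board, i, player):
    return board[:i] + player + board[i + 1:]

def line_maker(board):
    """Return the player who has three in a row, or None."""
    for a, b, c in LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None

@lru_cache(maxsize=None)
def value(board, player, misere):
    """Value for `player`, who is about to move: +1 win, 0 draw, -1 loss."""
    maker = line_maker(board)
    if maker is not None:
        # Under the reversed rules, completing a line loses instead of wins.
        winner = other(maker) if misere else maker
        return 1 if winner == player else -1
    if ' ' not in board:
        return 0
    return max(-value(place(board, i, player), other(player), misere)
               for i in legal_moves(board))

def minimax_move(board, player, misere):
    """Agent 1: search the game tree under whatever rules are in force."""
    return max(legal_moves(board),
               key=lambda i: -value(place(board, i, player), other(player), misere))

def build_table():
    """Agent 2: memorize Agent 1's normal-rules move for every live position."""
    table, stack = {}, [(' ' * 9, 'X')]
    while stack:
        board, player = stack.pop()
        if board in table or line_maker(board) or ' ' not in board:
            continue
        table[board] = minimax_move(board, player, False)
        stack.extend((place(board, i, player), other(player))
                     for i in legal_moves(board))
    return table

TABLE = build_table()

def table_agent(board, player):
    return TABLE[board]  # ignores the rules entirely

def normal_agent(board, player):
    return minimax_move(board, player, False)

def misere_agent(board, player):
    return minimax_move(board, player, True)

def play(agent_x, agent_o, misere):
    """Play one game; return the winner ('X' or 'O') or None for a draw."""
    board, player = ' ' * 9, 'X'
    while True:
        maker = line_maker(board)
        if maker is not None:
            return other(maker) if misere else maker
        if ' ' not in board:
            return None
        agent = agent_x if player == 'X' else agent_o
        board = place(board, agent(board, player), player)
        player = other(player)

# A position with X to move: X holds squares 0, 1, 7 and O holds 3, 4, 6.
STATE = "XX " + "OO " + "OX "

if __name__ == "__main__":
    print("table move:", TABLE[STATE])
    print("minimax move, reversed rules:", minimax_move(STATE, 'X', True))
    print("reversed-rules game, minimax (X) vs table (O):",
          play(misere_agent, table_agent, True))
```

In the `STATE` position the table still says to complete the top row (square 2), which is an immediate loss once the win condition is reversed, whereas the minimax agent re-derives its move from the new rules and avoids that square. Passing the rules in as an argument is the design choice that matters here: the "goal" lives in the search, not in the stored moves.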
In general, when we say that an agent is pursuing some goal, this is meant to allow us to predict how the agent will generalize to some novel circumstance. This sort of reasoning is critical for the Misspecified Goal argument for AI risk. For example, we worry that an AI agent will prevent us from turning it off, because being turned off would prevent it from achieving its goal: “You can't fetch the coffee if you're dead.” This is a prediction about what an AI agent would do in the novel circumstance where a human is trying to turn the agent off.
This suggests a way to characterize these sorts of goal-directed agents: there is some goal such that the agent’s behavior in new circumstances can be predicted by figuring out which behavior best achieves the goal. There's a lot of complexity in the space of goals we consider: something like "human well-being" should count, but "the particular policy <x>" and “pick actions that start with the letter A” should not. When I use the word goal I mean to include only the first kind, even though I currently don’t know theoretically how to distinguish between the various cases.
Note that this is in stark contrast to existing AI systems, which are particularly bad at generalizing to new situations.
[Image not reproduced. Caption: “Honestly, I’m surprised it’s only 90%.”]
We could also look at whether or not the agent acquires more power and resources. It seems likely that an agent that is optimizing for some goal over the long term would want more power and resources in order to more easily achieve that goal. In addition, the agent would probably try to improve its own algorithms in order to become more intelligent.
This feels like a consequence of goal-directed behavior, and not its defining characteristic, because it is about being able to achieve a wide variety of goals, instead of a particular one. Nonetheless, it seems crucial to the broad argument for AI risk presented above, since an AI system will probably need to first accumulate power, resources, intelligence, etc. in order to cause catastrophic outcomes.
I find this concept of acquiring power and resources most useful when thinking about the problem of inner optimizers, where in the course of optimization through a rich space you stumble across a member of the space that is itself doing optimization, but for a related but still misspecified metric. Since the inner optimizer is being “controlled” by the outer optimization process, it is probably not going to cause major harm unless it is able to “take over” the outer optimization process, which sounds a lot like accumulating power. (This discussion is extremely imprecise and vague; see Risks from Learned Optimization for a more thorough discussion.)
Our understanding of the behavior
There is a general pattern in which as soon as we understand something, it becomes something lesser. As soon as we understand rainbows, they are relegated to the “dull catalogue of common things”. This suggests a somewhat cynical explanation of our concept of “intelligence”: an agent is considered intelligent if we do not know how to achieve the outcomes it does using the resources that it has (in which case our best model for that agent may be that it is pursuing some goal, reflecting our tendency to anthropomorphize). That is, our evaluation about intelligence is a statement about our epistemic state. Some examples that follow this pattern are:
- As soon as we understand how some AI technique solves a challenging problem, it is no longer considered AI. Before we’ve solved the problem, we imagine that we need some sort of “intelligence” that is pointed towards the goal and solves it: the only method we have of predicting what this AI system will do is to think about what a system that tries to achieve the goal would do. Once we understand how the AI technique works, we have more insight into what it is doing and can make more detailed predictions about where it will work well, where it tends to make mistakes, etc., and so it no longer seems like “intelligence”. For example, once you know that OpenAI Five is trained by self-play, you can predict that it hasn’t seen certain behaviors, like standing still to turn invisible, and so probably won’t handle them well.
- Before we understood the idea of natural selection and evolution, we would look at the complexity of nature and ascribe it to intelligent design; once we had the mathematics (and even just the qualitative insight), we could make much more detailed predictions, and nature no longer seemed like it required intelligence. For example, we can predict the timescales on which we can expect evolutionary changes, which we couldn’t do if we just modeled evolution as optimizing reproductive fitness.
- Many phenomena (e.g. rain, wind) that we now have scientific explanations for were previously explained as the result of some anthropomorphic deity.
- When someone performs a feat of mental math, or can tell you instantly what day of the week a random date falls on, you might be impressed and think them very intelligent. But if they explain to you how they did it, you may find it much less impressive. (Though of course these feats are selected to seem more impressive than they are.)
Note that an alternative hypothesis is that humans equate intelligence with mystery; as we learn more and remove the mystery around, e.g., evolution, we automatically think of it as less intelligent.
To the extent that the Misspecified Goal argument relies on this intuition, the argument feels a lot weaker to me. If the Misspecified Goal argument rested entirely upon this intuition, then it would be asserting that because we are ignorant about what an intelligent agent would do, we should assume that it is optimizing a goal, which means that it is going to accumulate power and resources and lead to catastrophe. In other words, it is arguing that assuming that an agent is intelligent definitionally means that it will accumulate power and resources. This seems clearly wrong; it is possible in principle to have an intelligent agent that nonetheless does not accumulate power and resources.
Also, the argument is not saying that in practice most intelligent agents accumulate power and resources. It says that we have no better model to go off of other than “goal-directed”, and then pushes this model to extreme scenarios where we should have a lot more uncertainty.
To be clear, I do not think that anyone would endorse the argument as stated. I am suggesting as a possibility that the Misspecified Goal argument relies on us incorrectly equating superintelligence with “pursuing a goal” because we use “pursuing a goal” as a default model for anything that can do interesting things, even if that is not the best model to be using.
Summary
Intuitively, goal-directed behavior can lead to catastrophic outcomes with a sufficiently intelligent agent, because the optimal behavior for even a slightly misspecified goal can be very bad according to the true goal. However, it’s not clear exactly what we mean by goal-directed behavior. Often, an algorithm that searches over possible actions and chooses the one with the highest “goodness” will be goal-directed, but this is neither necessary nor sufficient.
“From the outside”, it seems like a goal-directed agent is characterized by the fact that we can predict the agent’s behavior in new situations by assuming that it is pursuing some goal, and as a result it acquires power and resources. This can be interpreted either as a statement about our epistemic state (we know so little about the agent that our best model is that it pursues a goal, even though this model is not very accurate or precise) or as a statement about the agent (predicting the behavior of the agent in new situations based on pursuit of a goal actually has very high precision and accuracy). These two views have very different implications for the validity of the Misspecified Goal argument for AI risk.
Footnote: This is an entirely made-up number.
Do you have a citation for this? Who are you arguing against, or whose argument are you trying to clarify?
I tend to have a different version of the Misspecified Goal argument in mind which I think doesn't have this problem:
I briefly looked for and did not find a good citation for this.
I'm not sure. However, I have a lot of conversations where it seems to me that the other person believes the Misspecified Goal Argument. Currently, if I were to meet a MIRI employee I hadn't met before, I would be unsure whether the Misspecified Goal Argument is their primary reason for worrying about AI risk. If I meet a rationalist who takes the MIRI perspective on AI risk but isn't at MIRI themselves, by default I assume that their primary reason for caring about AI risk is the Misspecified Goal argument.
I do want to note that I am primarily trying to clarify here, I didn't write this as an argument against the Misspecified Goal argument. In fact, conditional on the AI having goals, I do agree with the Misspecified Goal argument.
Yeah, I think this is a good argument, and I want to defer to my future post on the topic, which should come out on Wednesday. The TL;DR is that I agree with the argument but it implies a broader space of potential solutions than "figure out how to align a goal-directed AI".
(Sorry that I didn't adequately point to different arguments and what I think about them -- I didn't do this because it would make for a very long post, and it's instead being split into several posts, and this particular argument happens to be in the post on Wednesday.)
My guess is that agents that are not primarily goal-directed can be good at defending against goal-directed agents (especially with a first-mover advantage that prevents goal-directed agents from gaining power), and are potentially more tractable for alignment purposes, if humans coexist with AGIs during their development and operation (rather than existing only as computational processes inside the AGI's goal, a situation where a goal concept becomes necessary).
I think the assumption that useful agents must be goal-directed has misled a lot of discussion of AI risk in the past. Goal-directed agents are certainly a problem, but not necessarily the solution. They are probably good for fixing astronomical waste, but maybe not AI risk.
I think I disagree with this at least to some extent. Humans are not generally safe agents, and in order for not-primarily-goal-directed AIs to not exacerbate humans' safety problems (for example by rapidly shifting their environments/inputs out of a range where they are known to be relatively safe), it seems that we have to solve many of the same metaethical/metaphilosophical problems that we'd need to solve to create a safe goal-directed agent. I guess in some sense the former has lower "AI risk" than the latter in that you can plausibly blame any bad outcomes on humans instead of AIs, but to me that's actually a downside because it means that AI creators can more easily deny their responsibility to help solve those problems.
Learning how to design goal-directed agents seems like an almost inevitable milestone on the path to figuring out how to safely elicit human preference in an actionable form. But the steps involved in eliciting and enacting human preference don't necessarily make use of a concept of preference or goal-directedness. An agent with a goal aligned with the world can't derive its security from the abstraction of goal-directedness, because the world determines that goal, and so the goal is vulnerable to things in the world, including human error. Only self-contained artificial goals are safe from the world and may lead to safety of goal-directed behavior. A goal built from human uploads that won't be updated from the world in the future gives safety from other things in the world, but not from errors of the uploads.
When the issue is figuring out which influences of the world to follow, it's not clear that goal-directedness remains salient. If there is a goal, then there is also a world-in-the-goal and listening to your own goal is not safe! Instead, you have to figure out which influences in your own goal to follow. You are also yourself part of the world and so there is an agent-in-the-goal that can decide aspects of preference. This framing where a goal concept is prominent is not obviously superior to other designs that don't pursue goals, and instead focus on pointing at the appropriate influences from the world. For example, a system may seek to make reliable uploads, or figure out which decisions of uploads are errors, or organize uploads to make sense of situations outside normal human environments, or be corrigible in a secure way, so as to follow directions of a sane external operator and not of an attacker. Once we have enough of such details figured out (none of which is a goal-directed agent), it becomes possible to take actions in the world. At that point, we have a system of many carefully improved kluges that further many purposes in much the same way as human brains do, and it's not clearly an improvement to restructure that system around a concept of goals, because that won't move it closer to the influences of the world it's designed to follow.