Note: This is the first post from part two: basic intuitions of the sequence on iterated amplification. The second part of the sequence outlines the basic intuitions that motivate iterated amplification. I think that these intuitions may be more important than the scheme itself, but they are considerably more informal.
Research in AI is steadily progressing towards more flexible, powerful, and autonomous goal-directed behavior. This progress is likely to have significant economic and humanitarian benefits: it helps make automation faster, cheaper, and more effective, and it allows us to automate deciding what to do.
Many researchers expect goal-directed machines to predominate, and so have considered the long-term implications of this kind of automation. Some of these implications are worrying: if sophisticated artificial agents pursue their own objectives and are as smart as we are, then the future may be shaped as much by their goals as by ours.
Most thinking about “AI safety” has focused on the possibility of goal-directed machines, and asked how we might ensure that their goals are agreeable to humans. But there are other possibilities.
In this post I will flesh out one alternative to goal-directed behavior. I think this idea is particularly important from the perspective of AI safety.
Consider a human Hugh, and an agent Arthur who uses the following procedure to choose each action:
Estimate the expected rating Hugh would give each action if he considered it at length. Take the action with the highest expected rating.
I’ll call this “approval-directed” behavior throughout this post, in contrast with goal-directed behavior. In this context I’ll call Hugh an “overseer.”
Arthur’s actions are rated more highly than those produced by any alternative procedure. That’s comforting, but it doesn’t mean that Arthur is optimal. An optimal agent may make decisions that have consequences Hugh would approve of, even if Hugh can’t anticipate those consequences himself. For example, if Arthur is playing chess he should make moves that are actually good—not moves that Hugh thinks are good.
The quality of approval-directed decisions is limited by the minimum of Arthur’s ability and Hugh’s ability: Arthur makes a decision only if it looks good to both Arthur and Hugh. So why would Hugh be interested in this proposal, rather than doing things himself?
- Hugh doesn’t actually rate actions, he just participates in a hypothetical rating process. So Hugh can oversee many agents like Arthur at once (and spend his actual time relaxing on the beach). In many cases, this is the whole point of automation.
- Hugh can (hypothetically) think for a very long time about each decision—longer than would be practical or cost-effective if he had to actually make the decision himself.
- Similarly, Hugh can think about Arthur’s decisions at a very low level of detail. For example, Hugh might rate a chess-playing AI’s choices about how to explore the game tree, rather than rating its final choice of moves. If Arthur is making billions of small decisions each second, then Hugh can think in depth about each of them, and the resulting system can be much smarter than Hugh.
- Hugh can (hypothetically) use additional resources in order to make his rating: powerful computers, the benefit of hindsight, many assistants, very long time periods.
- Hugh’s capabilities can be gradually escalated as needed, and one approval-directed system can be used to bootstrap to a more effective successor. For example, Arthur could advise Hugh on how to define a better overseer; Arthur could offer advice in real-time to help Hugh be a better overseer; or Arthur could directly act as an overseer for his more powerful successor.
In most situations, I would expect approval-directed behavior to capture the benefits of goal-directed behavior, while being easier to define and more robust to errors.
Facilitate indirect normativity
Approval-direction is closely related to what Nick Bostrom calls “indirect normativity” — describing what is good indirectly, by describing how to tell what is good. I think this idea encompasses the most credible proposals for defining a powerful agent’s goals, but has some practical difficulties.
Asking an overseer to evaluate outcomes directly requires defining an extremely intelligent overseer, one who is equipped (at least in principle) to evaluate the entire future of the universe. This is probably impractical overkill for the kinds of agents we will be building in the near future, who don’t have to think about the entire future of the universe.
Approval-directed behavior provides a more realistic alternative: start with simple approval-directed agents and simple overseers, and scale up the overseer and the agent in parallel. I expect the approval-directed dynamic to converge to the desired limit; this requires only that the simple overseers approve of scaling up to more powerful overseers, and that they are able to recognize appropriate improvements.
Some approaches to AI require “locking in” design decisions. For example, if we build a goal-directed AI with the wrong goals then the AI might never correct the mistake on its own. For sufficiently sophisticated AI’s, such mistakes may be very expensive to fix. There are also more subtle forms of lock-in: an AI may also not be able to fix a bad choice of decision-theory, sufficiently bad priors, or a bad attitude towards infinity. It’s hard to know what other properties we might inadvertently lock-in.
Approval-direction involves only extremely minimal commitments. If an approval-directed AI encounters an unforeseen situation, it will respond in the way that we most approve of. We don’t need to make a decision until the situation actually arises.
Perhaps most importantly, an approval-directed agent can correct flaws in its own design, and will search for flaws if we want it to. It can change its own decision-making procedure, its own reasoning process, and its own overseer.
Approval-direction seems to “fail gracefully:” if we slightly mess up the specification, the approval-directed agent probably won’t be actively malicious. For example, suppose that Hugh was feeling extremely apathetic and so evaluated proposed actions only superficially. The resulting agent would not aggressively pursue a flawed realization of Hugh’s values; it would just behave lackadaisically. The mistake would be quickly noticed, unless Hugh deliberately approved of actions that concealed the mistake.
This looks like an improvement over misspecifying goals, which leads to systems that are actively opposed to their users. Such systems are motivated to conceal possible problems and to behave maliciously.
The same principle sometimes applies if you define the right overseer but the agent reasons incorrectly about it, if you misspecify the entire rating process, or if your system doesn’t work quite like you expect. Any of these mistakes could be serious for a goal-directed agent, but are probably handled gracefully by an approval-directed agent.
Similarly, if Arthur is smarter than Hugh expects, the only problem is that Arthur won’t be able to use all of his intelligence to devise excellent plans. This is a serious problem, but it can be fixed by trial and error—rather than leading to surprising failure modes.
Is it plausible?
I’ve already mentioned the practical demand for goal-directed behavior and why I think that approval-directed behavior satisfies that demand. There are other reasons to think that agents might be goal-directed. These are all variations on the same theme, so I apologize if my responses become repetitive.
We assumed that Arthur can predict what actions Hugh will rate highly. But in order to make these predictions, Arthur might use goal-directed behavior. For example, Arthur might perform a calculation because he believes it will help him predict what actions Hugh will rate highly. Our apparently approval-directed decision-maker may have goals after all, on the inside. Can we avoid this?
I think so: Arthur’s internal decisions could also be approval-directed. Rather than performing a calculation because it will help make a good prediction, Arthur can perform that calculation because Hugh would rate this decision highly. If Hugh is coherent, then taking individual steps that Hugh rates highly leads to overall behavior that Hugh would approve of, just like taking individual steps that maximize X leads to behavior that maximizes X.
In fact the result may be more desirable, from Hugh’s perspective, than maximizing Hugh’s approval. For example, Hugh might incorrectly rate some actions highly, because he doesn’t understand them. An agent maximizing Hugh’s approval might find those actions and take them. But if the agent was internally approval-directed, then it wouldn’t try to exploit errors in Hugh’s ratings. Actions that lead to reported approval but not real approval, don’t lead to approval for approved reasons
Turtles all the way down?
Approval-direction stops making sense for low-level decisions. A program moves data from register A into register B because that’s what the next instruction says, not because that’s what Hugh would approve of. After all, deciding whether Hugh would approve itself requires moving data from one register to another, and we would be left with an infinite regress.
The same thing is true for goal-directed behavior. Low-level actions are taken because the programmer chose them. The programmer may have chosen them because she thought they would help the system achieve its goal, but the actions themselves are performed because that’s what’s in the code, not because of an explicit belief that they will lead to the goal. Similarly, actions might be performed because a simple heuristic suggests they will contribute to the goal — the heuristic was chosen or learned because it was expected to be useful for the goal, but the action is motivated by the heuristic. Taking the action doesn’t involve thinking about the heuristic, just following it.
Similarly, an approval-directed agent might perform an action because it’s the next instruction in the program, or because it’s recommended by a simple heuristic. The program or heuristic might have been chosen to result in approved actions, but the taking the action doesn’t involve reasoning about approval. The aggregate effect of using and refining such heuristics is to effectively do what the user approves of.
In many cases, perhaps a majority, the heuristics for goal-directed and approval-directed behavior will coincide. To answer “what do I want this function to do next?” I very often ask “what do I want the end result to be?” In these cases the difference is in how we think about the behavior of the overall system, and what invariants we try to maintain as we design it.
Approval-directed subsystems might be harder to build than goal-directed subsystems. For example, there is much more data of the form “X leads to Y” than of the form “the user approves of X.” This is a typical AI problem, though, and can be approached using typical techniques.
Approval-directed subsystems might also be easier to build, and I think this is the case today. For example, I recently wrote a function to decide which of two methods to use for the next step of an optimization. Right now it uses a simple heuristic with mediocre performance. But I could also have labeled some examples as “use method A” or “use method B,” and trained a model to predict what I would say. This model could then be used to decide when to use A, when to use B, and when to ask me for more training data.
Rational goal-directed behavior is reflectively stable: if you want X, you generally want to continue wanting X. Can approval-directed behavior have the same property?
Approval-directed systems inherit reflective stability (or instability) from their overseers. Hugh can determine whether Arthur “wants” to remain approval-directed, by approving or disapproving of actions that would change Arthur’s decision-making process.
Goal-directed agents want to be wiser and know more, though their goals are stable. Approval-directed agents also want to be wiser and know more, but they also want their overseers to be wiser and know more. The overseer is not stable, but the overseer’s values are. This is a feature, not a bug.
Similarly, an agent composed of approval-directed subsystems overseen by Hugh is not the same as an approval-directed agent overseen by Hugh. For example, the composite may make decisions too subtle for Hugh to understand. Again, this is a feature, not a bug.
Black box search
(Note: I no longer agree with the conclusions of this section. I now feel that approval-directed agents can probably be constructed out of powerful black-box search (or stochastic gradient descent); my main priority is now either handling this setting or else understanding exactly what the obstruction is. Ongoing work in this direction is collected at ai-control, and will hopefully be published in a clear format by the end of 2016.)
Some approaches to AI probably can’t yield approval-directed agents. For example, we could perform a search which treats possible agents as a black boxes and measures their behavior for signs of intelligence. Such a search could (eventually) find a human-level intelligence, but would give us very crude control over how that intelligence was applied. We could get some kind of goal-directed behavior by selecting for it, but selecting for approval-directed behavior would be difficult:
- The paucity of data on approval is a huge problem in this setting. (Note: semi-supervised reinforcement learning is an approach to this problem.)
- You have no control over the internal behavior of the agent, which you would expect to be optimized for pursuing a particular goal: maximizing whatever measure of “approval” that you used to guide your search. (Note: I no longer endorse this argument as written; reward engineering is a response to the substance of this concern.)
- Agents who maximized your reported approval in test cases need not do so in general, any more than humans are reliable reproductive-fitness-maximizers. (Note: red teaming is an approach to this problem.)
But  and especially  are also problems when designing a goal-directed agent with agreeable goals, or indeed any particular goals at all. Though approval-direction can’t deal with these problems, they aren’t new problems.
Such a black-box search—with little insight into the internal structure of the agents—seems worrying no matter how we approach AI safety. Fortunately, it also seems unlikely (though not out of the question).
A similar search is more likely to be used to produce internal components of a larger system (for example, you might train a neural network to identify objects, as a component of a system for navigating an unknown environment). This presents similar challenges, concerning robustness and unintended behaviors, whether we are designing a goal-directed or approval-directed agent.
This essay was originally posted here on 1st December 2014. The second half of it can be found in the next post in this sequence.
Tomorrow's AI Alignment Forum sequences post will be 'Approval-directed agents: "implementation" details', by Paul Christiano.