Approval-directed agents: details

by Paul Christiano, 23rd Nov 2018


Follow-up to approval-directed agents: overview.

So far I’ve talked about approval-direction imprecisely. Maybe I’m talking about something incoherent, which has desirable properties only in the same sense as a four-sided triangle—vacuously. I won’t really be able to dispel this concern here, but I’ll at least take some steps.

How do you define approval?

Eventually you would have to actually write code implementing approval-directed behavior. What might that code look like? I want to set aside the problem “what does a sophisticated AI look like?” since I obviously don’t know. So let’s suppose we had some black box that did all of the hard work. I’ll consider a few cases for what the black box does, ranging from “easy to work with” to “very hard to work with.”

(Note: I now believe that we can target AI systems trained (nearly) end-to-end with gradient descent, which is most similar to “learning from examples.”)

Natural language

As an easy case, suppose we have a natural language question-answering system, which can assign a probability to any natural language sentence. In this case, we ask the question:

“Suppose that Hugh understood the current situation, was asked ‘on a scale from 0 to 1, how good is the action a?’ and was given a few hours to determine his answer. What would his answer be?”

We then loop over each action a and take the action with the highest expected answer.
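This loop can be sketched in a few lines. The `qa_expected_answer` black box below is hypothetical, standing in for the natural language question-answering system (queried for the expected value of its numerical answer):

```python
def approval_query(action):
    """Format the natural-language question that defines Hugh's approval."""
    return (
        "Suppose that Hugh understood the current situation, was asked "
        f"'on a scale from 0 to 1, how good is the action {action}?' "
        "and was given a few hours to determine his answer. "
        "What would his answer be?"
    )

def choose_action(actions, qa_expected_answer):
    """Loop over candidate actions and take the one with the highest
    expected answer, as judged by the question-answering black box."""
    return max(actions, key=lambda a: qa_expected_answer(approval_query(a)))
```

Replacing Hugh with a more powerful overseer only requires editing the text returned by `approval_query`.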

In this framework, it is easy to replace Hugh by a more powerful overseer—all you have to do is specify the replacement in natural language.

“Math intuition module”

At an opposite extreme, suppose we have a “math intuition module,” a system which can assign probabilities only to perfectly precise statements—perhaps of the form “algorithm A returns output y on input x.”

I’ve written about defining “approval upon reflection” algorithmically (see here, here). These definitions can be used to define approval-directed behavior completely precisely. I’m pretty hesitant about these definitions, but I do think it is promising that we can get traction even in such an extreme case.
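To illustrate how approval can be recovered from purely precise queries, here is a minimal sketch. Both `prob` (the math intuition module) and `APPROVAL` (a precisely specified algorithm standing in for “approval upon reflection”) are hypothetical; the discretization to a finite grid of values is my assumption, not anything prescribed in the post:

```python
def expected_approval(prob, action, grid=11):
    """Estimate E[approval(action)] using only probabilities of precise
    statements of the form "algorithm A returns y on input x".

    `prob` maps such a statement to a probability; APPROVAL is assumed
    to output one of `grid` evenly spaced values in [0, 1]."""
    values = [i / (grid - 1) for i in range(grid)]
    return sum(
        v * prob(f"algorithm APPROVAL returns {v} on input {action!r}")
        for v in values
    )
```

An approval-directed agent would then loop over actions and take the one maximizing `expected_approval`, exactly as in the natural language case.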

In reality, I expect the situation to be somewhere in between the simple case of natural language and the hard case of mathematical rigor. Natural language is the case where we share all of our concepts with our machines, while mathematics is the case where we share only the most primitive concepts. In reality, I expect we will share some but not all of our concepts, with varying degrees of robustness. To the extent that approval-directed decisions are robust to imprecision, we can safely use some more complicated concepts, rather than trying to define what we care about in terms of logical primitives.

Learning from examples

In an even harder case, suppose we have a function learner which can take some labelled examples f(x) = y and then predict a new value f(x’). In this case we have to define “Hugh’s approval” directly via examples. I feel less comfortable with this case, but I’ll take a shot anyway.

In this case, our approval-directed agent Arthur maintains a probabilistic model over sequences observation[T] and approval[T](a). At each step T, Arthur selects the action a maximizing approval[T](a). Then the timer T is incremented, and Arthur records observation[T+1] from his sensors. Optionally, Hugh might specify a value approval[t](a) for any time t and any action a. Then Arthur updates his models, and the process continues.
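The loop above can be sketched as follows. Here `model` stands in for the hypothetical black-box function learner, and the class and method names are illustrative rather than a real implementation:

```python
class Arthur:
    """Sketch of the learning-from-examples protocol. `model` is the
    hypothetical function learner: given Hugh's labelled examples so
    far, it predicts approval[t](a) for a given time and action."""

    def __init__(self, model, actions):
        self.model = model
        self.actions = actions
        self.examples = []   # (t, a, approval) triples entered by Hugh
        self.history = []    # recorded observations
        self.T = 0

    def step(self, observation):
        # Select the action a maximizing predicted approval[T](a).
        action = max(self.actions,
                     key=lambda a: self.model(self.examples, self.T, a))
        # Increment the timer and record the next observation.
        self.T += 1
        self.history.append(observation)
        return action

    def rate(self, t, a, value):
        # Hugh may (optionally, and retroactively) specify approval[t](a);
        # the model sees the enlarged example set on the next step.
        self.examples.append((t, a, value))
```

Nothing in the loop rewards manipulating Hugh’s future labels: Arthur simply maximizes his current prediction of them.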

Like AIXI, if Arthur is clever enough he eventually learns that approval[T](a) refers to whatever Hugh will retroactively input. But unlike AIXI, Arthur will make no effort to manipulate these judgments. Instead he takes the action maximizing his expectation of approval[T] — i.e., his prediction about what Hugh will say in the future, if Hugh says anything at all. (This depends on his self-predictions, since what Hugh does in the future depends on what Arthur does now.)

At any rate, this is quite a lot better than AIXI, and it might turn out fine if you exercise appropriate caution. I wouldn’t want to use it in a high-stakes situation, but I think that it is a promising idea and that there are many natural directions for improvement. For example, we could provide further facts about approval (beyond example values), interpolating continuously between learning from examples and using an explicit definition of the approval function. More ambitiously, we could implement “approval-directed learning,” preventing it from learning complicated undesired concepts.

How should Hugh rate?

So far I’ve been very vague about what Hugh should actually do when rating an action. But the approval-directed behavior depends on how Hugh decides to administer approval. How should Hugh decide?

If Hugh expects action a to yield better consequences than action b, then he should give action a a higher rating than action b. In simple environments he can simply pick the best action, give it a rating of 1, and give the other options a rating of 0.

If Arthur is so much smarter than Hugh that he knows exactly what Hugh will say, then we might as well stop here. In this case, approval-direction amounts to Arthur doing exactly what Hugh instructs: “the minimum of Arthur’s capabilities and Hugh’s capabilities” is equal to “Hugh’s capabilities.”

But most of the time, Arthur won’t be able to tell exactly what Hugh will say. The numerical scale between 0 and 1 exists to accommodate Arthur’s uncertainty.

To illustrate the possible problems, suppose that Arthur is considering whether to drive across a bridge that may or may not collapse. Arthur thinks the bridge will collapse with 1% probability. But Arthur also thinks that Hugh knows for sure whether or not the bridge will collapse. If Hugh always assigned the optimal action a rating of 1 and every other action a rating of 0, then Arthur would take the action that was most likely to be optimal — driving across the bridge.

Hugh should have done one of two things:

  • Give a bad rating for risky behavior. Hugh should give Arthur a high rating only if Arthur drives across the bridge knowing that it is safe. In general, give a rating of 1 to the best action ex ante.
  • Assign a very bad rating to incorrectly driving across the bridge, and only a small penalty for being too cautious. In general, give ratings that reflect the utilities of possible outcomes—to the extent you know them.

Probably Hugh should do both. This is easier if Hugh understands what Arthur is thinking and why, and what range of possibilities Arthur is considering.
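The bridge example can be made numerical. The sketch below uses illustrative ratings of my own choosing (nothing here is prescribed by the scheme) to show why the naive 0/1 rule sends Arthur across the bridge, while outcome-reflecting ratings make him wait:

```python
p_collapse = 0.01  # Arthur's credence that the bridge collapses

def expected(rating_if_safe, rating_if_collapse):
    # Arthur's expectation of Hugh's rating, averaged over the two
    # possible worlds (Hugh knows the bridge's true state).
    return (1 - p_collapse) * rating_if_safe + p_collapse * rating_if_collapse

# Naive scheme: Hugh rates whichever action turned out best 1, the other 0.
drive_naive = expected(1.0, 0.0)      # 0.99
wait_naive  = expected(0.0, 1.0)      # 0.01 -> Arthur risks the bridge

# Outcome-reflecting scheme: a collapse is catastrophic (0.0), while a
# needless detour costs only a little, so waiting always rates 0.999.
drive_util = expected(1.0, 0.0)       # 0.99
wait_util  = expected(0.999, 0.999)   # 0.999 -> Arthur waits
```

The exact numbers don’t matter; what matters is that the ratings encode how much worse a collapse is than an unnecessary detour.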

Other details

I am leaving out many other important details in the interest of brevity. For example:

  • In order to make these evaluations Hugh might want to understand what Arthur is thinking and why. This might be accomplished by giving Hugh enough time and resources to understand Arthur’s thoughts; or by letting different instances of Hugh “communicate” to keep track of what is going on as Arthur’s thoughts evolve; or by ensuring that Arthur’s thoughts remain comprehensible to Hugh (perhaps by using approval-directed behavior at a lower level, and only approving of internal changes that can be rendered comprehensible).
  • It is best if Hugh optimizes his ratings to ensure the system remains robust. For example, in high stakes settings, Hugh should sometimes make Arthur consult the real Hugh to decide how to proceed—even if Arthur correctly knows what Hugh wants. This ensures that Arthur will seek guidance when he incorrectly believes that he knows what Hugh wants.

…and so on. The details I have included should be considered illustrative at best. (I don’t want anyone to come away with a false sense of precision.)


It would be sloppy to end the post without a sampling of possible pitfalls. For the most part these problems have more severe analogs for goal-directed agents, but it’s still wise to keep them in mind when thinking about approval-directed agents in the context of AI safety.

My biggest concerns

I have three big concerns with approval-directed agents, which are my priorities for follow-up research:

  • Is an approval-directed agent generally as useful as a goal-directed agent, or does this require the overseer to be (extremely) powerful? Based on the ideas in this post, I am cautiously optimistic.
  • Can we actually define approval-directed agents by examples, or do they already need a shared vocabulary with their programmers? I am again cautiously optimistic.
  • Is it realistic to build an intelligent approval-directed agent without introducing goal-directed behavior internally? I think this is probably the most important follow-up question. I would guess that the answer will be “it depends on how AI plays out,” but we can at least get insight by addressing the question in a variety of concrete scenarios.

Motivational changes for the overseer

“What would I say if I thought for a very long time?” might have a surprising answer. The very process of thinking harder, or of finding myself in a thought experiment, might alter my priorities. I may care less about the real world, or may become convinced that I am living in a simulation.

This is a particularly severe problem for my proposed implementation of indirect normativity, which involves a truly outlandish process of reflection. It’s still a possible problem for defining approval-direction, but I think it is much less severe.

“What I would say after a few hours” is close enough to real life that I wouldn’t expect my thought process to diverge too far from reality, either in values or beliefs. Short time periods are much easier to predict, and give less time to explore completely unanticipated lines of thought. In practice, I suspect we can also define something like “what I would say after a few hours of sitting at my desk under completely normal conditions,” which looks particularly innocuous.

Over time we will build more powerful AIs with more powerful (and perhaps more exotic) overseers, but making these changes gradually is much easier than making them all at once: small changes are more predictable, and each successive change can be made with the help of increasingly powerful assistants.

Treacherous turn

If Hugh inadvertently specifies the wrong overseer, then the resulting agent might be motivated to deceive him. Any rational overseer will be motivated to approve of actions that look reasonable to Hugh. If they don’t, Hugh will notice the problem and fix the bug, and the original overseer will lose their influence over the world.

This doesn’t seem like a big deal—a failed attempt to specify “Hugh” probably won’t inadvertently specify a different Hugh-level intelligence, it will probably fail innocuously.

There are some possible exceptions, which mostly seem quite obscure but may be worth having in mind. The learning-from-examples protocol seems particularly likely to have problems. For example:

  • Someone other than Hugh might be able to enter training data for approval[T](a). Depending on how Arthur is defined, these examples might influence Arthur’s behavior as soon as Arthur expects them to appear. In the most pathological case, these changes in Arthur’s behavior might have been the very reason that someone had the opportunity to enter fraudulent training data.
  • Arthur could accept the motivated simulation argument, believing himself to be in a simulation at the whim of a simulator attempting to manipulate his behavior.
  • The simplest explanation for Hugh’s judgments may be a simple program motivated to “mimic” the series approval[T] and observation[T] in order to influence Arthur.


An approval-directed agent may not be able to figure out what I approve of.

I’m skeptical that this is a serious problem. It falls within the range of predictive problems I’d expect a sophisticated AI to be good at. So it’s a standard objective for AI research, and AIs that can’t make such predictions probably have significantly sub-human ability to act in the world. Moreover, even a fairly weak reasoner can learn generalizations like “actions that lead to Hugh getting candy tend to be approved of” or “actions that take control away from Hugh tend to be disapproved of.”

If there is a problem, it doesn’t seem like a serious one. Straightforward misunderstandings will lead to an agent that is inert rather than actively malicious (see the “Fail gracefully” section). And deep misunderstandings can be avoided, by Hugh approving of the decision “consult Hugh.”


Making decisions by asking “what action would your owner most approve of?” may be more robust than asking “what outcome would your owner most approve of?” Choosing actions directly has limitations, but these might be overcome by a careful implementation.

More generally, the focus on achieving safe goal-directed behavior may have partially obscured the larger purpose of the AI safety community, which should be achieving safe and useful behavior. It may turn out that goal-directed behavior really is inevitable or irreplaceable, but the case has not yet been settled.

This post was originally posted here on 1st December 2014.

Tomorrow's AI Alignment Forum sequences post will be 'Fixed Point Discussion' by Scott Garrabrant, in the sequence 'Fixed Points'.

The next posts in this sequence will be 'Approval directed bootstrapping' and 'Humans consulting HCH', two short posts which will come out on Sunday 25th November.