In the previous post, I argued that simply knowing that an AI system is superintelligent does not imply that it must be goal-directed. However, there are many other arguments that suggest that AI systems will or should be goal-directed, which I will discuss in this post.

Note that I don’t think of this as the Tool AI vs. Agent AI argument: it seems possible to build agent AI systems that are not goal-directed. For example, imitation learning allows you to create an agent that behaves similarly to another agent -- I would classify this as “Agent AI that is not goal-directed”. (But see this comment thread for discussion.)

Note that these arguments have different implications than the argument that superintelligent AI must be goal-directed due to coherence arguments. Suppose you believe all of the following:

  • Any of the arguments in this post.
  • Superintelligent AI is not required to be goal-directed, as I argued in the last post.
  • Goal-directed agents cause catastrophe by default.

Then you could try to create alternative designs for AI systems such that they can do the things that goal-directed agents can do without themselves being goal-directed. You could also try to persuade AI researchers of these facts, so that they don’t build goal-directed systems.

Economic efficiency: goal-directed humans

Humans want to build powerful AI systems in order to help them achieve their goals -- it seems quite clear that humans are at least partially goal-directed. As a result, it seems natural that they would build AI systems that are also goal-directed.

This is really an argument that the system comprising the human and AI agent should be directed towards some goal. The AI agent by itself need not be goal-directed as long as we get goal-directed behavior when combined with a human operator. However, in the situation where the AI agent is much more intelligent than the human, it is probably best to delegate most or all decisions to the agent, and so the agent could still look mostly goal-directed.

Even so, you could imagine that even the small part of the work that the human continues to do allows the agent to not be goal-directed, especially over long horizons. For example, perhaps the human decides what the agent should do each day, and the agent executes the instruction, which involves planning over the course of a day, but no longer. (I am not arguing that this is safe; on the contrary, having very powerful optimization over the course of a day seems probably unsafe.) This could be extremely powerful without the AI being goal-directed over the long term.

Another example would be a corrigible agent, which could be extremely powerful while not being goal-directed over the long term. (Though the meanings of “goal-directed” and “corrigible” are sufficiently fuzzy that this is not obvious and depends on the definitions we settle on for each.)

Economic efficiency: beyond human performance

Another benefit of goal-directed behavior is that it allows us to find novel ways of achieving our goals that we may not have thought of, such as AlphaGo’s move 37. Goal-directed behavior is one of the few methods we know of that allow AI systems to exceed human performance.

I think this is a good argument for goal-directed behavior, but given the problems of goal-directed behavior I think it’s worth searching for alternatives, such as the two examples in the previous section (optimizing over a day, and corrigibility). Alternatively, we could learn human reasoning, and execute it for a longer subjective time than humans would, in order to make better decisions. Or we could have systems that remain uncertain about the goal and clarify what they should do when there are multiple very different options (though this has its own problems).

Current progress in reinforcement learning

If we had to guess today which paradigm would lead to AI systems that can exceed human performance, I would guess reinforcement learning (RL). In RL, we have a reward function and we seek to choose actions that maximize the sum of expected discounted rewards. This sounds a lot like an agent that is searching over actions for the best one according to a measure of goodness (the reward function [1]), which I said previously is a goal-directed agent. And the math behind RL says that the agent should be trying to maximize its reward for the rest of time, which makes it long-term [2].

That said, current RL agents learn to replay behavior that in their past experience worked well, and typically do not generalize outside of the training distribution. This does not seem like a search over actions to find ones that are the best. In particular, you shouldn’t expect a treacherous turn, since the whole point of a treacherous turn is that you don’t see it coming because it never happened before.

In addition, current RL is episodic, so we should only expect that RL agents are goal-directed over the current episode and not in the long-term. Of course, many tasks would have very long episodes, such as being a CEO. The vanilla deep RL approach here would be to specify a reward function for how good a CEO you are, and then try many different ways of being a CEO and learn from experience. This requires you to collect many full episodes of being a CEO, which would be extremely time-consuming.

Perhaps with enough advances in model-based deep RL we could train the model on partial trajectories and that would be enough, since it could generalize to full trajectories. I think this is a tenable position, though I personally don’t expect it to work since it relies on our model generalizing well, which seems unlikely even with future research.

These arguments lead me to believe that we’ll probably have to do something that is not vanilla deep RL in order to train an AI system that can be a CEO, and that thing may not be goal-directed.

Overall, it is certainly possible that improved RL agents will look like dangerous long-term goal-directed agents, but this does not seem to be the case today and there seem to be serious difficulties in scaling current algorithms to superintelligent AI systems that can optimize over the long term. (I’m not arguing for long timelines here, since I wouldn’t be surprised if we figured out some way that wasn’t vanilla deep RL to optimize over the long term, but that method need not be goal-directed.)

Existing intelligent agents are goal-directed

So far, humans and perhaps animals are the only example of generally intelligent agents that we know of, and they seem to be quite goal-directed. This is some evidence that we should expect intelligent agents that we build to also be goal-directed.

Ultimately we are observing a correlation between two things with sample size 1, which is really not much evidence at all. If you believe that many animals are also intelligent and goal-directed, then perhaps the sample size is larger, since there are intelligent animals with very different evolutionary histories and neural architectures (eg. octopuses).

However, this is specifically about agents that were created by evolution, which did a relatively stupid blind search over a large space, and we could use a different method to develop AI systems. So this argument makes me more wary of creating AI systems using evolutionary searches over large spaces, but it doesn’t make me much more confident that all good AI systems must be goal-directed.

Interpretability

Another argument for building a goal-directed agent is that it allows us to predict what it’s going to do in novel circumstances. While you may not be able to predict the specific actions it will take, you can predict some features of the final world state, in the same way that if I were to play Magnus Carlsen at chess, I can’t predict how he will play, but I can predict that he will win.

I do not understand the intent behind this argument. It seems as though faced with the negative results that suggest that goal-directed behavior tends to cause catastrophic outcomes, we’re arguing that it’s a good idea to build a goal-directed agent so that we can more easily predict that it’s going to cause catastrophe.

I also think that we would typically be able to predict significantly more about what any AI system we actually build will do (than if we modeled it as trying to achieve some goal). This is because “agent seeking a particular goal” is one of the simplest models we can build, and with any system we have more information on, we start refining the model to make it better.

Summary

Overall, I think there are good reasons to think that “by default” we would develop goal-directed AI systems, because the things we want AIs to do can be easily phrased as goals, and because the stated goal of reinforcement learning is to build goal-directed agents (although they do not look like goal-directed agents today). As a result, it seems important to figure out ways to get the powerful capabilities of goal-directed agents through agents that are not themselves goal-directed. In particular, this suggests that we will need to figure out ways to build AI systems that do not involve specifying a utility function that the AI should optimize, or even learning a utility function that the AI then optimizes.


[1] Technically, actions are chosen according to the Q function, but the distinction isn’t important here.

[2] Discounting does cause us to prioritize short-term rewards over long-term ones. On the other hand, discounting seems mostly like a hack to make the math not spit out infinities, and so that learning is more stable. On the third hand, infinite horizon MDPs with undiscounted reward aren't solvable unless you almost surely enter an absorbing state. So discounting complicates the picture, but not in a particularly interesting way, and I don’t want to rest an argument against long-term goal-directed behavior on the presence of discounting.

New Comment
22 comments, sorted by Click to highlight new comments since: Today at 9:15 AM

Note that I don’t think of this as the Tool AI vs. Agent AI argument: it seems possible to build agent AI systems that are not goal-directed. For example, imitation learning allows you to create an agent that behaves similarly to another agent—I would classify this as “Agent AI that is not goal-directed”.

I'm not very convinced by this example, or alternatively I'm not getting the distinction you're drawing between "agent" and "goal-directed". Suppose the agent you're trying to imitate is itself goal-directed. In order for the imitator to generalize beyond its training distribution, it seemingly has to learn to become goal-directed (i.e., perform the same sort of computations that a goal-directed agent would). I don't see how else it can predict what the goal-directed agent would do in a novel situation. If the imitator is not able to generalize, then it seems more tool-like than agent-like. On the other hand, if the imitatee is not goal-directed... I guess the agent could imitate humans and be not entirely goal-directed to the extent that humans are not entirely goal-directed. (Is this the point you're trying to make, or are you saying that an imitation of a goal-directed agent would constitute a non-goal-directed agent?)

Your post reminded me of Paul Christiano's approval-directed agents which was also about trying to find an alternative to goal-directed agents. Looking at it again, it actually sounds a lot like applying imitation learning to humans (except imitating a speeded-up human):

Estimate the expected rating Hugh would give each action if he considered it at length. Take the action with the highest expected rating.

Can approval-directed agents be considered a form of imitation learning, and if not, are there any safety-relevant differences between imitation learning of (speeded-up) humans, and approval-directed agents?

On the other hand, if the imitatee is not goal-directed... I guess the agent could imitate humans and be not entirely goal-directed to the extent that humans are not entirely goal-directed. (Is this the point you're trying to make, or are you saying that an imitation of a goal-directed agent would constitute a non-goal-directed agent?)

I definitely endorse this point, think that it's an important aspect, and that it alone justifies the claim that I was making about non-goal-directed Agent AI being possible.

That said, I do have an intuition that agents whose goal-directedness comes from other agents shouldn't be considered goal-directed, at least if it happens in a particular way. Let's say that I'm pursuing goal X, and my assistant AI agent is also pursuing goal X as a result. If I then start to pursue goal Y, and my AI agent also starts pursuing Y because it is aligned with me, then it feels like the AI was not really directed at goal X, but more directed at "whatever goal Rohin has", and this feels distinctly less goal-directed to me. (In particular, my AI agent would not have all of the convergent instrumental subgoals in this setting, so it is really different in kind from an AI agent that was simply pursuing X to the best of its ability.)

"Goal-directed" may not be the right word to capture the property I'm thinking about. It might be something like "thing that pursues the standard convergent instrumental subgoals", or "thing that pursues a goal that is not defined in terms of someone else's goal".

Your post reminded me of Paul Christiano's approval-directed agents which was also about trying to find an alternative to goal-directed agents.

Yeah, that idea was a big influence on the views that caused me to write this post.

Can approval-directed agents be considered a form of imitation learning, and if not, are there any safety-relevant differences between imitation learning of (speeded-up) humans, and approval-directed agents?

It's not exactly the same, but it is very similar. You could think of approval-direction as imitation of a particular weird kind of human, who deliberates for a while before choosing any action.

They feel different enough to me that there probably are safety-relevant differences, but I don't know of any off the top of my head. Initially I was going to say that myopia was a safety-relevant difference, but thinking about it more I don't think that's an actual difference. Approval-directed agents are more explicitly myopic, but I think imitation learning could be myopic in the same way.

Btw, this post also views Paul's agenda through the lens of constructing imitations of humans.

Btw, this post also views Paul’s agenda through the lens of constructing imitations of humans.

Right, so I think I wasn't really making a new observation, but just clearing up a confusion on my own part, where for a long time I didn't understand how the idea of approval-directed agency fits into IDA because people switched from talking about approval-directed agency to imitation learning (or were talking about them interchangeably) and I didn't catch the connection. So at this point I understand Paul's trajectory of views as follows:

goal-directed agent => approval-directed agent => use IDA to scale up approval-direct agent => approval-directed agency as a form of imitation learning / generalize to other forms of imitation learning => generalize IDA to safely scale up other (including more goal-directed / consequentialist) forms of ML (see An Unaligned Benchmark which I think represents his current views)

(Someone please chime in if this still seems wrong or confused.)

For example, imitation learning allows you to create an agent that behaves similarly to another agent—I would classify this as “Agent AI that is not goal-directed”.

Let's say that I'm pursuing goal X, and my assistant AI agent is also pursuing goal X as a result. If I then start to pursue goal Y, and my AI agent also starts pursuing Y because it is aligned with me, then it feels like the AI was not really directed at goal X, but more directed at "whatever goal Rohin has"

What causes the agent to switch from X to Y?

Are you thinking of the "agent" as A) the product of the demonstrations and training (e.g. the resulting neural network), or as B) a system that includes both the trained agent and also the training process itself (and facilities for continual online learning)?

I would assume A by default, but then I would expect that if you trained such an agent with imitation learning while pursuing goal X, you'd likely get an agent that continues to pursue goal X even after you've switched to pursuing goal Y. (Unless the agent also learned to imitate whatever the decision-making process was that led you to switch from X to Y, in which case the agent seems non-goal-directed only insofar as you decided to switch from X to Y for non-goal-related reasons rather than in service of some higher level goal . Is that what you want?)

Are you thinking of the "agent" as A) the product of the demonstrations and training (e.g. the resulting neural network), or as B) a system that includes both the trained agent and also the training process itself (and facilities for continual online learning)?

I was imagining something more like B for the imitation learning case.

I would assume A by default, but then I would expect that if you trained such an agent with imitation learning while pursuing goal X, you'd likely get an agent that continues to pursue goal X even after you've switched to pursuing goal Y. (Unless the agent also learned to imitate whatever the decision-making process was that led you to switch from X to Y, in which case the agent seems non-goal-directed only insofar as you decided to switch from X to Y for non-goal-related reasons rather than in service of some higher level goal Ω. Is that what you want?)

That analysis seems right to me.

With respect to whether it is what I want, I wouldn't say that I want any of these things in particular, I'm more pointing at the existence of systems that aren't goal-directed, yet behave like an agent.

With respect to whether it is what I want, I wouldn't say that I want any of these things in particular, I'm more pointing at the existence of systems that aren't goal-directed, yet behave like an agent.

Would you agree that a B-type agent would be basically as goal-directed as a human (because it exhibits goal-directed behavior when the human does, and doesn't when the human doesn't)?

In which case, would it be fair to summarize (part of) your argument as:

1) Many of the potential problems with building safe superintelligent systems comes from them being too goal-directed.

2) An agent that is only as goal-directed as a human is much less susceptible to many of these failure modes.

3) It is likely possible to build superintelligent systems that are only as goal-directed as humans.

?

I don't think so. Maybe this would be true if you had a perfect imitation of a human, but in practice you'll be uncertain about what the human is going to do. If you're uncertain in this way, and you are getting your goals from a human, then you don't do all of the instrumental subgoals. (See The Off-Switch Game for a simple analysis showing that you can avoid the survival incentive.)

It may be that "goal-directed" is the wrong word for the property I'm talking about, but I'm predicting that agents of this form are less susceptible to convergent instrumental subgoals than humans are.

To clarify, you do do the human's instrumental sub-goals though, just not extra ones for yourself, right?

If you've seen the human acquire resources, then you'll acquire resources in the same way.

If there's now some new resource that you've never seen before, you may acquire it if you're sufficiently confident that the human would, but otherwise you might try to gather more evidence to see what the human would do. This is assuming that we have some way of doing imitation learning that allows the resulting system to have uncertainty that it can resolve by watching the human, or asking the human. If you imagine the exact way that we do imitation learning today, it would extrapolate somehow in a way that isn't actually what the human would do. Maybe it acquires the new resource, maybe it leaves it alone, maybe it burns it to prevent anyone from having it, who knows.

They feel different enough to me that there probably are safety-relevant differences

It looks like imitation learning isn't one thing but a fairly broad category in ML which even includes IRL. But if we compare approval direction to the narrower kinds of imitation learning, approval direction seems a lot riskier because you're optimizing over an estimation of human approval, which seems to be an adversarial process that could easily trigger safety problems in both the ground-truth human approval as well as in the estimation process. I wonder when you wrote the OP, which form of imitation learning did you have in mind?

ETA: From this comment it looks like you were thinking of an online version of narrow imitation learning. Might be good to clarify that in the post?

But if we compare approval direction to the narrower kinds of imitation learning, approval direction seems a lot riskier because you're optimizing over an estimation of human approval, which seems to be an adversarial process that could easily trigger safety problems in both the ground-truth human approval as well as in the estimation process.

But if there are safety problems in approval, wouldn't there also be safety problems in the human's behavior, which imitation learning would copy?

Similarly, if there are safety problems in the estimation process, wouldn't there also be safety problems in the prediction of what action a human would take?

From this comment it looks like you were thinking of an online version of narrow imitation learning. Might be good to clarify that in the post?

I somewhat think that it applies to most imitation learning, not just the online variant of narrow imitation learning, but I am pretty confused/unsure. I'll add a pointer to this discussion to the post.

But if there are safety problems in approval, wouldn’t there also be safety problems in the human’s behavior, which imitation learning would copy?

The human's behavior could be safer because a human mind doesn't optimize so much as to move outside of the range of inputs where approval is safe, or it has a "proposal generator" that only generates possible actions that with high probability stay within that range.

Similarly, if there are safety problems in the estimation process, wouldn’t there also be safety problems in the prediction of what action a human would take?

Same here, if you just predict what action a human would take, you're less likely to optimize so much that you likely end up outside of where the estimation process is safe.

I somewhat think that it applies to most imitation learning, not just the online variant of narrow imitation learning, but I am pretty confused/unsure.

Ok, I'd be interested to hear more if you clarify your thoughts.

Can approval-directed agents be considered a form of imitation learning, and if not, are there any safety-relevant differences between imitation learning of (speeded-up) humans, and approval-directed agents?

I found an old comment from Paul that answers this:

I think that the only reason to be interested in approval-directed agents rather than straightforward imitation learners is that it may be harder to effectively imitate behavior than to solve the same task in a very different way.

Your post reminded me of Paul Christiano's approval-directed agents which was also about trying to find an alternative to goal-directed agents. Looking at it again, it actually sounds a lot like applying imitation learning to humans (except imitating a speeded-up human):

It seems like approval direction allows for creative actions that the human operator approves of but would not have thought of doing themselves. Not sure if imitation learning does this.

That's a good question. It looks like imitation learning actually covers a number of ML techniques (see this) none of which exactly matches approval-directed agents. But the category seems broad enough that I think approval-directed agents can be considered to be a form of imitation learning. In particular, IRL is considered a form of imitation learning and IRL would also be able to perform actions that the human would not have thought of doing themselves.

^ Yes to all of this.

A little bit of nuance: IRL is considered to be a form of imitation learning because in many cases the inferred reward in IRL is only meant to reproduce the human's performance and isn't expected to generalize outside of the training distribution.

There are versions of IRL which are meant to go beyond imitation. For example, adversarial IRL was trying to infer a reward that would generalize to new environments, in which case it would be doing something more than imitation.

Suppose the agent you're trying to imitate is itself goal-directed. In order for the imitator to generalize beyond its training distribution, it seemingly has to learn to become goal-directed (i.e., perform the same sort of computations that a goal-directed agent would). I don't see how else it can predict what the goal-directed agent would do in a novel situation. If the imitator is not able to generalize, then it seems more tool-like than agent-like. On the other hand, if the imitatee is not goal-directed... I guess the agent could imitate humans and be not entirely goal-directed to the extent that humans are not entirely goal-directed. (Is this the point you're trying to make, or are you saying that an imitation of a goal-directed agent would constitute a non-goal-directed agent?)

I'm not sure these are the points Rohin was trying to make, but there seem to be at least two important points here:

  • Imitation learning applied to humans produces goal-directed behavior only insofar humans are goal-directed
  • Imitation learning applied to humans produces agents no more capable than humans. (I think IDA goes beyond this by adding amplification steps, which are separate. And IRL goes beyond this by trying to correct "errors" that the humans make.)

Regarding the second point, there's a safety-relevant sense in which a human-imitating agent is less goal-directed than the human. Because if you scale the human's capabilities, the human will become better at achieving its personal objectives. By contrast, if you scale the imitator's capabilities, it's only supposed to become even better at imitating the unscaled human.

In addition, current RL is episodic, so we should only expect that RL agents are goal-directed over the current episode and not in the long-term.

Is this true? Since ML generally doesn't choose an algorithm directly but runs a search over a parameter space, it seems speculative to assume that the resulting model, if it is a mesa-optimizer and goal-directed, only cares about its episode. If it learned that optimizing for X is good for reward, it seems at least conceivable that it won't understand that it shouldn't care about instances of X that appear in future episodes.

A few points:

1. It's not clear that the current deep RL paradigm would lead to a mesa optimizer. I agree it could happen, but I would like to see an argument as to why it is likely to happen. (I think there is probably a stronger case that any general intelligence we build will need to be a mesa optimizer and therefore goal-directed, and if so that argument should be added to this list.)

2. Even if we did get a mesa optimizer, the base optimizer (e.g. gradient descent) would plausibly select for mesa optimizers that care only up till the end of the episode. A mesa optimizer that wasn't myopic in this way might spend the entire episode learning and making money that it can use in the future, and as a result get no training reward, and so would be selected against by the outer optimizer.

Here are a few more reasons for humans to build goal-directed agents:

  1. Goal directed AI is a way to defend against value drift/corruption/manipulation. People might be forced to build goal directed agents if they can't figure out another way to do that.

  2. Goal directed AI is a way to cooperate and thereby increase economic efficiency and/or military competitiveness. (A group of people can build a goal directed agent that they can verify represents an aggregation of their values.) People might be forced to build or transfer control to goal directed agents in order to participate in such cooperation to remain competitive, unless they can figure out another way to cooperate that is as efficient as this.

  3. Goal directed AI is a way to address other human safety problems. People might trust an AI with explicit and verifiable values more than an AI that is controlled by a distant stranger.

As I understand it, the first one is an argument for value lock in, and the third one is an argument for interpretability, does that seem right to you?

For the first one, I guess I would use "argument for defense against value drift" instead since you could conceivably use a goal-directed AI to defend against value drift without lock in, e.g., by doing something like Paul Christiano's 2012 version of indirect normativity (which I don't think is feasible but maybe there's something like it that is, like my hybrid approach, if you consider that goal-directed).

For the third one, I guess interpretability is part of it, but a bigger problem is that it seems hard to make a sufficiently trustworthy human overseer even if we could "interpret" them. In other words, interpretability for a human might just let us see exactly why we shouldn't trust them.