Goal-directedness is behavioral, not structural

by Adam Shimi · 3 min read · 8th Jun 2020 · 2 comments



Goal-directedness is the term used by the AI Safety community to point to a specific property: following a goal. It comes from Rohin Shah's post in his sequence, but the intuition pervades many safety issues and current AI approaches. Yet it lacks a formal definition, or even a decomposition into more or less formal subcomponents.

The questions we want to answer about goal-directed systems determine the sort of definition we're looking for. Rohin asks two main questions in his posts:

  • Are non-goal-directed systems, or less goal-directed ones, inherently safer than fully goal-directed ones?
  • Can non-goal-directed systems, or less goal-directed ones, be competitive with fully goal-directed ones?

Answering these will also answer the really important meta-question: should we put resources into non-goal-directed approaches to AGI?

Notice that both questions above are about predicting properties of the system based on its goal-directedness. These properties we care about depend only on the behavior of the system, not on its internal structure. It thus makes sense to consider that goal-directedness should also depend only on the behavior of the system. For if it didn't, then two systems with the same properties (safety, competitiveness) would have different goal-directedness, breaking the pattern of prediction.

Actually, this assumes that our predictor is injective: it sends different "levels" of goal-directedness to different values of the properties. I agree with this intuition, given how much performance and safety issues seem to vary according to goal-directedness. But I wanted to make it explicit.
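The argument and its injectivity assumption can be spelled out in a few lines. This is only a sketch: the symbols B, G, P, p and f are notation introduced here for illustration and do not come from Rohin's posts.

```latex
% Assumed notation (introduced for this sketch only):
%   B(s): complete behavior of system s
%   G(s): goal-directedness of s
%   P(s): a property of interest (safety, competitiveness)
\begin{align*}
  &\text{(1) Properties depend only on behavior:}    && P(s) = p(B(s)) \\
  &\text{(2) Goal-directedness predicts properties:} && P(s) = f(G(s)) \\
  &\text{(3) If } f \text{ is injective:}            && G(s) = f^{-1}(P(s)) = f^{-1}(p(B(s)))
\end{align*}
% Hence B(s_1) = B(s_2) implies G(s_1) = G(s_2).
```

Without injectivity, (1) and (2) alone would still allow two systems with the same behavior to have different values of G, as long as f sends both values to the same property.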

Reiterating the point of the post: goal-directedness is a property of behavior, not internal structure. By this I mean that given the complete behavior of a system over all environments, goal-directedness is independent of what's inside the system. Or equivalently, if two systems always behave in the same way, their goal-directedness is the same, regardless of whether one contains a big lookup table and the other a homunculus.
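To make the lookup table versus homunculus comparison concrete, here is a minimal sketch in a toy world. Everything in it is an assumption made for illustration: the five-position line world, the names `lookup_system` and `planner_system`, and especially `behavioral_score`, which is a hypothetical stand-in for "some measure that only looks at behavior" and not a proposed definition of goal-directedness.

```python
from typing import Callable, Dict

State = int
Action = int  # -1 (move left) or +1 (move right)
Policy = Callable[[State], Action]

STATES = range(5)  # toy line world: positions 0..4 (assumption)
TARGET = 4         # hypothetical goal position


def step(state: State, action: Action) -> State:
    """Environment dynamics: move, then clamp to the ends of the line."""
    return min(max(state + action, 0), 4)


# System 1: a plain lookup table; no goal is represented anywhere inside it.
TABLE: Dict[State, Action] = {0: 1, 1: 1, 2: 1, 3: 1, 4: 1}


def lookup_system(state: State) -> Action:
    return TABLE[state]


# System 2: a tiny "homunculus" that represents TARGET explicitly and
# picks the action whose successor state is closest to it.
def planner_system(state: State) -> Action:
    return min((1, -1), key=lambda a: abs(TARGET - step(state, a)))


def complete_behavior(system: Policy) -> Dict[State, Action]:
    """The full state-to-action map, obtained only by querying the system."""
    return {s: system(s) for s in STATES}


def behavioral_score(system: Policy, horizon: int = 10) -> float:
    """Fraction of start states from which TARGET is reached within `horizon`
    steps. It accesses the system only through system(state)."""
    reached = 0
    for start in STATES:
        state = start
        for _ in range(horizon):
            if state == TARGET:
                break
            state = step(state, system(state))
        reached += state == TARGET
    return reached / len(STATES)


same = complete_behavior(lookup_system) == complete_behavior(planner_system)
print(same, behavioral_score(lookup_system), behavioral_score(planner_system))
```

Because `behavioral_score` never inspects how a system is implemented, two systems with the same complete behavior necessarily receive the same score. The same holds for any other measure written against the `Policy` interface alone.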

This is not particularly original: Dennett's intentional stance says pretty much the same thing (The Intentional Stance, p. 15):

Then I will argue that any object -- or as I shall say, any system -- whose behavior is well predicted by this strategy [considering it as moving towards a goal] is in the fullest sense of the word a believer. What it is to be a true believer is to be an intentional system, a system whose behavior is reliably and voluminously predictable via the intentional strategy.

Why write a post about it, then? I'm basically saying that our definition should depend only on observable behavior, which is pretty obvious, isn't it?

Well, "goal" is a very loaded term. It belongs to the set of mental states that we attribute to human beings and other agents, but that we are reluctant to attribute to anything else. See how I never used the word "agent" before in this post, preferring "system" instead? That was me trying to limit this instinctive thinking about what's inside. And here is the reason why I think this post is not completely useless: when looking for a definition of goal-directedness, the first intuition is to look at the internal structure. It seems obvious that goals should be somewhere "inside" the system, and thus that what really matters is the internal structure.

But as we saw above, goal-directedness should probably depend only on the complete behavior of the system. That is not to say that the internal structure is unimportant or useless here. On the contrary, this structure, in the form of source code for example, is usually the only thing we have at our disposal. It serves to compute goal-directedness, not to define it.

We thus have this split:

  • Defining goal-directedness: depends only on the complete behavior of the system, and probably assumes infinite compute and resources.
  • Computing goal-directedness: depends on the internal structure, and more specifically what information about the complete behavior can be extracted from this structure.
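One way to picture this split is a toy case where the same quantity is obtained in two ways: once from the complete behavior, and once from the internal structure alone. The sketch below rests on stated assumptions. The line world, the `ThresholdSystem` class and the "moves toward the target" score are all hypothetical and chosen only to keep the example small; the score is not meant as a definition of goal-directedness.

```python
from dataclasses import dataclass

N = 10      # toy line world: positions 0..N-1 (assumption)
TARGET = 6  # hypothetical goal position


@dataclass(frozen=True)
class ThresholdSystem:
    """Internal structure: move right (+1) below threshold k, left (-1) otherwise."""
    k: int

    def act(self, state: int) -> int:
        return 1 if state < self.k else -1


def defined_score(system) -> float:
    """'Defining': uses only the complete behavior, by querying every state.
    Returns the fraction of non-target states where the action points at TARGET."""
    toward = 0
    for s in range(N):
        if s == TARGET:
            continue
        wanted = 1 if s < TARGET else -1
        toward += system.act(s) == wanted
    return toward / (N - 1)


def computed_score(system: ThresholdSystem) -> float:
    """'Computing': reads k from the structure and never calls act()."""
    k = min(max(system.k, 0), N)       # thresholds outside the world behave like 0 or N
    right_ok = min(k, TARGET)          # states below TARGET that move right
    left_ok = N - max(k, TARGET + 1)   # states above TARGET that move left
    return (right_ok + left_ok) / (N - 1)


system = ThresholdSystem(k=4)
print(defined_score(system), computed_score(system))
```

Here `defined_score` plays the role of the definition: it is stated purely in terms of behavior, and its cost grows with the size of the state space. `computed_score` plays the role of the computation: it is only correct because of what the structure tells us about the behavior, and it is judged against the behavioral definition rather than the other way around.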

What I see as a mistake here, a mistake I personally made, is to look for the definition in the internal structure: to look at some neural net, or some C program, and try to find where the goals are and what makes the program follow them. Instead, I think we should define and formalize goal-directedness from the ideal context of knowing the full behavior of the system, and then use interpretability and formal methods to extract what's relevant to this definition from the internal structure.

Thanks to Jérémy Perret for feedback on the writing, and to Joe Collman, Michele Campolo and Sabrina Tang for feedback on the idea.


2 comments

Attempting to approach goal-directedness behaviorally is, I expect, going to run into the same problems as trying to infer policy from behaviors only: you can't do it unless you make some normative assumption. This is exactly analogous to Armstrong's No Free Lunch theorem for value learning. Turning it around the other way, we can similarly assign any goal whatsoever to a system based solely on its behavior, unless we make some sufficiently strong normative assumption about it.

That's a very good point. I actually think we can avoid this problem, due to a couple of things:

  • As I mentioned in another comment, what I mean by "behaviorally" is not simply looking at the behavior: it also includes taking the intentional stance towards the system, and therefore making rather strong normative assumptions about it.
  • If we use focus, then not all systems are maximally focused towards all goals. Where I think the problem creeps back in is in the fact that many goals (like the one containing all states, which means intuitively that the goal is to reach any state) will be maximally focused for many if not all systems. My attempt at an answer is the triviality measure of the goal, as a counterweight. But it's still possible in theory to have two goals of equivalent triviality and equivalent focus; in that case I don't really know yet how to "choose".