Deconfusing goal-directedness would boost your favorite research approach for solving AI Alignment.

Why? Because every approach I know of stands to gain from the clarification of goal-directedness, from Prosaic AGI Alignment to Agent Foundations. In turn, this ubiquitous usefulness of goal-directedness motivates the writing of this sequence, which will include a literature review of the idea in the AI Safety literature and beyond, as well as advanced explorations of goal-directedness by me and collaborators Michele Campolo and Joe Collman.

But before that, I need to back up my provocative thesis. This is why this post exists: it compiles reasons to care about goal-directedness, from the perspective of every research approach and direction I could think of. Although not all reasons given are equally straightforward, none feels outrageously far-fetched to me.

I thus hope that by the end of this post, you will agree that improving our understanding of goal-directedness is relevant for you too.

Thanks to Michele Campolo and Joe Collman for many research discussions, and feedback on this post. Thanks to Alexis Carlier, Evan Hubinger, and Jérémy Perret for feedback on this post.

Meaning of Deconfusion

Before giving you the reasons for caring about goal-directedness, I need to synchronize our interpretations of “deconfusion”. The term comes from MIRI, and specifically this blog post; it captures the process of making a concept clear and explicit enough to have meaningful discussions about it. So it’s not about solving all problems related to the concept, or even formalizing it perfectly (although that would be nice) -- just about allowing coherent thinking. To quote Nate Soares (MIRI’s Executive Director, and the author of the linked blog post):

By deconfusion, I mean something like “making it so that you can think about a given topic without continuously accidentally spouting nonsense.”

What would that look like for goal-directedness? At first approximation, the idea simply means the property of trying to accomplish a goal. Which feels rather simple. But after digging deeper, issues and subtleties emerge: the difference between having a goal and being competent at accomplishing it (discussed here), what should count as a goal (discussed here), which meaningful classes of goals exist (discussed here), and many others.

Thus the concept is in dire need of deconfusion. Such clarification could take many forms, including:

  • A mathematical formalization
  • A decomposition into formalized components
  • A decomposition into simpler and less confused informal components
  • A list of accepted examples with different levels of goal-directedness
  • A list of properties and their link with the intuitions behind goal-directedness
  • And many more variants

Obviously, only time will reveal the form of our results on goal-directedness. Still, it’s valuable to keep in mind the multitude of shapes they could take.

Reasons To Care

Let’s be honest: just listing research approach after research approach, and my reason for the relevance of goal-directedness to them, might be too much information to take in one reading. Fortunately, the reasons I found show some trends, and fit neatly into three groups.

  • (Overseeing) In some cases, alignment comes from supervisors and overseers that monitor the AI during training. Goal-directedness is a natural and fundamental property to check, because of its many negative consequences. So deconfusion would facilitate checks of this important property by overseers and supervisors, and thus improve every approach depending on monitoring.
  • (Additional Structure on Utility/Reward Functions) Many approaches to alignment rely on utility functions and reward functions to capture goals and values. Such representations are powerful, but so general that maximizing a utility function or a reward function doesn’t reveal much about whether the system actually follows a goal or not (see the discussion here and here).
    Furthering our understanding of goal-directedness could reveal more structure to add on these representations of goals, making the pursuit of such a “goal” more closely tied to being goal-directed.
  • (Natural Mathematical Abstraction) When attempting to formalize and clarify many aspects of decision making, AI and alignment, concepts like agency and optimization play a big role. Goal-directedness naturally relates to both, because agents are generally considered goal-directed, and so are explicit optimizers doing internal search. Thus goal-directedness should intuitively play a role in these formalizations, whether as a building block, a metric or an example to draw from.


Overseeing

The reasons from this section assume the use of an overseer. This is common for approaches affiliated with Prosaic AI Alignment, where the gist of alignment emerges from training constraints that forbid, push and monitor specific behaviors.

Interpretability and Formal Methods

Interpretability is one way to monitor an AI: it studies how the learned models work, and how to interpret and explain them. Similarly, formal methods (applied to AI) take a formal specification, a model of computation and an AI, and verify whether the AI follows the specification when executed on this model of computation.

Ultimately, both interpretability and formal methods try to check properties of trained models, notably neural networks. Goal-directedness is an example of an important property to look for, as discussed above. And deconfusing goal-directedness would move us towards finding a specification of this property.

(Interpretability à la Clarity Team at OpenAI (for example here) might also prove important in deconfusing goal-directedness, by letting us look into and compare systems with various levels of goal-directedness)

IDA and Debate

Iterated Distillation and Amplification (IDA) and AI Safety via Debate (Debate) are two alignment schemes proposed respectively by Paul Christiano and Geoffrey Irving, and extended by many others.

IDA attempts to align a superintelligent AI by starting from a simple AI, amplifying it (training a second AI to imitate the human supervisor using the simple AI), and then distilling this amplified version (by training a simpler model to imitate the amplified AI) into a new AI that can be used by the human supervisor. Hopefully, repeating this will eventually create an AI with superhuman capabilities, while maintaining alignment.

Debate, on the other hand, places the human supervisor as a judge of a debate between two AIs. The value of debate comes from extending the reach of human feedback: judging a debate (which only presents bits and pieces of the arguments) is intuitively easier than checking a complete solution, which is easier than finding a solution. Whether or not debate works hinges on the importance of honesty for optimal play, and other theoretical and empirical questions about human evaluation of debates.

The current approach to IDA, and some approaches to Debate, rely on the same building block, factored cognition: answering questions by splitting them into subquestions given to other AIs with some but not all the context, with recursive splitting allowed. And a big question with this factorization is how to do it without allowing one subAI to manipulate the whole decision.
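The recursive splitting described above can be pictured with a toy sketch. This is only an illustration under strong assumptions: the "question" is summing a list, the function name and the `max_chunk` parameter are hypothetical, and real factored cognition involves far richer questions and learned sub-agents. What it does show is the shape of the scheme: each sub-call sees only its own slice of the context, and splitting recurses until the subquestion is small enough to answer directly.

```python
from typing import List


def answer(numbers: List[int], max_chunk: int = 2) -> int:
    """Answer "what is the sum?" by recursive decomposition.

    Each recursive call plays the role of a sub-agent that receives only
    part of the context (its slice of the list), never the whole input.
    Assumes max_chunk >= 1.
    """
    if len(numbers) <= max_chunk:
        # Small enough: answer the subquestion directly.
        return sum(numbers)
    mid = len(numbers) // 2
    # Split the question into two subquestions with partial context.
    left = answer(numbers[:mid], max_chunk)
    right = answer(numbers[mid:], max_chunk)
    # Combine the sub-answers into the answer to the original question.
    return left + right
```

In this toy, no sub-call can influence the final answer beyond its own slice; the manipulation worry is precisely that real sub-agents are not so well-behaved.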

Deconfusing goal-directedness would help, in that checking the level of goal-directedness of the subAIs, or what kind of goals they might have, correlates with checking non-manipulation. Indeed, manipulation probably requires incentives to make long term plans towards something, which looks, smells and sounds like goal-directedness.

Ascription Universality

Universality (see also my own explanation) is a property of an overseer that matters a great deal to approaches like IDA and Debate. Basically, an overseer O is ascription universal for a computation C with regard to an epistemic viewpoint (usually the human supervisor H) if, whatever finite information about O and C is provided to H, the latter will privilege the probabilistic predictions of O over the ones of C. This is supposed to capture the fact that O is able to reveal the different beliefs or reasons that might be ascribed to C to explain it.

One very natural way to ascribe beliefs is through assuming goals and a desire to accomplish them. This is basically the intentional stance of Daniel Dennett, a classical framework to understand goal-directedness, and deconfusing goal-directedness would put such an ascription of beliefs on more formal and computational grounds.


Inner Alignment

The inner alignment problem, as stated in Risks from Learned Optimization, concerns learned optimizers: optimizers that result from an optimization process, like gradient descent. Such learned optimizers are called mesa-optimizers, and the inner alignment research direction studies two questions: under which conditions mesa-optimizers appear, and how aligned the mesa-objective is with the base objective of the initial optimization process.

One common criticism of mesa-optimizers is that their objective (the mesa-objective) is assumed to be explicitly represented in the system (see Richard Ngo’s post here, for example). This is a bit weird, and looks more like a simplifying assumption than a property expected in actual systems. Moreover, maybe some of the reasoning about mesa-optimizers still works when considering learned goal-directed systems instead, as proposed in this comment by Tom Everitt.

By deconfusing goal-directedness, we might find properties of goal-directed systems and use them for studying the previous question: does the reasoning from Risks from Learned Optimization go through when considering learned models with these properties, instead of learned explicit optimizers? Such an investigation could shift the focus of research on inner alignment, or reinforce the importance of internal optimization. 

Deceptive Alignment

One especially worrying case of mesa-optimization is deceptive alignment: a mesa-optimizer with a different objective from the base objective, but which is competent enough to deceive us into believing that it is pursuing the base objective. This can take many forms, from behaving nicely during training and defecting when put in the world, up to more outlandish ideas like gradient hacking.

The foremost proposals for dealing with deceptive alignment rely on relaxed adversarial training: using the overseer feedback (on the risk of catastrophic behavior) as part of the training signal. But getting this overseer feedback right, notably the detection of deceptiveness, proves difficult.

One possibility is to use myopia. Intuitively, myopia is supposed to capture the property that a system only makes short-term plans. Then the hope is that deceptive systems should probably be non-myopic. Thus we could detect non-myopia instead of deception, which is hopefully easier, and get the overseer feedback necessary for relaxed adversarial training.

Now, defining myopia is an open research problem -- see these two posts for pointers. Yet there is an interesting connection with goal-directedness: multiple researchers consider long-term goals as an important component of goal-directedness. Hence disentangling the various intuitions about goal-directedness could help deconfuse the idea of long-term goals, which in turn would help tremendously for deconfusing myopia.

Additional Structure on Utility/Reward Functions

Reasons in this section apply to a broader range of alignment proposals. Their common thread is to assume that utility functions or reward functions are used to capture goals and values.

Agent Incentives

The Safety Team at DeepMind wrote many different papers on agent incentives; specifically, on observation and intervention incentives that come from having a specific goal. Assuming a causal graph of the system and a goal, graphical criteria exist to find which nodes would be useful to monitor (observation incentives), and which nodes would be useful to control (intervention incentives). For goals, these papers consider controlling a utility node in the causal graph. That is, this research places itself within the framework of expected utility maximization.

As mentioned before, utility functions look too general to capture exactly what we mean by goals: every system can be seen as maximizing some utility function, even those intuitively not goal-directed. Deconfusing goal-directedness might allow the derivation of more structure for goals, which could be applied to these utility functions. The goals studied in this approach would then model more closely those of actual goal-directed systems, allowing in turn the derivation of incentives for more concrete and practical settings.
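To see why utility functions alone are too general, here is a minimal sketch of the standard construction. It assumes a simplified setting where utility is defined over state-action pairs (the usual argument is stated over whole trajectories), and every name in it is hypothetical. Given any policy whatsoever, it builds a utility function that this policy maximizes by construction.

```python
from typing import Callable, Hashable, Iterable

State = Hashable
Action = Hashable
Policy = Callable[[State], Action]
Utility = Callable[[State, Action], float]


def rationalizing_utility(policy: Policy) -> Utility:
    """Build a utility function that `policy` maximizes by construction."""
    def utility(state: State, action: Action) -> float:
        # Reward exactly the action the policy would have taken anyway.
        return 1.0 if action == policy(state) else 0.0
    return utility


def is_maximizer(policy: Policy, utility: Utility,
                 states: Iterable[State], actions: Iterable[Action]) -> bool:
    """Check that the policy picks a utility-maximizing action in every state."""
    actions = list(actions)
    return all(
        utility(s, policy(s)) == max(utility(s, a) for a in actions)
        for s in states
    )


def twitcher(state: State) -> Action:
    """A system that intuitively has no goal: it always does the same thing."""
    return "twitch"


u = rationalizing_utility(twitcher)
assert is_maximizer(twitcher, u, states=range(5), actions=["twitch", "left", "right"])
```

The check passes for any policy, which is the point: "maximizes some utility function" does not distinguish the twitcher from a system we would intuitively call goal-directed.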

Value Learning

Value Learning is a pretty broad idea, which boils down to learning what we don’t want the AI to mess up (our values), instead of trying to formalize them ourselves. This includes the reward modeling agenda at DeepMind, work on Cooperative Inverse Reinforcement Learning and Inverse Reward Design at CHAI, Stuart Armstrong’s research agenda and G Gordon Worley III’s research agenda, among others.

For all of these, the main value of deconfusing goal-directedness is the same: learning values usually takes the form of learning a utility function or a reward function, that is, something similar to a goal. But values probably share much of the structure of goals. Such structure could be added to utility functions or reward functions to model values, if we had a better understanding of goal-directedness.

Impact Measures

Impact measures provide metrics for the impact of specific actions, notably catastrophic impact. Such an impact measure can be used to ensure that even a possibly misaligned AI will not completely destroy all value (for us) on Earth and the universe. There are many different impact measures, but I’ll focus on Alex Turner’s Attainable Utility Preservation (AUP), which is the one I know best and the one which has been discussed the most in recent years.

Attainable Utility Preservation ensures that the attainable utilities (how much value can be reached) for a wide range of goals (reward/utility functions) stay the same or improve after each action of the AI. This should notably remove the incentives for power-seeking, and thus many of the catastrophic unaligned behaviors of AI (while not solving alignment itself).
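A rough sketch of what such a penalty can look like, under several assumptions of mine: attainable utilities are given as tabular estimates for a handful of auxiliary reward functions, there is a designated no-op action to compare against, and the penalty is the mean absolute change relative to that no-op (a one-sided variant would penalize only decreases). The function names, the `lam` weight and the toy numbers are all hypothetical, and the scaling used in published versions is omitted.

```python
from typing import Mapping, Sequence, Tuple

# Hypothetical tabular estimate: (state, action) -> attainable utility.
QTable = Mapping[Tuple[str, str], float]


def aup_penalty(q_aux: Sequence[QTable], state: str, action: str,
                noop: str = "noop") -> float:
    """Mean absolute change in attainable utility, relative to doing nothing."""
    diffs = [abs(q[(state, action)] - q[(state, noop)]) for q in q_aux]
    return sum(diffs) / len(diffs)


def aup_reward(primary_reward: float, penalty: float, lam: float = 1.0) -> float:
    """Primary reward minus the weighted impact penalty."""
    return primary_reward - lam * penalty


# Toy example with two auxiliary goals in a single state "s".
q1 = {("s", "noop"): 1.0, ("s", "smash"): 0.0, ("s", "walk"): 1.0}
q2 = {("s", "noop"): 2.0, ("s", "smash"): 0.5, ("s", "walk"): 2.0}

assert aup_penalty([q1, q2], "s", "walk") == 0.0    # no change for either goal
assert aup_penalty([q1, q2], "s", "smash") == 1.25  # (1.0 + 1.5) / 2
```

The auxiliary goals here are bare reward tables, which is exactly where more structure on goals could come in.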

You guessed it, here too the value of goal-directedness comes from defining goals with more structure than simple utility or reward functions. Among other things, this might help extend AUP to more realistic environments.

Natural Mathematical Abstraction

Lastly, these reasons concern the Agent Foundations part of AI alignment research. They thus assume a focus on formalization, with applications to practical problems of alignment.

Mathematical Theory of RL and Alignment

Vanessa Kosoy from MIRI has been the main proponent of the creation of a mathematical theory of RL and alignment. Her point of view focuses on deriving formal guarantees about alignment in a learning theoretic setting, and this requires a theory of RL dealing with issues like non-realizability and traps.

Such guarantees will probably depend on the goal-directedness of the system, as different levels of goal-directedness should produce different behaviors. So knowing how to capture these levels will ground the dependency of the guarantees on them.

(Note that Vanessa already has her own definition of goal-directed intelligence, which doesn’t seem to completely deconfuse goal-directedness, but may be sufficient for her research).

Embedded Agency

Embedded Agency is a broad class of research directions that focus on dealing with theoretical issues linked to embeddedness -- the fact that the AI inhabits the world on which it acts, as opposed to dualistic models in which the AI and the environment are cleanly separated. The original research agenda carves out four subproblems: Decision Theory, Embedded World Models, Robust Delegation and Subsystem Alignment. I’ll focus on Embedded World Models, which has the clearest ties to goal-directedness. That being said, the others might have some links -- for example Subsystem Alignment is very close to Inner Alignment and Deceptive Alignment, which I already mentioned.

Embedded World Models ask specifically how to represent the world as a model inside the agent. Trouble comes from self-reference: since the agent is part of the world, so is its model, and thus a perfect model would need to represent itself, and this representation would need to represent itself, ad infinitum. So the model cannot be exact. Another issue comes from the lack of a hardcoded agent/environment boundary: the model needs to add it in some way.

Understanding goal-directedness would hopefully provide a representation of systems with goals in a compressed way. This helps both with the necessary imprecision of the map (notably because the AI can model itself this way) and to draw a line between such systems and the complex world they inhabit.


Abstraction

John S. Wentworth’s research on abstraction centers around one aspect of Embedded World Models: what can be thrown out of the perfect model to get a simpler non-self-referential model (an abstraction) that is useful for a specific purpose?

Using goal-directedness for modelling systems in a compressed way is an example of a natural abstraction. Searching for a definition of goal-directedness is thus directly relevant to abstraction research, both because of its potential usefulness for building abstractions, and because it’s such a fundamental abstraction that it might teach us some lessons on how to define, study and use abstractions in general.


To summarize, for a broad range of research agendas and approaches, deconfusing goal-directedness is at least partially relevant, and sometimes really important. The reasons behind that statement fit into three categories:

  • Helping an overseer to check for issues during training.
  • Adding structure to utility functions/reward functions to make them behave more like goals.
  • Abstracting many important systems into a compressed form.

So you should probably care about goal-directedness; even without working on it, taking stock of what has been done on this question might impact your research.

The next post in this sequence lays the groundwork for such considerations, by reviewing the literature on goal-directedness: the intuitions behind it, the proposed definitions, and the debates over the shape of a good solution to the problem.

13 comments

A few other ways in which goal-directedness intersects with abstraction:

  • abstraction as an instrumentally convergent tool: to the extent that computation is limited but the universe is local, we'd expect abstraction to be used internally by optimizers of many different goals.
  • instrumental convergence to specific abstract models: the specific abstract model used should be relatively insensitive to variation in the goal.
  • type signature of the goal: to the extent that humans are goal-directed, our goals involve high-level objects (like cars or trees), not individual atoms.
  • embedded agency = abstraction + generality + goal-directedness. Roughly speaking, an embedded agent is a low-level system which abstracts into a goal-directed system, and that goal-directed system can operate across a wide range of environments requiring different behaviors.

what can be thrown out of the perfect model to get a simpler non-self-referential model (an abstraction) that is useful for a specific purpose?

Kind of tangential, but it's actually the other way around. The low-level world is "non-self-referential"; the universe itself is just one big causal DAG. In order to get a compact representation of it (i.e. a small enough representation to fit in our heads, which are themselves inside the low-level world), we sometimes throw away information in a way which leaves a simpler "self-referential" abstract model. This is a big part of how I think about agenty things in a non-agenty underlying world.

Thanks for the additional ideas! I especially concur about the type signature of goals and the instrumental convergence to abstract models.

Kind of tangential, but it's actually the other way around. The low-level world is "non-self-referential"; the universe itself is just one big causal DAG. In order to get a compact representation of it (i.e. a small enough representation to fit in our heads, which are themselves inside the low-level world), we sometimes throw away information in a way which leaves a simpler "self-referential" abstract model. This is a big part of how I think about agenty things in a non-agenty underlying world.

But there's a difference between the low-level world and a perfect model of the low-level world embedded inside the world, isn't there? Also, I don't see how the compact representation is self-referential. If you mean that they can be embedded into the world, that's not what I meant.

I'm not quite clear on what you're asking, so I'll say some things which sound relevant.

I'm embedded in the world, so my world model needs to contain a model of me, which means my world model needs to contain a copy of itself. That's the sense in which my own world model is self-referential.

Practically speaking, this basically means taking the tricks from Writing Causal Models Like We Write Programs, and then writing the causal-model-version of a quine. It's relatively straightforward; the main consequence is that the model is necessarily lazily evaluated (since I'm "too small" to expand the whole thing), and then the interesting question is which queries to the model I can actually answer (even in principle) and how fast I can answer them.
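As a toy illustration of the lazy-evaluation point (a hypothetical sketch, not the construction from the linked post; all class and method names are made up), a world model can contain an agent model that refers back to the world model through a deferred reference. Nothing is expanded until a query asks for it, so any finite query terminates even though the structure is self-referential.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class AgentModel:
    # A thunk (zero-argument function) returning the world model,
    # instead of an eagerly constructed copy of it.
    world_model: Callable[[], "WorldModel"]


class WorldModel:
    def __init__(self) -> None:
        # The world contains the agent, whose own model refers back to this
        # world model. The lambda delays evaluation until it is queried.
        self.agent = AgentModel(world_model=lambda: self)

    def nested_model(self, depth: int) -> "WorldModel":
        """Follow the self-reference `depth` times, expanding on demand."""
        model = self
        for _ in range(depth):
            model = model.agent.world_model()
        return model


world = WorldModel()
# A query of any finite depth halts and lands back on the same model.
assert world.nested_model(1000) is world
```

An eager version that tried to build the nested copy inside the constructor would recurse without end, which is the sense in which the model is "too small" to be expanded in full.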

In particular, based on how game theory works, there's probably a whole class of optimization queries which can be efficiently answered in-principle within this self-embedded model, but it's unclear exactly how to set them up so that the algorithm is both correct and always halts.

My world model is necessarily "high-level" in the sense that I don't have direct access to all the low-level physics of the real world; I expect that the real world (approximately) abstracts into my model, at least within the regimes I've encountered. I probably also have multiple levels of abstraction within my world model, in order to quickly answer a broad range of queries.

Did that answer the question? If not, can you give an example or two to illustrate what you mean by self-reference?

Thanks a lot! I think my misunderstanding came from collapsing the computational complexity issues of self-referential simulation (expanding the model costs too much, as you mention) and the pure mathematical issue of defining such a model. In the latter sense, you can definitely have a self-referential embedded model.

I'm embedded in the world, so my world model needs to contain a model of me, which means my world model needs to contain a copy of itself. That's the sense in which my own world model is self-referential.

I'm not sure why the last "need" is true. Is it because we're assuming my world model is good/useful? Because I can imagine a world model where I'm a black box, and so I don't need to model my own world model.

In theory I could treat myself as a black box, though even then I'm going to need at least a functional self model (i.e. model of what inputs yield what outputs) in order to get predictions out of the model for anything in my future light cone.

But usually I do assume that we want a "complete" world model, in the sense that we're not ignoring any parts by fiat. We can be uncertain about what my internal structure looks like, but that still leaves us open to update if e.g. we see some FMRI data. What I don't want is to see some FMRI data and then go "well, can't do anything with that, because this here black box is off-limits". When that data comes in, I want to be able to update on it somehow.

Trouble comes from self-reference: since the agent is part of the world, so is its model, and thus a perfect model would need to represent itself, and this representation would need to represent itself, ad infinitum. So the model cannot be exact.


What's the issue?

The quote sounds like an argument for non-existence of quines or of the context in which things like the diagonalization lemma are formulated. I think it obviously sounds like this, so raising nonspecific concern in my comment above should've been enough to draw attention to this issue. It's also not a problem Agent Foundations explores, but it's presented as such. Given your background and effort put into the post this interpretation of the quote seems unlikely (which is why I didn't initially clarify, to give you the first move). So I'm confused. Everything is confusing here, including your comment above not taking the cue, positive voting on it, and negative voting on my comment. Maybe the intended meanings of "model" and "being exact" and "representation" are such that the argument makes sense and becomes related to Agent Foundations?

I do appreciate you pointing out this issue, and giving me the benefit of the doubt. That being said, I prefer that comments clarify the issue raised, if only so that I'm more sure of my interpretation. The up and downvotes in this thread are I think representative of this preference (not that I downvoted your post -- I was glad for feedback).

About the quote itself, rereading it and rereading Embedded Agency, I think you're right about what I wrote not being an Agent Foundations problem (at least not one I know of). What I had in mind was more about non-realizability and self-reference in the context of decision/game theory. I seem to have mixed the two with naive Gödelian self-reference in my head at the time of writing, which resulted in this quote.

Do you think that this proposed change solves your issues?

"This has many ramifications, including non-realizability (the impossibility of the agent to contain an exact model of the world, because it is inside the world and thus smaller), self-referential issues in the context of game theory (because the model is part of the agent which is part of the world, other agents can access it and exploit it), and the need to find an agent/world boundary (as it's not given for free like in the dualistic perspective)."

Having an exact model of the world that contains the agent doesn't require any explicit self-references or references to the agent. For example, if there are two programs whose behavior is equivalent, A and A', and the agent correctly thinks of itself as A, then it can also know the world to be a program W(A') with some subexpressions A', but without subexpression A. To see the consequences of its actions in this world, it would be useful for the agent to figure out that A is equivalent to A', but it is not necessary that this is known to the agent from the start, so any self-reference in this setting is implicit. Also, A' can't have W(A') as a subexpression, for reasons that do admit an explanation given in the quote that started this thread, but at the same time A can have W(A') as a subexpression. What is smaller here, the world or the agent?

(What's naive Gödelian self-reference? I don't recall this term, and googling didn't help.)

Dealing with self-reference in definitions of agents and worlds does not require (or even particularly recommend) non-realizability. I don't think it's an issue specific to embedded agents, probably all puzzles that fall within this scope can be studied while requiring the world to be a finite program. It might be a good idea to look for other settings, but it's not forced by the problem statement.

non-realizability (the impossibility of the agent to contain an exact model of the world, because it is inside the world and thus smaller)

Being inside the world does not make it impossible for the agent to contain the exact model of the world, does not require non-realizability in its reasoning about the world. This is the same error as in the original quote. In what way are quines not an intuitive counterexample to this reasoning? Specifically, the error is in saying "and thus smaller". What does "smaller" mean, and how does being a part interact with it? Parts are not necessarily smaller than the whole, they can well be larger. Exact descriptions of worlds and agents are not just finite expressions, they are at least equivalence classes of expressions that behave in the same way, and elements of those equivalence classes can have vastly different syntactic size.

(Of course in some settings there are reasons for non-realizability to be necessary or to not be a problem.)

Thanks for additional explanations.

That being said, I'm not an expert on Embedded Agency, and that's definitely not the point of this post, so just writing stuff that is explicitly said in the corresponding sequence is good enough for my purpose. Notably, the section on Embedded World Models from Embedded Agency begins with:

One difficulty is that, since the agent is part of the environment, modeling the environment in every detail would require the agent to model itself in every detail, which would require the agent’s self-model to be as “big” as the whole agent. An agent can’t fit inside its own head.

Maybe that's not correct/exact/the right perspective on the question. But once again, I'm literally giving a two-sentence explanation of what the approach says, not the ground truth or a detailed investigation of the subject.

Yeah, that was sloppy of the article. In context, the quote makes a bit of sense, and the qualifier "in every detail" does useful work (though I don't see how to make the argument clear just by defining what these words mean), but without context it's invalid.

Sorry for my last comment, it was more a knee-jerk reaction than a rational conclusion.

My issue here is that I'm still not sure of what would be a good replacement for the above quote, that still keeps intact the value of having compressed representations of systems following goals. Do you have an idea?