I develop some contents—previously introduced in the Value Learning sequence by Rohin Shah—more formally, to clarify the distinction between agents with and without a goal. Then I present related work and make some considerations on the relation between safety and goal-directedness. The appendix contains some details on the used formalism and can be skipped without losing much information.

A brief preliminary

In the first post of the Value Learning sequence, Shah compares two agents that exhibit the same behaviour (a winning strategy) when playing Tic-Tac-Toe, but are different in their design: one applies the minimax algorithm to the setting and rules of the game, while the other one follows a lookup table—you can think of its code as a long sequence of if-else statements.

Shah highlights the difference in terms of generalisation: the first one would still win if the winning conditions were changed, while the lookup table would not. Generalisation is one of the components of goal-directedness, and lookup tables are among the least goal-directed agent designs.

Here I want to point at another difference that exists between agents with and without a goal, based on the concept of algorithmic complexity.


Most problems in AI consist in finding a function , called policy in some contexts, where and indicate the sets of possible actions and observations. A deterministic policy can be written as a string with indicating the action taken when is observed.

Here I consider a problem setting as a triplet where stands for some kind of environmental data—could be about, for example, the transition function in a MDP, or the structure of the elements in the search space . Since I want to analyse behaviour across different environments, instead of considering one single policy I’ll sometimes refer to a more general function (probably closer to the concept of “agent design”, rather than just “agent”) mapping environments to policies on them: .

Lookup table vs simple-reward RL

As written before, a lookup table is a policy that is described case-by-case, by explicitly giving the list of all observation-action pairs. Thus, a generic lookup table for a setting is expected to have Kolmogorov complexity : most lookup tables are incompressible.

Now let’s consider a policy generated via Reinforcement Learning. Such a policy can be written as , where is the training algorithm, and indicates an algorithmic procedure that gets the environmental data as input and returns a reward function to be used by the RL algorithm. Often, corresponds to the “work” done by the human designers who assign appropriate rewards to states, according to what they want the agent to achieve in the environment.

However, any policy could actually be expressed in the form with an appropriately constructed reward function , because of an argument analogous to the one that Shah showed in this other post. The relevant element for goal-directedness is the algorithmic complexity, compared to the environment size, of the reward function.

If the algorithmic procedure has low complexity, then I expect that

especially in large environments. As an example for this case, consider the policy that is the result of RL training on a maze with the same small negative reward assigned to every state except for the exit, which has reward instead. For this reward function, ) is low, since the procedure is only about recognising the exit square of the maze given as input . Moreover, the same exact procedure can be used to find an exiting policy in larger mazes.

The fact that low algorithmic complexity is a sign of goal-directed behaviour has an analogy in natural language. When we say the goal of an agent is to exit mazes, we are giving a short description of its policies across multiple environments, no matter how big they are. In other words, we are expressing the policies in a compressed form by using the simple reward function, which coincides with the goal of the agent.

On the other hand, consider a reward function that assigns lots of different values to all states in the maze, with no recognisable pattern. In this case, the algorithmic complexity of the reward function is approximately equal to the size of the observation set: the situation is the same as with incompressible lookup tables.

In the analogy with natural language, this corresponds to the fact that the only way of describing such policies would be to state what the action (or the reward) is for each observation. There would be no goal which we can use to give a short description of the agent behaviour.

The lookup table vs simple-reward RL contrast appears under different forms in other contexts as well. Consider search processes (in this case, ) that select one or more elements in a search space and take as input some data about the elements. Manually specified filters, that for each search space list which elements to take and which to discard, are generally as complex as the size of the search space, thus essentially identical to lookup tables and without a recognisable goal.

On the other hand, a search algorithm that uses the data to generate an ordering of the elements according to a simple evaluating function, and takes the best element, is succinctly described by the evaluating function itself. Moreover, the latter search process naturally extends to larger environments, while the filter needs one more specification for each element that is added to the search space.

The previous example with search processes smoothly leads to the consideration of an alternative formalism to standard optimisation: quantilizers. Briefly, instead of taking the best element in the ordering generated by the evaluation function, an element is chosen from the top portion of the same ordering, according to a probability distribution.

If a standard optimisation process, like the previous example, is described as , then a quantilizer can be expressed similarly as , with a negligible change in complexity.

In terms of goal-directedness, this corresponds to the fact that the two agents can be said to have more or less the same goal, since they are interested in the same property captured by the evaluating function. The difference lies in the degree of “directedness”: the first agent applies straightforward hard optimisation, while the quantilizer is, in a sense, more relaxed and possesses some safety properties (e.g. Lemma 1 in the original paper).

A similar comparison in the context of learning can be done between the optimal policy maximising a simple reward function, and a policy that does the same of the time but takes a default action of the time—something like waiting for new human instructions. These two agents have the same goal, but they pursue it differently, since the second one leaves more room for corrections.

Overall, when comparing agents with the same goal, it seems that less-directed agents trade a certain amount of performance for a gain in terms of safety.

Compression of complex policies is also listed as a favourable condition for mesa-optimisation (see section 2 in the paper). When searching for algorithms that can solve a certain class of problems, a bias in favour of short policies increases the likelihood that an algorithm which is itself an optimiser is chosen; ideally, if the bias for simplicity is strong enough, it may be possible to find an algorithm that generalises to problems outside of the original class. Unsurprisingly, such an algorithm is also more likely to be goal-directed.

Implications for safety

When the full policy of the agent can be described as the result of a relatively short algorithm applied to some short data representing the goal, as in the case of , we can interpret as a compressor that selects the goal-related data from all the environmental data , and as a decompressor that uses the goal-related data to generate the full policy. Thus, because of error propagation, we should expect arbitrarily large errors in the full policy if there is any kind of error in the specification of the goal-related data.

This slightly formal reasoning reflects the less formal argument that, if we design a powerful goal-directed AI supposed to act in large and varied environments, and we make a mistake in the specification of the goal, the resulting policy could be arbitrarily far from what we originally expected or wanted.

A fundamental safety question is whether it’s possible to design safe agents that have goals and act in the real world. We’ve seen before that, among agents with the same goal, some are safer than others, but it seems hard to tell if this is enough to avoid all possible bad outcomes.

Note, however, that even if “standard” goal-directed agents were unsafe, an alternative solution could be to design agents that still use some compression, but also have a certain amount of explicitly-specified ad-hoc behaviour for unique scenarios that wouldn’t be handled correctly otherwise. Such an agent would be an intermediate design between the two extremes shown before, i.e. lookup tables and agents with a clear goal.

Thanks to Adam Shimi, Joe Collman and Sabrina Tang for feedback.

This work was supported by CEEALAR (EA hotel).


  • The environmental data is underspecified and not completely formal, but the main ideas in the post should be clear enough anyway, so that it’s easy to criticise them or suggest new research directions.
  • Even though I showed the simpler case with deterministic policies over states, the reasoning is the same with stochastic policies over histories.
  • The use of Kolmogorov complexity actually requires fixing a Universal Turing Machine: the above analysis doesn’t change at its core. You can think of all mentioned algorithms as if they were written in the same programming language.
  • Kosoy has proposed a definition of goal-directed intelligence that also uses algorithmic complexity, but in a different way.
New Comment
8 comments, sorted by Click to highlight new comments since:

Planned summary for the Alignment Newsletter:

This post argues that a distinguishing factor of goal-directed policies is that they have low Kolmogorov complexity, relative to e.g. a lookup table that assigns a randomly selected action to each observation. It then relates this to quantilizers (AN #48 ) and <@mesa optimization@>(@Risks from Learned Optimization in Advanced Machine Learning Systems@).

Planned opinion:

This seems reasonable to me as an aspect of goal-directedness. Note that it is not a sufficient condition. For example, the policy that always chooses action A has extremely low complexity, but I would not call it goal-directed.

The others in the AISC group and I discussed the example that you mentioned more than once. I agree with you that such an agent is not goal-directed, mainly because it doesn't do anything to ensure that it will be able to perform action A even if adverse events happen.

It is still true that action A is a short description of the behaviour of that agent and one could interpret action A as its goal, although the agent is not good at pursuing it ("robustness" could be an appropriate term to indicate what the agent is lacking).

Maybe the criterion that removes this specific policy is locality? What I mean is that this policy has a goal only on its output (which action it chooses), and thus a very local goal. Since the intuition of goals as short descriptions assumes that goals are "part of the world", maybe this only applies to non-local goals.

I wouldn't say goals as short descriptions are necessarily "part of the world".

Anyway, locality definitely seems useful to make a distinction in this case.

If we're talking about algorithmic complexity, there's a really important distinction, I think. In the space of actions, we have:

  • SEARCH: Search over an action space for an action that best achieves a goal
  • DISCRIMINATIVE: There is a function that goes directly from sensory inputs to the appropriate action (possibly with a recurrent state etc.)

Likewise, in the space of passive observations (e.g. classifying images), we have:

  • SEARCH: Search over a space of generative models for the model that best reproduces the observations (a.k.a. "analysis by synthesis")
  • DISCRIMINATIVE: There is a function that goes directly from sensory inputs to understanding / classification.

The search methods are generally:

  • Algorithmically simpler
  • More sample-efficient
  • Slower at run-time

(Incidentally, I think the neocortex does the "search" option in both these cases (and that they aren't really two separate cases in the brain). Other parts of the brain do the "discriminative" option in certain cases.)

I'm a bit confused about how my comment here relates to your post. If p is the goal ("win at chess"), the simplest search-based agent is just about exactly as complicated as p ("do a minimax search to win at chess"). But RL(p), at least with the usual definition of "RL", will learn a very complicated computation that contains lots of information about which particular configurations of pieces are advantageous or not, e.g. the ResNet at the core of AlphaZero.

Are you imagining that the policy π is searching or discriminative? If the latter, why are you saying that π is just as simple as p? (Or are you saying that?) "Win at chess" seems a lot simpler than "do the calculations described by the following million-parameter ResNet", right?

I'm not sure I understand the search vs discriminative distinction. If my hand touches fire and thus immediately moves backwards by reflex, would this be an example of a discriminative policy, because an input signal directly causes an action without being processed in the brain?

About the goal of winning at chess: in the case of minimax search, generates the complete tree of the game using and then selects the winning policy; as you said, this is probably the simplest agent (in terms of Kolmogorov complexity, given ) that wins at chess—and actually wins at any game that can be solved using minimax/backward induction. In the case of , reads the environmental data about chess to assign reward to winning states and elsewhere, and represents an ideal RL procedure that exploits interaction with the environment to generate the optimal policy that maximises the reward function created by . The main feature is that in both cases, when the environment gets bigger and grows, the description length of the two algorithms given doesn't change: you could use minimax or the ideal RL procedure to generate a winning policy even for chess on a larger board, for example. If instead you wanted to use a giant lookup table, you would have to extend your algorithm each time a new state gets added to the environment.

I guess the confusion may come from the fact that is underspecified. I tried to formalise it more precisely by using logic, but there were some problems and it's still work in progress.

By the way, thanks for the links! I hope I'll learn something new about how the brain works, I'm definitely not an expert on cognitive science :)

Hmm, maybe we're talking past each other. Let's say I have something like AlphaZero, where 50,000 bytes of machine code trains an AlphaZero-type chess-playing agent, whose core is a 1-billion-parameter ConvNet. The ConvNet takes 1 billion bytes to specify. Meanwhile, the reward-calculator p, which calculates whether checkmate has occurred, is 100 bytes of machine code.

Would you say that the complexity of the trained chess-playing agent is 100 bytes or 50,000 bytes or 1 billion bytes?

I guess you're going to say 50,000, because you're imagining a Turing machine that spends a year doing the self-play to calculate the billion-parameter ConvNet and then immediately the same Turning machine starts running that ConvNet it just calculated. From the perspective of Kolmogorov complexity, it doesn't matter that it spends a year calculating the ConvNet, as long as it does so eventually.

By the same token, you can always turn a search-y agent into an equivalent discriminitive-y agent, given infinite processing time and storage, by training the latter on a googol queries of the former. If you're thinking about Kolmogorov complexity, then you don't care about a googol queries, as long as it works eventually.

Therefore, my first comment is not really relevant to what you're thinking about. Sorry. I was not thinking about algorithms-that-write-arbitrary-code-and-then-immediately-run-it, I was thinking about the complexity of the algorithms that are actually in operation as the agent acts in the world.

If my hand touches fire and thus immediately moves backwards by reflex, would this be an example of a discriminative policy, because an input signal directly causes an action without being processed in the brain?

Yes. But the lack-of-processing-in-the-brain is not the important part. A typical ConvNet image classifier does involve many steps of processing, but is still discriminative, not search-y, because it does not work by trying out different generative models and picks the one that best explains the data. You can build a search-y image classifier that does exactly that, but most people these days don't.

No worries, I think your comment still provides good food for thought!