In the small group of people who care about deconfusing goal-directedness and like Daniel Dennett's intentional stance, there's one paper everyone mentions: Agents and Devices, by Orseau, McGill and Legg. It's also pretty much the only paper ever mentioned on that topic.

This paper seems to formalize the intentional stance enough to run experiments on it, and I have already written that a formalized intentional stance sounds like the most promising deconfusion of goal-directedness. So I thought it was important to review the paper and clarify what I think is missing.

Spoiler: I don't find their proposal satisfying as a formalization of the intentional stance, but it's a great first step in making it more concrete and open to criticism.

Thanks to Victoria Krakovna and Ramana Kumar who always ask me how my current attempt at deconfusing goal-directedness is related to this paper.

Summary

What's this paper about?

Basically, the authors create two classes of hypotheses:

devices which are just programs with the right interface,
and agents which are optimal (technically $ϵ$-greedy optimal) policies for some reward functions.

Each class is combined into a mixture through a prior: a speed prior over the devices, and a simplicity prior (technically a switching prior) over the rewards/goals. Next, the two mixtures are combined with equal weights to form one mixture, which is used as the prior when applying Bayes' rule to some observed trajectory.

So after getting a trajectory, the result is a posterior over our mixture of agents and devices, which tells us both whether agent or device is more probable and which hypothesis is the most probable within each class.
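
To make the equal-weight mixture concrete, here is a minimal sketch of the posterior computation. It is not the paper's code: the function name, the hypothesis names and the numbers in the usage example are all made up for illustration, and the priors and likelihoods are simply assumed to be given.

```python
def mixture_posterior(agents, devices):
    """Posterior over {agent, device} for one observed trajectory.

    `agents` and `devices` map a hypothesis name to a pair
    (prior weight inside its class, likelihood of the trajectory).
    The two classes are mixed with weight 1/2 each, as in the summary above.
    """
    def evidence(hyps):
        # Marginal likelihood of the trajectory under one class mixture.
        return sum(prior * lik for prior, lik in hyps.values())

    def best(hyps):
        # Hypothesis with the largest prior * likelihood inside the class.
        return max(hyps, key=lambda name: hyps[name][0] * hyps[name][1])

    joint_agent = 0.5 * evidence(agents)
    joint_device = 0.5 * evidence(devices)
    total = joint_agent + joint_device
    if total == 0.0:
        raise ValueError("trajectory has zero probability under every hypothesis")
    return {
        "p_agent": joint_agent / total,
        "p_device": joint_device / total,
        "best_agent": best(agents),
        "best_device": best(devices),
    }


# Hypothetical numbers, only to show the call shape.
agents = {"reach_goal_A": (0.5, 0.2), "reach_goal_B": (0.5, 0.0)}
devices = {"follow_wall": (1.0, 0.05)}
result = mixture_posterior(agents, devices)
```

In the actual paper the hard part is of course computing the priors and likelihoods; the Bayes step itself is just this normalization.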

They actually run some experiments in a simple maze, which give the expected results: following walls and running in circles are considered as devices, following a goal or switching goals are considered as agents, and random behavior is considered as neither.

What I like

As a first attempt to make the intentional stance formal enough to experiment on and criticize, the paper does its job perfectly.

Here are also some ideas I found particularly interesting:

(The switching prior) I hadn't heard of it, but it sounds like a pretty neat way to add changes of goals to the formalization.
(Using the speed prior to deal with interpreters) I need to read this part in more detail, but the choice of a speed prior over the simplicity prior to penalize universal interpreters and not make everything a device is quite elegant and principled.
(Experimental framing) Since I tend to focus on theoretical questions, I hadn't thought before reading this paper about how to translate such investigations into an experimental framework. I'll probably reuse this for further deconfusion research.

This doesn't mean I don't see issues with important points of their formalization, though. Let's look at that next.

Issues

Devices?

Let's start with the more nitpicky issue: I'm not sure what devices are supposed to represent in Dennett's view. The authors write that they capture the physical stance, but the physical stance is literally about simulating the laws of physics and seeing what happens.

Instead, I think devices capture the design stance. Which the paper says it ignores explicitly!

It doesn't really matter, but it can lead to some confusion, especially when their examples of devices are things like rocks, while the actual device behaviors in the experiments are things like running in circles.

What are Goals?

In this paper, every reward function is a possible goal. We have known for quite some time that this leads to a problem of spurious goals, which capture exactly the behavior of the system instead of somewhat compressing it (by rewarding only states on the path of the system). The authors of this paper are aware of such problems, and address them by adding a penalty to agents (see With the speed prior in Section 3.2).

I haven't been able to think of a concrete example breaking this scheme, but I still have the general criticism that this is quite ad-hoc: there is no intuition or reasoning for setting the penalty that specific way. It also ends up biasing against goals that aren't spurious, which might make the prior favor devices for wrong reasons.
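
As a toy illustration of what a spurious, trajectory-fitted goal looks like (this is only a sketch, not the paper's construction; the helper name `path_reward` and the use of table size as a crude stand-in for description length are assumptions of mine):

```python
def path_reward(trajectory):
    """Spurious goal: reward 1 on every visited state, 0 (implicitly) elsewhere."""
    return {state: 1.0 for state in trajectory}


# Hypothetical maze cells visited by the system, ending at cell (2, 1).
trajectory = [(0, 0), (0, 1), (1, 1), (2, 1)]

spurious_goal = path_reward(trajectory)   # one entry per visited state
compact_goal = {trajectory[-1]: 1.0}      # "reach the final cell"

# Crude complexity proxy: number of states the reward table has to name.
spurious_size = len(spurious_goal)
compact_size = len(compact_goal)
```

The spurious reward has to name every state on the path, so it restates the trajectory rather than compressing it, whereas the compact goal names a single state.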

Another type of spurious goal that isn't addressed at all in the paper is the trivial or almost trivial goal. Things like "constant reward" or "constant reward except in one state" are part of the set of goals; they have all or almost all policies as their optimal policies, and so predict any behavior as well as possible. This issue would probably be mitigated if the prior over agents were over the optimal policies (because then every policy gets a boost and so none gets a relative boost), but the prior is on rewards, meaning that actually running the mixture with the real priors would always give the trivial goal as the best predictor.

How can the experiments work then? Because they hand-pick 4 goals. Which doesn't mean the experiments are worthless, but it's a problem if you have no way to search over goals without handpicking them.
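
The claim that a constant reward makes every policy optimal is easy to check on a toy problem. The following is a minimal sketch under my own assumptions (a hypothetical 3-state deterministic chain, discounted value iteration, reward received on entering a state); it is not the maze or the algorithm from the paper.

```python
def optimal_actions(transitions, reward, gamma=0.9, iters=200, tol=1e-9):
    """List, for each state, the actions that are optimal under `reward`.

    transitions[s][a] is the (deterministic) next state after action a in s.
    reward[s] is the reward received on entering state s.
    """
    n = len(transitions)
    values = [0.0] * n
    for _ in range(iters):
        # Synchronous value-iteration sweep.
        values = [
            max(reward[nxt] + gamma * values[nxt] for nxt in transitions[s])
            for s in range(n)
        ]
    result = []
    for s in range(n):
        q = [reward[nxt] + gamma * values[nxt] for nxt in transitions[s]]
        top = max(q)
        result.append([a for a, qa in enumerate(q) if top - qa <= tol])
    return result


# Hypothetical chain 0 - 1 - 2; action 0 moves left, action 1 moves right
# (moving into a wall leaves the state unchanged).
chain = [[0, 1], [0, 2], [1, 2]]

trivial = optimal_actions(chain, reward=[1.0, 1.0, 1.0])  # constant reward
goal = optimal_actions(chain, reward=[0.0, 0.0, 1.0])     # reach state 2
```

With the constant reward, both actions are optimal in every state, so any behavior is consistent with an "optimal agent" for that goal; with the single-state goal, only moving right is optimal.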

Optimality $\neq$ Goal-Directedness

As I wrote in the summary, agents are basically ($ϵ$-greedy versions of) optimal policies for reward functions. But that misses a big point of the intentional stance, which depends on the beliefs of the intentional system. Since these beliefs might be incomplete or even false, the intentional stance often predicts non-optimality.

I like the example of an average chess player, who we can quite easily see is trying to win despite having no chance at all against a grandmaster or AlphaZero.

Isn't the $ϵ$-greediness there to deal with that issue? Yes, but the band-aid doesn't work: intentional systems can be less than optimal in ways vastly different from a random and uniform chance of error. Once again, the average chess player example comes to mind.
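
To see why $ϵ$-greediness only models uniform noise, here is a small sketch of the likelihood it assigns to a sequence of actions. I'm assuming the textbook form of $ϵ$-greedy (with probability $ϵ$, pick an action uniformly at random) and a unique optimal action per step; the paper's exact definition may differ, and the function name is mine.

```python
import math


def epsilon_greedy_likelihood(actions, optimal, n_actions, epsilon):
    """Probability of `actions` under an epsilon-greedy policy.

    optimal[t] is the (assumed unique) optimal action at step t.
    """
    p_optimal = (1.0 - epsilon) + epsilon / n_actions
    p_other = epsilon / n_actions
    likelihood = 1.0
    for action, best in zip(actions, optimal):
        likelihood *= p_optimal if action == best else p_other
    return likelihood


optimal = [0, 0, 0, 0]
early_blunder = [1, 0, 0, 0]   # one mistake, at the start
late_blunder = [0, 0, 0, 2]    # one different mistake, at the end

lik_early = epsilon_greedy_likelihood(early_blunder, optimal, n_actions=4, epsilon=0.1)
lik_late = epsilon_greedy_likelihood(late_blunder, optimal, n_actions=4, epsilon=0.1)
same = math.isclose(lik_early, lik_late)
```

The likelihood depends only on how many moves were non-optimal, not on which mistakes were made or when. A player who systematically errs in the same kind of position because of a false belief is scored exactly like one who errs at random.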

I have argued recently that replacing this notion of optimality is quite important, since we want to be able to detect goal-directedness when the system is still below human level or not super intelligent rather than when it reaches very close to optimality.

Different focus

Lastly, I have not so much a technical issue as a framing one: the vast majority of this paper is spent on clarifying the mathematical details of the formalization, without much space allocated to the actual intuitions and deconfusion. This is quite symptomatic of how scientific papers are "supposed" to be written nowadays, yet how much of a missed opportunity it feels like! From the reference to Dennett and the glimpses into intuitions, I'm convinced that the authors have thought long and hard about the topic, and probably have many important insights to share.

Conclusion: a valuable first step

Don't get the idea from this post that I hate this paper. On the contrary, I'm really glad and excited that other people worked on this subject, and that they pushed forward to experimental results. Most of my criticisms are in the end about simplifications necessary to get a concrete experiment running. I'm complaining that they didn't solve the deconfusion around the topic, but that doesn't seem to have been their objective anyway.

AI ALIGNMENT FORUM
A review of "Agents and Devices"