Produced as part of the SERI ML Alignment Theory Scholars Program - Winter 2022 Cohort.

I think there's an important distinction to be made between work in agent foundations which is concerned with normative models, and work which is concerned with descriptive models. They are increasingly separate bodies of work, with different aims and different theories of change when it comes to alignment.


The normative branch is typified by the Embedded Agency sequence, and the whole thing can be summed up as 'The Hunt for Embedded AIXI'. Its goal is to figure out how to build an ideal agent in principle. Decision theory, infrabayesianism, and logical induction all come under the normative banner.

The descriptive branch is typified by John Wentworth's Basic Foundations for Agent Models sequence. Descriptive work aims to understand the agents we run into in the wild. Other examples include shard theory, Critch's Boundaries sequence, and the Discovering Agents paper.

Theories of Change


I'll start with the descriptive branch. The most ambitious version of its goal is to understand agency so well that in principle we could take an unabstracted, non-agentic description of a system - e.g. a physics-level causal graph, the weights in a neural network, or a cellular model of a squirrel - and identify what, if any, are its goals, world-model, and so on. If we could do that in principle, then in practice we could probably check whether an artificial agent is aligned, and maybe we could even do things like surgically modify its goals, or directly point to things we care about in its world-model. I think that's what John is aiming for. A less ambitious goal, which I think better describes the aims of shard theory, is to understand agency well enough that we can carefully guide the formation of agents' goals during ML training runs.

Beyond that, I think everyone involved expects that descriptive work could lead to foundational insights that change our minds about which alignment strategies are most promising. In particular, these insights might answer questions like: whether intelligent entities are inevitably agents, whether agents are inevitably consequentialists, whether corrigibility is a thing, and whether we should expect to encounter sharp left turns.


The normative branch shares the conceptual clarification theory of change. I think there's a reasonable argument to be made that we should expect the theoretical ideal of agency to be much easier to understand than agency-in-practice, and that understanding it might provide most of the insight. But the normative branch also has a much more ambitious theory of change, which is something like: if we understand the theoretical ideal of agency well enough, we might be able to build an aligned AGI manually 'out of toothpicks and rubber bands'. I think this hope has fallen by the wayside in recent years, as the capabilities of prosaic AI have rapidly progressed. Doing it the hard way just seems like it will take too long.

[Image: an aligned AGI built out of toothpicks and rubber bands.]


The Embedded Agency sequence identifies four rough subquests in The Hunt for Embedded AIXI. Most work in the normative branch can be thought of as attacking one or another of these problems. Many of the insights of that sequence are directly applicable to the descriptive case, but the names of the subproblems are steeped in normative language. Moreover, there are aspects of the descriptive challenge which don't seem to have normative analogues. It therefore seems worth trying to identify a separate set of descriptive subproblems, and roughly categorise descriptive work according to which of them it gets at. I'll suggest some subproblems here, with a view to using them as a basis for a literature review of the whole field once I've got some feedback and iterated on them a bit.


First, a reminder of the four problems identified in the Embedded Agency sequence. These are things that AIXI doesn't have to deal with on account of being an uncomputable black box living outside of its environment. We can think of them as problems that an ambitious agent (or its creator) would encounter in the process of trying to achieve its goals in the real world. In contrast, the descriptive subproblems will look more like problems that we as modellers encounter in the process of trying to think of a physical system as an agent.

Decision Theory. AIXI's actions affect the world in a well-defined way, but embedded agents have to figure out whether they care about the causal, evidential, or logical implications of their choices.

Embedded World-Models. AIXI can hold every possible model of the world in its head in full detail and consider every consequence of its actions, but embedded agents are part of the world, and have limited space and compute with which to model it.

Robust Delegation. AIXI is unchanging and the only agent in town, but embedded agents can self-modify and create other agents. They need to ensure their successors are aligned.

Subsystem Alignment. AIXI is indivisible, but embedded agents are chunks of the world made up of subchunks. What if those subchunks are agents with their own agendas?

Suggested Descriptive Subproblems

To some extent these will be renamings of the normative problems, but each also has aspects that don't arise in its normative counterpart.

I/O Channels. Actions, observations, and cartesian boundaries aren't primitive: descriptive models need to define them. How do we move from a non-agentic model of the world to one with free will and counterfactuals?

Internal Components. Presumably agents contain things like goals and world-models, but what else? And how do these components work mathematically?

Future Agents. What is the relationship between an agent and its future self, or its successors? To what extent can goals be passed down the line?

Subagents and Superagents. Do agents contain subagents? When can the interaction of a group of agents be thought of as a superagent? How do the goals of subagents relate to the goals of superagents?

Identifying Agents. Can we determine which parts of the world contain agents, and read off their internal components? Should we expect our models of agency to be very accurate, like the models of physics, or just a rough guide, like the models of economics? And how close are agents in practice to normative ideals?


I/O Channels corresponds to Decision Theory; Internal Components corresponds to Embedded World-Models; Future Agents to Robust Delegation; and Subagents and Superagents to Subsystem Alignment. But the emphases of the problems are somewhat different. To take I/O Channels/Decision Theory as an example, defining actions and observations from lower-level phenomena is more obviously important in the descriptive case, and debating the relative merits of causal and evidential reasoning seems to be mostly a normative concern. But there's overlap: both are concerned with which parts of the world to draw a cartesian boundary around and treat as a single agent.

Identifying Agents is a catch-all category for direct questions about how our mathematical models correspond to reality. This seems like a vitally important part of the descriptive challenge which doesn't have a normative analogue.


I think the distinction between normative and descriptive agent foundations work is a useful one to have in your head. The normative branch of the field hopes to gain insight by understanding the theoretical ideal of agency, with an outside chance of getting so much insight we can build an aligned AGI manually. The descriptive branch hopes to gain power by understanding the agents we encounter in practice, and hopefully pick up some theoretical insight along the way.

Am I missing any important subproblems of the descriptive challenge? Is there a better way to carve things up? Is the whole normative/descriptive division misguided? Let me know.

Thanks to Joern Stoehler for discussion and feedback.





Nice, thanks. It seems like the distinction the authors make between 'building agents from the ground up' and 'understanding their behaviour and predicting roughly what they will do' maps to the distinction I'm making, but I'm not convinced by the claim that the second one is a much stronger version of the first.

The argument in the paper is that the first requires an understanding of just one agent, while the second requires an understanding of all agents. But it seems like they require different kinds of understanding, especially if the agent being built is meant to be some theoretical ideal of rationality. Building a perfect chess algorithm is just a different task to summarising the way an arbitrary algorithm plays chess (which you could attempt without even knowing the rules).