Many people believe that understanding "agency" is crucial for alignment, but as far as I know, there isn't a canonical list of reasons why we care about agency. Please describe any reasons why we might care about the concept of agency for understanding alignment below. If you have multiple reasons, please list them in separate answers below.

Please also try to be as specific as possible about what our goal is in the scenario. For example:

We want to know what an agent is so that we can determine whether or not a given AI is a dangerous agent

Whilst useful, this isn't quite as good as:

We have an AI which may or may not have goals aligned with us and we want to know what will happen if these goals aren't aligned. In particular, we are worried that the AI may develop instrumental incentives to seek power and we want to use interpretability tools to help us figure out how worried we should be for a particular system.

We can imagine a scale of such systems according to how they behave in novel situations:

  • On the low end of this scale, we have a system that only follows extremely broad heuristics that it learned during training
  • On the high end of this scale, we have a system that uses general reasoning capabilities to discover new and innovative strategies even if it has never used anything like this strategy before, or seen it in the training data

We can think of such a scale as a concretisation of agency. This scale provides an indication of both:

  • How dangerous a system is likely to be: though a system with excellent heuristics and very little agency could be dangerous as well
  • What kind of precautions we might want to take: for example, how much can we rely on our off-switch to disable the agent

It's plausible that interpretability tools might be able to give us some idea of where an agent is on this scale just by looking at the weights. So having a clear definition of this scale could help by clarifying what we are looking for.

In a few days, I'll add any use cases I'm aware of myself that either haven't been covered or that I don't think have been adequately explained by different answers.





Max H


Agency may be a convergent property of most AI systems (or at least, of many systems people are likely to try to build), once those systems reach a certain capability level. The simplest and most useful way to predict the behavior of such systems may therefore be to model them as agents.

Perhaps we can avoid the problems posed by agency by building only tool AI. In that case, we probably still need a deep understanding of agency to make sure we avoid building an agent by accident. Instrumental convergence may imply that all sufficiently powerful AI systems start looking like agents eventually, past a certain point. Though, when a particular system is best modeled as an agent may depend on the particulars of that system, and we may want to push that point out as far as possible.

Boiling this down to a single specific reason about why we should care about agency: the concept of agency is likely to be key for creating simple, predictively accurate models of many kinds of powerful AI systems, regardless of whether the builders of those systems:

  • deeply understand the concepts of agency (or alignment) themselves
  • are deliberately trying to build an agent, or deliberately trying to avoid that, or just trying to build the most powerful system as fast as possible, without explicitly trying to avoid or target agency at all. (We seem to be in a world in which different people are trying all three of these things simultaneously.)

A few arguments or stubs of arguments for why the claim above is correct and important:

  • Agency is already a useful model of humans and human behavior in many situations.
  • Agency is already a useful model of some current AI systems: MuZero, Dreamer, Stockfish in the domains of their respective game worlds. It might soon be a useful model of constructions like Auto-GPT, in the domain of the real world.
  • The hypothesis that agency is instrumentally convergent means that it will be important in understanding all AI systems above a certain capability level.

Summary: John describes the problems of inner and outer alignment. He also describes the concept of True Names - mathematical formalisations that hold up under optimisation pressure. He suggests that having a "True Name" for optimizers would be useful if we wanted to inspect a trained system for an inner optimiser and not risk missing something.

He further suggests that the concept of agency breaks down into lower-level components like "optimisation", "goals", "world models", etc. It would be possible to make further arguments about how these lower-level concepts are important for AI safety.


I spent a few hours today just starting to answer this question, and only got as far as walking through what this "agency" thing is which we're trying to understand. Since people have already asked for clarification on that topic, I'll post it here as a standalone mini-essay. Things which this comment does not address, which I may or may not get around to writing later:

  • Really there should be a few more multi-agent phenomena at the end - think markets, organizations/firms, Schelling points, governance, that sort of thing. I ran out of steam before getting to those.
  • What might "understanding" each of these phenomena look like?
  • How might it all fit together into a coherent whole picture? (Though hopefully the parts below are enough to start to see the unifying structure.)
  • How would better understanding of each of these phenomena individually yield incremental progress on various alignment subgoals? (Basically any of them would be incrementally useful for multiple alignment approaches/subproblems.)
  • How would a unified understanding of all these pieces address the hard parts of alignment? In particular, how could they rule out large classes of potential unknown unknowns?

What "agenty" phenomena are we talking about?

Prerequisite: Boundaries

So there's this thing where everything interacts with everything else, but mostly not directly. A sled's motion down a hill is influenced, to varying degrees, by motions of far-off stars or by magma flows in the earth's crust or by the fashion choices of teenagers at a nearby high school. But those effects are some combination of (a) small, and (b) mediated by things which interact with the sled more directly, like its weight or the coefficient of friction between sled and hill. This phenomenon - most interactions being mediated by a few factors - is a necessary precondition to science working at all in our high-dimensional world. Otherwise, reproducible outcomes would require that we control way too many things to ever realistically achieve reproducibility.

Building on that, there's also this thing where a biological cell interacts with its surroundings mostly via specific sensors/channels on the membrane, despite all sorts of complex stuff happening inside. Or a deposit bank interacts with its customers mostly via fancy versions of "you put money in and take money out, the bank says 'no' if the amount you try to take out is greater than the amount you put in", despite lots of complex stuff going on behind the scenes at the bank to make it work.

These are "boundaries": some relatively-large/complex systems interact with the rest of the world only through relatively-narrow/simple information-channels. We need some notion of boundaries, and of interactions flowing across those boundaries, in order to carve out some subsystem to call an "agent".

The Basics: Agency

So there's this thing where a thermostat senses the initial temperature of a room, and then does different things (like e.g. activating heating or cooling) depending on the initial temperature, in such a way that the final temperature consistently ends up roughly the same, for many different initial temperatures?

Or a bacterium senses how sugar concentrations change as it swims along, and then does different things (like e.g. continuing forward or tumbling around to face a random new direction) depending on how the sugar concentration changes, in such a way that it ends up in an area with lots of sugar, for many different initial positions or sugar concentration landscapes?

Or most animals will look and listen and smell around themselves, and then do different things (like e.g. run or fly or swim different directions, or bite, or stay very still, or...) depending on what they see/hear/smell, in such a way that they end up eating food and not being eaten themselves (mostly, over short time horizons) and having children, for many different configurations of the trees and rocks and plants and animals around them?

That's the most basic form of "agency": taking different actions depending on observations, in order to achieve a consistent outcome (or class of outcomes).
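
The thermostat case can be made concrete with a minimal sketch. Everything here is an illustrative assumption rather than anything from the examples above: the function name `run_thermostat`, the 20-degree setpoint, the half-degree tolerance band, and the half-degree step size. The point is only that the action depends on the observation, and that the outcome is consistent across very different starting conditions.

```python
def run_thermostat(initial_temp, setpoint=20.0, steps=200):
    """Simulate a simple on/off thermostat; all constants are illustrative assumptions."""
    temp = initial_temp
    for _ in range(steps):
        # Observe the temperature, then pick an action that depends on the observation.
        if temp < setpoint - 0.5:
            action = 0.5    # too cold: heat
        elif temp > setpoint + 0.5:
            action = -0.5   # too hot: cool
        else:
            action = 0.0    # close enough: do nothing
        temp += action
    return temp


# Very different initial temperatures, roughly the same final temperature.
final_temps = {start: run_thermostat(start) for start in (-5.0, 12.0, 20.0, 35.0)}
for start, final in final_temps.items():
    print(f"start={start:6.1f}  final={final:5.1f}")
```

Swapping the temperature for a sugar concentration, and heating/cooling for swimming forward or tumbling, would give a similar sketch of the bacterium example.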

The next few phenomena follow from that basic idea: they either allow a system to achieve a consistent outcome more robustly (i.e. across more initial conditions), or to achieve a more specific consistent outcome, or they're the "easiest" way (in a statistical sense) to achieve a consistent outcome across many different conditions.


Modules

So there's this thing where an animal or plant develops specialized organs, which interact with the rest of the organism only in relatively simple, specialized ways. Or an organization has many departments, which specialize in particular roles and present a simplified API to the rest of the company.

These are "modules": subsystems with boundaries of their own, interacting with the rest of the system through relatively-limited/simple information channels.


Factorization

So there's this thing where a human wants milk for their coffee and doesn't have any, and they break this problem up into subproblems. One subproblem is to drive to the store. Another is to find the milk within the store. A third is to make enough money to pay for the milk. These subproblems are mostly-independent: the human mostly doesn't need to think about the details of finding milk within the store in order to drive to the store, nor do they need to think about driving to the store in order to make money.

Or, an organism has organs/organelles with specialized roles which interact in relatively-limited/simple ways. In order for those organs to solve the organism's problems, they must each handle subproblems which are mostly-independent of the others (else the organs would need to pass a lot more information between themselves to solve the organism's top-level problems.) Same with departments of a company.

This is "factorization", a dual in some sense to modules: when faced with a problem, break it up into subproblems which can be solved mostly-independently.


Coherence

So there's this thing where a biological cell needs a handful of different metabolic resources - most obviously energy (i.e. ATP), but also amino acids, membrane lipids, etc. And often cells can produce some metabolic resources via multiple different paths, including cyclical paths - e.g. it's useful to be able to turn A into B but also B into A, because sometimes the environment will have lots of B and other times it will have lots of A. But we also expect that the cell usually won't spend energy to turn B into A and spend energy to turn A into B at the same time; energy is a scarce resource, and we expect that the cell can produce more progeny (on average) if it doesn't waste resources that way. So, the cell can achieve higher fitness if it represses either the A -> B pathway or the B -> A pathway at any given time, depending on which metabolite is more abundant. (See here for more detail on this example and how it relates to utility maximization, plus a bunch of meta discussion.)

Or, consider the toy example of a hospital administrator budgeting to save as many lives as possible. If the administrator spends $1M on a liver for someone who needs a transplant, but does not spend $100k on a dialysis machine which will save 3 lives, then the administrator has failed to budget in a way which saves as many lives as possible. They could save strictly more lives on the same budget by taking the dialysis machine over the liver.

That's "coherence": taking multiple actions in different times/places, in such a way that the actions together are Pareto optimal with respect to scarce resources.
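
The hospital example can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: the total budget of $1M and the liver transplant saving exactly one life are not given in the example and are assumed here, and `best_allocation` is a hypothetical helper that simply brute-forces every subset of purchases.

```python
from itertools import combinations

# (name, cost in dollars, lives saved).
# Assumption: the liver transplant saves exactly one life; the example only gives its cost.
OPTIONS = [
    ("liver transplant", 1_000_000, 1),
    ("dialysis machine", 100_000, 3),
]


def best_allocation(options, budget):
    """Brute-force the subset of purchases that saves the most lives within budget."""
    best = ((), 0, 0)  # (names, total cost, lives saved)
    for r in range(len(options) + 1):
        for subset in combinations(options, r):
            cost = sum(c for _, c, _ in subset)
            lives = sum(saved for _, _, saved in subset)
            if cost <= budget and lives > best[2]:
                best = (tuple(name for name, _, _ in subset), cost, lives)
    return best


# Assumption: a total budget of $1M, so both purchases together are unaffordable.
names, cost, lives = best_allocation(OPTIONS, budget=1_000_000)
print(names, cost, lives)
```

Under these assumptions the liver-only allocation is dominated: the dialysis machine saves more lives for less money, so an administrator who buys the liver instead is not on the Pareto frontier.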

World Models

So there's this thing where a human keeps a map in their head of the stuff around them, including outside their direct line of sight. (You can tell humans do this because if they turn around and see some big obvious thing behind them which was not there last time they looked, they will be surprised, whereas if they see some big obvious thing which was there, they will not be surprised.) And that map constantly updates as new information comes in, in such a way that the map continues to track stuff around the human pretty robustly, even if there's weird stuff which messes it up for a little while.

Even E. coli, when swimming along a sugar gradient, have an internal molecular species whose concentration roughly tracks the rate-of-change of the external sugar concentration as the bacterium swims. It's a tiny internal model of the E. coli's environment. More generally, cells often use some internal molecular species to track some external state, and update that internal representation as new information comes in.

That's a "world model": some internal stuff which consistently tracks the state of (some parts of) the external world, and updates to continue tracking that external state as new information comes in.
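
A toy version of the "map in the head" example, as a sketch only: the `WorldModel` class, its `observe` method, and the object and location names are all hypothetical. It shows the two properties described above: surprise when an observation disagrees with the internal map, and an update so that the map keeps tracking the external state.

```python
class WorldModel:
    """Toy internal map: object name -> last believed location (all names hypothetical)."""

    def __init__(self):
        self.beliefs = {}

    def observe(self, obj, location):
        """Compare an observation with the map, report surprise, then update the map."""
        surprised = self.beliefs.get(obj) != location
        self.beliefs[obj] = location
        return surprised


model = WorldModel()
first = model.observe("couch", "behind me")     # new information: surprised
second = model.observe("couch", "behind me")    # matches the map: not surprised
third = model.observe("couch", "by the door")   # the world changed: surprised, map updates
print(first, second, third, model.beliefs)
```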

General-Purpose Search

So there's this thing with humans where you can give them pretty arbitrary tasks, from assembling some furniture to coding an app to planning an invasion, and they'll go figure out how to do it. In particular, humans can come up with plans to do pretty arbitrary tasks, before actually starting the tasks. (And of course competent humans usually iteratively update those plans as they try stuff and new information changes their world-model.) This is in contrast to fixed strategies, which can't update to many new tasks or adjust as new information comes in.

The part which comes up with the plan, and updates it in tandem with changes to the world-model, is "general-purpose search": some internal method which can find strategies to achieve a wide variety of goals across a wide variety of (modeled) external world-states. (More on what general purpose search is/isn't here.)
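
As a rough illustration, here is a minimal sketch of a planner that is generic over goals and world-models. It uses plain breadth-first search, which is a stand-in chosen for brevity and not a claim about how humans or any real system search. The `plan` function, the `WORLD` map, and the action names are all hypothetical.

```python
from collections import deque


def plan(start, goal_test, successors):
    """Breadth-first search for a shortest action sequence from start to a goal state."""
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        state, actions = frontier.popleft()
        if goal_test(state):
            return actions
        for action, next_state in successors(state):
            if next_state not in seen:
                seen.add(next_state)
                frontier.append((next_state, actions + [action]))
    return None  # no plan exists in the modeled world


# Hypothetical world-model: locations and the moves available from each one.
WORLD = {
    "home": [("drive to store", "store")],
    "store": [("walk to dairy aisle", "dairy aisle"), ("drive home", "home")],
    "dairy aisle": [("walk to checkout", "checkout")],
    "checkout": [("drive home", "home")],
}


def successors(state):
    return WORLD.get(state, [])


# The same search handles different goals and different starting states.
milk_plan = plan("home", lambda s: s == "dairy aisle", successors)
checkout_plan = plan("store", lambda s: s == "checkout", successors)
print(milk_plan)
print(checkout_plan)
```

The planner never mentions milk or stores: the goal arrives as `goal_test` and the world-model as `successors`, so changing either one changes the plan without changing the search. A fixed strategy would instead hard-code one action sequence.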


Reflection

So there's this thing where some animals recognize themselves in a mirror, and some don't. (You can tell this from the animal e.g. trying to fight with the reflection or scare it away, vs e.g. noticing something sneaking up behind the reflection and then turning to see what's behind them.)

Or humans explicitly think about themselves, and talk about themselves, their own thought processes, how they're perceived by others, yada yada yada. Indeed, it's hard to get humans to stop thinking about themselves for a short while.

This is "reflection": a system represents itself, not just in the trivial way that everything "represents itself", but within its own world-model, including representations of relationships to all the external stuff represented in the world model.


Language

So there's this thing where you can show a toddler an apple and say "apple", repeat with maybe three different apples, and from then on the toddler will mostly interpret "apple" the same way most other humans do. In the minds of two different humans, the word will map to internal representations of roughly-the-same stuff in the environment. Furthermore, words can be composed together in an exponentially huge variety of ways, and different humans will still end up mapping the words to internal representations of roughly-the-same stuff in the environment. (Not super consistently, unfortunately, but enough that humans are able to communicate at all, which is rather remarkable when dealing with an exponentially large space of potential meanings.)

This is "language": two systems coordinate to pass signals between them which map to internal representations of roughly-the-same stuff in the environment.

Thanks for your response. There's a lot of good material here, although some of these components like modules or language seem less central to agency, at least from my perspective. I guess you might see these as appearing slightly down the stack?

They fit naturally into the coherent whole picture. In very broad strokes, that picture looks like selection theorems starting from selection pressures for basic agency, running through natural factorization of problem domains (which is where modules and eventually language come in), then world models and general purpose search (which finds natural factorizations dynamically, rather than in a hard-coded way) once the environment and selection objective have enough variety.