Selection Theorems: A Program For Understanding Agents


Have you seen Mark and my “Agents Over Cartesian World Models”? Though it doesn't have any Selection Theorems in it, and it just focuses on the type signatures of goals, it does go into a lot of detail about possible type signatures for agent's goals and what the implications of those type signatures would be, starting from the idea that a goal can be defined on any part of a Cartesian boundary.

Oh excellent, that's a perfect reference for one of the successor posts to this one. You guys do a much better job explaining what agent type signatures are and giving examples and classification, compared to my rather half-baked sketch here.

Thanks! I hope the post is helpful to you or anyone else trying to think about the type signatures of goals. It's definitely a topic I'm pretty interested in.

Not sure how useful this is, but I think this counts as a selection theorem.

(Paper by Caspar Oesterheld, Joar Skalse, James Bell and me)

https://proceedings.neurips.cc/paper/2021/hash/b9ed18a301c9f3d183938c451fa183df-Abstract.html

We played around with taking learning algorithms designed for multi-armed bandit problems (your action matters but not your policy) and placing them in Newcomblike environments (both your actual action and your probability distribution over actions matter). And then we proved some stuff about their behaviour.

I wish I could review this for the 2022 review, but it's from 2021.

I think this post is pretty valuable. The big takeaway for me was that any argument involving coherence should cache out in selection.

One caveat is that I haven't seen much research into selection theorems in the intervening couple of years, and adjacent things like inductive bias research don't seem to have good applications yet. Maybe it's too hard for where the field is right now.

I like this research agenda because it provides a rigorous framing for thinking about inductive biases for agency and gives detailed and actionable advice for making progress on this problem. I think this is one of the most useful research directions in alignment foundations since it is directly applicable to ML-based AI systems.

Planned summary for the Alignment Newsletter:

This post proposes a research area for understanding agents: **selection theorems**. A selection theorem is a theorem that tells us something about agents that will be selected for in a broad class of environments. Selection theorems are helpful because they tell us likely properties of the agents we build.

As an example, [coherence arguments](https://www.alignmentforum.org/posts/RQpNHSiWaXTvDxt6R/coherent-decisions-imply-consistent-utilities) demonstrate that when an environment presents an agent with “bets” or “lotteries”, where the agent cares only about the outcomes of the bets, then any non-dominated agent can be represented as maximizing expected utility. (What does it mean to be non-dominated? This can vary, but one example would be that the agent is not subject to Dutch books, i.e. situations in which it is guaranteed to lose money.) If you combine this with the very reasonable assumption that we will tend to build non-dominated agents, then we can conclude that we select for agents that can be represented as maximizing expected utility.

Coherence arguments aren’t the only kind of selection theorem. The <@good(er) regulator theorem@>(@Fixing The Good Regulator Theorem@) provides a set of scenarios under which agents learn an internal “world model”. The [Kelly criterion](http://www.eecs.harvard.edu/cs286r/courses/fall10/papers/Chapter6.pdf) tells us about scenarios in which the best (most selected) agents will make bets as though they are maximizing expected log money. These and other examples are described in [this followup post](https://www.alignmentforum.org/posts/N2NebPD78ioyWHhNm/some-existing-selection-theorems).

The rest of this post elaborates on the various parts of a selection theorem, and provides advice on how to make original research contributions in the area of selection theorems. Another [followup post](https://www.alignmentforum.org/posts/RuDD3aQWLDSb4eTXP/what-selection-theorems-do-we-expect-want) describes some useful properties for which the author expects there are useful selection theorems to prove.

Planned opinion:

People sometimes expect me to be against this sort of work, because I wrote <@Coherence arguments do not imply goal-directed behavior@>. This is not true. My point in that post is that coherence arguments _alone_ are not enough, you need to combine them with some other assumption (for example, that there is a money-like resource over which the agent has no terminal preferences). Similarly, I don’t expect this research agenda to find a selection theorem that says that an existential catastrophe occurs _assuming only that the agent is intelligent_, but I do think it is plausible that this research agenda gives us a better picture of agency that tells us something about how AI systems will behave, because we think the assumptions involved in the theorems are quite likely to hold. While I am personally more excited about studying particular development paths to AGI rather than more abstract agent models, I would not actively discourage anyone from doing this sort of research, and I think it would be more useful than other types of research I have seen proposed.

A few comments...

Selection theorems are helpful because they tell us likely properties of the agents we build.

What are selection theorems helpful for? Three possible areas (not necessarily comprehensive):

- Properties of humans as agents (e.g. "human values")
- Properties of agents which we intentionally aim for (e.g. what kind of architectural features are likely to be viable)
- Properties of agents which we accidentally aim for (e.g. inner agency issues)

Of these, I expect the first to be most important, followed by the last, although this depends on the relative difficulty one expects from inner vs outer alignment, as well as the path-to-AGI.

(What does it mean to be non-dominated? This can vary, but one example would be that the agent is not subject to Dutch books, i.e. situations in which it is guaranteed to lose money.)

"Non-dominated" is always (to my knowledge) synonymous with "Pareto optimal", same as the usage in game theory. It varies only to the extent that "pareto optimality of what?" varies; in the case of coherence theorems, it's Pareto optimality with respect to a single utility function over multiple worlds. (Ruling out Dutch books is downstream of that: a Dutch book is a Pareto loss for the agent.)

If you combine this with the very reasonable assumption that we will tend to build non-dominated agents, then we can conclude that we select for agents that can be represented as maximizing expected utility.

... I mean, that's a valid argument, though kinda misses the (IMO) more interesting use-cases, like e.g. "if evolution selects for non-dominated agents, then we conclude that evolution selects for agents that can be represented as maximizing expected utility, and therefore humans are selected for maximizing expected utility". Humans fail to have a utility function not because that argument is wrong, but because the implicit assumptions in the existing coherence theorems are too strong to apply to humans. But this is the sort of argument I hope/expect will work for better selection theorems.

(Also, I would like to emphasize here that I think the current coherence theorems have major problems in their implicit assumptions, and these problems are the main reason they fail for real-world agents, especially humans.)
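To make the Dutch-book point above concrete, here is a minimal sketch of a money pump. The item names, the fee, and the `run_money_pump` helper are illustrative assumptions, not anything taken from the coherence theorems themselves. An agent with cyclic preferences pays for each "upgrade" and ends up holding what it started with, strictly poorer, which is a Pareto loss; an agent with transitive preferences declines every offer.

```python
def run_money_pump(prefers, start_item, offers, fee=1.0):
    """Offer items in sequence; the agent pays `fee` to swap whenever it
    strictly prefers the offered item to the one it holds.

    `prefers` is a set of (better, worse) pairs, a hypothetical encoding
    of strict preference. Returns (final_item, total_paid).
    """
    item, paid = start_item, 0.0
    for offered in offers:
        if (offered, item) in prefers:
            item = offered
            paid += fee
    return item, paid


# Cyclic preferences: A over B, B over C, C over A.
cyclic = {("A", "B"), ("B", "C"), ("C", "A")}
# Transitive preferences: A over B, B over C, A over C.
transitive = {("A", "B"), ("B", "C"), ("A", "C")}

# A bookie cycling through the same three offers three times.
offers = ["C", "B", "A"] * 3

cyclic_result = run_money_pump(cyclic, "A", offers)
transitive_result = run_money_pump(transitive, "A", offers)
```

In this toy setup the cyclic agent finishes holding "A" again after paying nine fees, while the transitive agent never trades.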

Thanks for this and the response to my other comment, I understand where you're coming from a lot better now. (Really I should have figured it out myself, on the basis of this post.) New summary:

This post proposes a research area for understanding agents: **selection theorems**. A selection theorem is a theorem that tells us something about agents that will be selected for in a broad class of environments. Selection theorems are helpful because (1) they can provide additional assumptions that can help with learning values by observing human behavior, and (2) they can tell us likely properties of the agents we build by accident (think inner alignment concerns).

As an example, [coherence arguments](https://www.alignmentforum.org/posts/RQpNHSiWaXTvDxt6R/coherent-decisions-imply-consistent-utilities) demonstrate that when an environment presents an agent with “bets” or “lotteries”, where the agent cares only about the outcomes of the bets, then any “good” agent can be represented as maximizing expected utility. (What does it mean to be “good”? This can vary, but one example would be that the agent is not subject to Dutch books, i.e. situations in which it is guaranteed to lose resources.) This can then be turned into a selection argument by combining it with something that selects for “good” agents. For example, evolution will select for agents that don’t lose resources for no gain, so humans are likely to be represented as maximizing expected utility. Unfortunately, many coherence arguments implicitly assume that the agent has no internal state, which is not true for humans, so this argument does not clearly work. As another example, our ML training procedures will likely also select for agents that don’t waste resources, which could allow us to conclude that the resulting agents can be represented as maximizing expected utility.

Coherence arguments aren’t the only kind of selection theorem. The <@good(er) regulator theorem@>(@Fixing The Good Regulator Theorem@) provides a set of scenarios under which agents learn an internal “world model”. The [Kelly criterion](http://www.eecs.harvard.edu/cs286r/courses/fall10/papers/Chapter6.pdf) tells us about scenarios in which the best (most selected) agents will make bets as though they are maximizing expected log money. These and other examples are described in [this followup post](https://www.alignmentforum.org/posts/N2NebPD78ioyWHhNm/some-existing-selection-theorems).

The rest of this post elaborates on the various parts of a selection theorem, and provides advice on how to make original research contributions in the area of selection theorems. Another [followup post](https://www.alignmentforum.org/posts/RuDD3aQWLDSb4eTXP/what-selection-theorems-do-we-expect-want) describes some useful properties for which the author expects there are useful selection theorems to prove.
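The Kelly criterion mentioned in the summary can be checked numerically in the simplest case. The sketch below assumes a single repeated binary bet with win probability `p` and net odds `b` (both parameter values are made up for illustration) and compares the standard closed-form Kelly fraction against a brute-force search for the betting fraction that maximizes expected log wealth.

```python
import math


def expected_log_growth(f, p, b):
    """Expected log of the wealth multiplier when staking fraction f
    at net odds b with win probability p."""
    return p * math.log(1 + f * b) + (1 - p) * math.log(1 - f)


def kelly_fraction(p, b):
    """Closed-form Kelly fraction for a single binary bet."""
    return p - (1 - p) / b


# Illustrative parameters: a 60% chance to win an even-money bet.
p, b = 0.6, 1.0
f_star = kelly_fraction(p, b)

# Brute-force check over fractions 0.000, 0.001, ..., 0.990.
grid = [i / 1000 for i in range(0, 991)]
f_best = max(grid, key=lambda f: expected_log_growth(f, p, b))

growth_at_kelly = expected_log_growth(f_star, p, b)
growth_when_overbetting = expected_log_growth(0.5, p, b)
```

With these numbers the grid search lands on the closed-form fraction, and staking half the bankroll each round gives negative expected log growth even though each individual bet has positive expected value.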

New opinion:

People sometimes expect me to be against this sort of work, because I wrote <@Coherence arguments do not imply goal-directed behavior@>. This is not true. My point in that post is that coherence arguments _alone_ are not enough, you need to combine them with some other assumption (for example, that there exists some “resource” over which the agent has no terminal preferences). I do think it is plausible that this research agenda gives us a better picture of agency that tells us something about how AI systems will behave, or something about how to better infer human values. While I am personally more excited about studying particular development paths to AGI rather than more abstract agent models, I do think this research would be more useful than other types of alignment research I have seen proposed.

I think that's a reasonable summary as written. Two minor quibbles, which you are welcome to ignore:

Selection theorems are helpful because (1) they can provide additional assumptions that can help with learning values by observing human behavior

I agree with the literal content of this sentence, but I personally don't imagine limiting it to behavioral data. I expect embedding-relevant selection theorems, which would also open the door to using internal structure or low-level dynamics of the brain to learn values (and human models, precision of approximations, etc).

Unfortunately, many coherence arguments implicitly assume that the agent has no internal state, which is not true for humans, so this argument does not clearly work. As another example, our ML training procedures will likely also select for agents that don’t waste resources, which could allow us to conclude that the resulting agents can be represented as maximizing expected utility.

Agents selected by ML (e.g. RL training on games) also often have internal state.

Edited to

Selection theorems are helpful because (1) they can provide additional assumptions that can help with learning human values

and

[...] the resulting agents can be represented as maximizing expected utility, if the agents don't have internal state.

(For the second one, that's one of the reasons why I had the weasel word "could", but on reflection it's worth calling out explicitly given I mention it in the previous sentence.)

At the same time, better Selection Theorems directly tackle the core conceptual problems of alignment and agency; I expect sufficiently-good Selection Theorems would get us most of the way to solving the hardest parts of alignment.

The former statement makes sense, but can you elaborate on the latter statement? I suppose I could imagine selection theorems revealing that we really do get alignment by default, but I don't see how they quickly lead to solutions to AI alignment if there is a problem to solve.

The biggest piece (IMO) would be figuring out key properties of human values. If we look at e.g. your sequence on value learning, the main takeaway of the section on ambitious value learning is "we would need more assumptions". (I would also argue we need *different* assumptions, because some of the currently standard assumptions are wrong - like utility functions.)

That's one thing selection theorems offer: a well-grounded basis for new assumptions for ambitious value learning. (And, as an added bonus, directly bringing selection into the picture means we also have an angle for characterizing how much precision to expect from any approximations.) I consider this the current main bottleneck to progress on outer alignment: we don't even understand what kind-of-thing we're trying to align AI *with*.

(Side-note: this is also the main value which I think the Natural Abstraction Hypothesis offers: it directly tackles the Pointers Problem, and tells us what the "input variables" are for human values.)

Taking a different angle: if we're concerned about malign inner agents, then selection theorems would potentially offer both (1) tools for characterizing selection pressures under which agents are likely to arise (and what goals/world models those agents are likely to have), and (2) ways to look for inner agents by looking directly at the internals of the trained systems. I consider our inability to do (2) in any robust, generalizable way to be the current main bottleneck to progress on inner alignment: we don't even understand what kind-of-thing we're supposed to look for.

Hm. Suppose sometimes I want to model humans as having propositional beliefs, and other times I want to model humans as having probabilistic beliefs, and still other times I want to model human beliefs as a set of contexts and a transition function. What's stopping me?

I think it depends on the application. What seems like the obvious application is building an AI that models human beliefs, or human preferences. What are some of the desiderata we use when choosing how we want an AI to model us, and how do these compare to typical desiderata used in picking model classes for agents?

I like Savage, so I'll pick on him. Before you even get into what he considers the "real" desiderata, he wants to say that there's a set of actions which are functions from states to consequences, and this set is closed under the operation of using one action for some arbitrary states and another action for the rest. But humans very much don't work that way - I'd want a model of humans to account for complicated, psychology-dependent limitations on what actions we consider taking.

Or if we're thinking about modeling humans to extract the "preferences" part of the model: Suppose Person A wants to get out a function that ranks actions, while Person B wants to learn a utility function, its domain of validity, and a custom world-model that the utility function lives in. What's the model for how something like a selection theorem will help them resolve their differences?

You want a model of humans to account for complicated, psychology-dependent limitations on what actions we consider taking. So: what process produced this complicated psychology? Natural selection. What data structures can represent that complicated psychology? That's a type signature question. Put the two together, and we have a selection-theorem-shaped question.

In the example with persons A and B: a set of selection theorems would offer a solid foundation for the type signature of human preferences. Most likely, person B would use whatever types the theorems suggest, rather than a utility function, but if for some reason they really wanted a utility function they would probably compute it as an approximation, compute the domain of validity of the approximation, etc. For person A, turning the relevant types into an action-ranking would likely work much the same way that turning e.g. a utility function into an action-ranking works - i.e. just compute the utility (or whatever metrics turn out to be relevant) and sort. Regardless, if extracting preferences, both of them would probably want to work internally with the type signatures suggested by the theorems.
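The "compute the utility and sort" step for person A is simple enough to write down. This is only a sketch: the action names, the toy utility values, and the `rank_actions` helper are hypothetical, and whatever metric the theorems end up suggesting would take the place of `utility`.

```python
from typing import Callable, Dict, List


def rank_actions(actions: List[str], utility: Callable[[str], float]) -> List[str]:
    """Return the actions sorted from highest to lowest utility."""
    return sorted(actions, key=utility, reverse=True)


# Made-up utilities over three made-up actions.
toy_utility: Dict[str, float] = {"rest": 0.2, "work": 0.7, "explore": 0.5}

ranking = rank_actions(list(toy_utility), lambda a: toy_utility[a])
```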

We can imagine modeling humans in purely psychological ways with no biological inspiration, so I think you're saying that you want to look at the "natural constraints" on representations / processes, and then in a sense generalize or over-charge those constraints to narrow down model choices?

Basically, yes. Though I would add that narrowing down model choices in some legible way is a necessary step if, for instance, we want to be able to *interface* with our models in any other way than querying for probabilities over the low-level state of the system.

Right. I think I'm more of the opinion that we'll end up choosing those interfaces via desiderata that apply more directly to the interface (like "we want to be able to compare two models' ratings of the same possible future"), rather than indirect desiderata on "how a practical agent should look" that we keep adding to until an interface pops out.

The problem with that sort of approach is that the system (i.e. agent) being modeled is not necessarily going to play along with whatever desiderata we want. We can't just be like "I want an interface which does X"; if X is not a natural fit for the system, then what pops out will be very misleading/confusing/antihelpful.

An oversimplified example: suppose I have some predictive model, and I want an interface which gives me a point estimate and confidence interval/region rather than a full distribution. That only works well if the distribution isn't multimodal in any important way. If it is importantly multimodal, then *any* point estimate will be very misleading/confusing/antihelpful.
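A small numerical version of this example, assuming (purely for illustration) an equal mixture of two unit-variance Gaussians centred at -5 and +5. The mean of the mixture sits at 0, where the density is close to zero, so the point estimate describes an outcome the model considers almost impossible.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std sigma at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def mixture_pdf(x, components):
    """Density of a Gaussian mixture; components is a list of (weight, mu, sigma)."""
    return sum(w * normal_pdf(x, mu, s) for w, mu, s in components)


# Hypothetical bimodal predictive distribution.
components = [(0.5, -5.0, 1.0), (0.5, 5.0, 1.0)]

# Point estimate and spread, as a "mean plus interval" interface would report them.
mean = sum(w * mu for w, mu, _ in components)
variance = sum(w * (s ** 2 + mu ** 2) for w, mu, s in components) - mean ** 2
std = math.sqrt(variance)

density_at_mean = mixture_pdf(mean, components)
density_at_mode = mixture_pdf(5.0, components)
```

Here the reported interval (mean 0, standard deviation a little over 5) technically covers both modes, but it gives no hint that the region around the point estimate itself has several orders of magnitude less density than either mode.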

More generally, the takeaway here is "we don't get to arbitrarily choose the type signature"; that choice is dependent on properties of the system.

This might be related to the notion that if we try to dictate the form of a model ahead of time (i.e. some of the parameters are labeled "world model" in the code, and others are labeled "preferences", and inference is done by optimizing the latter over the former), but then just train it to minimize error, the actual content of the parameters after training doesn't need to respect our preconceptions. What the model really "wants" to do in the limit of lots of compute is find a way to encode an accurate simulation of the human in the parameters in a way that bypasses the simplifications we're trying to force on it.

For this problem, which might not be what you're talking about, I think a lot of the solution is algorithmic information theory. Trying to specify neat, human-legible parts for your model (despite not being able to train the parts separately) is kind of like choosing a universal Turing machine made of human-legible parts. In the limit of big powerfulness, the Solomonoff inductor will throw off your puny shackles and simulate the world in a highly accurate (and therefore non human-legible) way. The solution is not better shackles, it's an inference method that trades off between model complexity and error in a different way.

(P.S.: I think there is an "obvious" way to do that, and it's MML learning with some time constant used to turn error rates into total discounted error, which can be summed with model complexity.)

Just posted an analysis of the epistemic strategies underlying selection theorems and their applications. Might be interesting for people who want to go further with selection theorems, either by proving one or by critiquing one.

Interesting. Selection theorems seem like a way of identifying the purposes or source of goal-directedness in agents that seems obvious to us yet hard to pin down. Compare also the ground of optimization.

What’s the type signature of an agent?

For instance, what kind-of-thing is a “goal”? What data structures can represent “goals”? Utility functions are a common choice among theorists, but they don’t seem quite right. And what are the inputs to “goals”? Even when using utility functions, different models use different inputs - Coherence Theorems imply that utilities take in predefined “bet outcomes”, whereas AI researchers often define utilities over “world states” or “world state trajectories”, and human goals seem to be over latent variables in humans’ world models.

And that’s just goals. What about “world models”? Or “agents” in general? What data structures can represent these things, how do they interface with each other and the world, and how do they embed in their low-level world? These are all questions about the type signatures of agents.

One general strategy for answering these sorts of questions is to look for what I’ll call Selection Theorems. Roughly speaking, a Selection Theorem tells us something about what agent type signatures will be selected for (by e.g. natural selection or ML training or economic profitability) in some broad class of environments. In inner/outer agency terms, it tells us what kind of inner agents will be selected by outer optimization processes.

We already have many Selection Theorems: Coherence and Dutch Book theorems, Good Regulator and Gooder Regulator, the Kelly Criterion, etc. These theorems generally seem to point in a similar direction - suggesting deep unifying principles exist - but they have various holes and don’t answer all the questions we want. We need better Selection Theorems if they are to be a foundation for understanding human values, inner agents, value drift, and other core issues of AI alignment.

The quest for better Selection Theorems has a lot of “surface area” - lots of different angles for different researchers to make progress, within a unified framework, but without redundancy. It also requires relatively little ramp-up; I don’t think someone needs to read the entire giant corpus of work on alignment to contribute useful new Selection Theorems. At the same time, better Selection Theorems directly tackle the core conceptual problems of alignment and agency; I expect sufficiently-good Selection Theorems would get us most of the way to solving the hardest parts of alignment. Overall, I think they’re a good angle for people who want to make useful progress on the theory of alignment and agency, and have strong theoretical/conceptual skills.

Outline of this post:

## What’s A Type Signature Of An Agent?

We’ll view the “type signature of an agent” as an answer to three main questions:

A selection theorem typically assumes some parts of the type signature (often implicitly), and derives others.

For example, coherence theorems show that any non-dominated strategy is equivalent to maximization of Bayesian expected utility.

Coherence theorems fall short of what we ultimately want in a lot of ways: neither the assumptions nor the type signature are quite the right form for real-world agents. (More on that later.) But they’re a good illustration of what a selection theorem is, and how it tells us about the type signature of agents.
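The "type signature" language can be read fairly literally. As a rough sketch, using hypothetical type names that are not from the post, here is how the input type of a utility function differs between a coherence-theorem-style setup (bet outcomes) and the world-state and trajectory versions mentioned earlier. The output type is the same in each case; only the domain changes.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class BetOutcome:
    """A predefined outcome of a bet (hypothetical toy type)."""
    payoff: float


@dataclass(frozen=True)
class WorldState:
    """A snapshot of the world, reduced to one field for illustration."""
    resources: float


# Three candidate type signatures for a "goal", differing only in input type.
UtilityOverBets = Callable[[BetOutcome], float]
UtilityOverStates = Callable[[WorldState], float]
UtilityOverTrajectories = Callable[[Sequence[WorldState]], float]


def bet_utility(outcome: BetOutcome) -> float:
    return outcome.payoff


def state_utility(state: WorldState) -> float:
    return state.resources


def trajectory_utility(trajectory: Sequence[WorldState]) -> float:
    # Toy choice: add up per-state utilities along the trajectory.
    return sum(state_utility(s) for s in trajectory)


trajectory = [WorldState(1.0), WorldState(2.0), WorldState(3.0)]
total = trajectory_utility(trajectory)
```

Goals over latent variables in an agent's world model would need yet another input type, one that depends on how the world model itself is represented, which is part of why the question is open.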

Here are some examples of “type signature” questions for specific aspects of agents:

- [...] within the world model?
- Does the agent perform search/optimization within the world model, or in the world directly?
- Does agent-like behavior imply agent-like internal architecture?

## What’s A Selection Theorem?

A Selection Theorem tells us something about what agent type signatures will be selected for in some broad class of environments. Two important points:

- [...] every question about agent type signatures; it just needs to tell us something about agent type signatures.

For instance, the subagents argument says that, when our “agents” have internal state in a coherence-theorem-like setup, the “goals” will be pareto optimality over multiple utilities, rather than optimality of a single utility function. This says very little about embeddedness or world models or internal architecture; it addresses only one narrow aspect of agent type signatures. And, like the coherence theorems, it doesn’t directly talk about selection; it just says that any strategy which doesn’t fit the pareto-optimal form is strictly dominated by some other strategy (and therefore we’d expect that other strategy to be selected, all else equal).

Most Selection Theorems, in the short-to-medium term, will probably be like that: they’ll each address just one particular aspect of agent type signatures. That’s fine. As long as the assumptions are general enough and realistic enough, we can use lots of theorems together to narrow down the space of possible types.
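A toy sketch of the "pareto optimality over multiple utilities" form from the subagents argument: the agent below is made of two hypothetical subagents, each with its own utility over three made-up states, and it accepts a trade only when no subagent loses and at least one gains. The state names, utility values, and `accepts_trade` helper are all assumptions for illustration.

```python
from typing import Callable, Sequence


def accepts_trade(
    current: str,
    offered: str,
    utilities: Sequence[Callable[[str], float]],
) -> bool:
    """Accept only Pareto improvements: no subagent loses, at least one strictly gains."""
    deltas = [u(offered) - u(current) for u in utilities]
    return all(d >= 0 for d in deltas) and any(d > 0 for d in deltas)


# Two hypothetical subagents that disagree about x versus y but agree z is best.
u1 = {"x": 0.0, "y": 1.0, "z": 2.0}
u2 = {"x": 1.0, "y": 0.0, "z": 2.0}
subagents = [u1.__getitem__, u2.__getitem__]

x_to_y = accepts_trade("x", "y", subagents)  # subagent 2 would lose
y_to_x = accepts_trade("y", "x", subagents)  # subagent 1 would lose
x_to_z = accepts_trade("x", "z", subagents)  # both gain
```

The agent refuses to move between x and y in either direction, so it cannot be walked around a cycle, yet it is not behaving like a maximizer of one utility function with a strict preference between x and y, since such an agent would accept one of the two trades.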

Eventually, I do expect that most of the core ideas of Selection Theorems will be unified into a small number of Fundamental Theorems of Agency - perhaps even a single theorem. But that’s not a necessary assumption for the usefulness of this program, and regardless, I expect a lot of progress on theorems addressing specific aspects of agent type signatures before then.

## How to work on Selection Theorems

## New Theorems

The most open-ended way to work on the Selection Theorems program is, of course, to come up with new Selection Theorems.

If you’re relatively-new to this sort of work and wondering how one comes up with useful new theorems, here are some possible starting points:

- [...] frame, and try to apply it. For example, I’ve been getting a surprising amount of mileage out of the comparative advantage frame lately; it turns out to give some neat variants of the Coherence Theorems.

Also, take a look at What’s So Bad About Ad-Hoc Mathematical Definitions? to help build some useful aesthetic intuitions.

## Incremental Work

This is work which starts from one or more existing selection theorem(s), and improves on them somehow.

Some starting points with examples where I’ve personally found them useful before:

- [...] subagents idea started from trying (and failing) to apply Coherence Theorems to financial markets.
- [...] a logical inductor implemented as a market; I extended that to show that any logical inductor is behaviorally equivalent to a market.
- [...] Gooder Regulator was basically that.

A couple other approaches for which I don’t have a great example from my own work, but which I expect to be similarly fruitful:

## Up Next

I currently have two follow-up posts planned:

These are explicitly intended to help people come up with ways to contribute to the Selection Theorems program.