Selection Theorems: A Program For Understanding Agents

[-]evhub4y*140

Have you seen “Agents Over Cartesian World Models” by Mark and me? Though it doesn't have any Selection Theorems in it and focuses only on the type signatures of goals, it does go into a lot of detail about possible type signatures for agents' goals and what the implications of those type signatures would be, starting from the idea that a goal can be defined on any part of a Cartesian boundary.

[-]johnswentworth4y110

Oh excellent, that's a perfect reference for one of the successor posts to this one. You guys do a much better job explaining what agent type signatures are and giving examples and classification, compared to my rather half-baked sketch here.

[-]evhub4y20

Thanks! I hope the post is helpful to you or anyone else trying to think about the type signatures of goals. It's definitely a topic I'm pretty interested in.

[-]Linda Linsefors4y110

Not sure how useful this is, but I think this counts as a selection theorem.
(Paper by Caspar Oesterheld, Joar Skalse, James Bell and me)

https://proceedings.neurips.cc/paper/2021/hash/b9ed18a301c9f3d183938c451fa183df-Abstract.html

We played around with taking learning algorithms designed for multi-armed bandit problems (your action matters but not your policy) and placing them in Newcomblike environments (both your actual action and your probability distribution over actions matter). And then we proved some stuff about their behaviour.

[-]johnswentworth4y50

That is definitely a selection theorem, and sounds like a really cool one! Well done.

[-]Thomas Kwa2y*40

I wish I could review this for the 2022 review, but it's from 2021.

I think this post is pretty valuable. The big takeaway for me was that any argument involving coherence should cash out in selection.

One caveat is that I haven't seen much research into selection theorems in the intervening couple of years, and adjacent things like inductive bias research don't seem to have good applications yet. Maybe it's too hard for where the field is right now.

[-]Vika3y40Review for 2021 Review

I like this research agenda because it provides a rigorous framing for thinking about inductive biases for agency and gives detailed and actionable advice for making progress on this problem. I think this is one of the most useful research directions in alignment foundations since it is directly applicable to ML-based AI systems.

[-]Rohin Shah4y40

Planned summary for the Alignment Newsletter:

This post proposes a research area for understanding agents: **selection theorems**. A selection theorem is a theorem that tells us something about agents that will be selected for in a broad class of environments. Selection theorems are helpful because they tell us likely properties of the agents we build.
As an example, [coherence arguments](https://www.alignmentforum.org/posts/RQpNHSiWaXTvDxt6R/coherent-decisions-imply-consistent-utilities) demonstrate that when an environment presents an agent with “bets” or “lotteries”, where the agent cares only about the outcomes of the bets, then any non-dominated agent can be represented as maximizing expected utility. (What does it mean to be non-dominated? This can vary, but one example would be that the agent is not subject to Dutch books, i.e. situations in which it is guaranteed to lose money.) If you combine this with the very reasonable assumption that we will tend to build non-dominated agents, then we can conclude that we select for agents that can be represented as maximizing expected utility.
Coherence arguments aren’t the only kind of selection theorem. The <@good(er) regulator theorem@>(@Fixing The Good Regulator Theorem@) provides a set of scenarios under which agents learn an internal “world model”. The [Kelly criterion](http://www.eecs.harvard.edu/cs286r/courses/fall10/papers/Chapter6.pdf) tells us about scenarios in which the best (most selected) agents will make bets as though they are maximizing expected log money. These and other examples are described in [this followup post](https://www.alignmentforum.org/posts/N2NebPD78ioyWHhNm/some-existing-selection-theorems).
The rest of this post elaborates on the various parts of a selection theorem, and provides advice on how to make original research contributions in the area of selection theorems. Another [followup post](https://www.alignmentforum.org/posts/RuDD3aQWLDSb4eTXP/what-selection-theorems-do-we-expect-want) describes some useful properties for which the author expects there are useful selection theorems to prove.
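
To make the Kelly criterion example in the summary concrete, here is a minimal sketch, assuming a simple repeated binary bet. The names `p_win`, `net_odds`, and the helper functions are illustrative assumptions, not something from the post. It checks that the fraction of wealth maximizing expected log wealth agrees with the standard closed-form Kelly fraction for this kind of bet.

```python
import math


def expected_log_growth(fraction, p_win, net_odds):
    """Expected log of the wealth multiplier when staking `fraction` of wealth.

    On a win the stake pays `net_odds` to 1; on a loss the stake is gone.
    """
    win_term = math.log(1.0 + fraction * net_odds)
    lose_term = math.log(1.0 - fraction)
    return p_win * win_term + (1.0 - p_win) * lose_term


def kelly_fraction(p_win, net_odds):
    """Closed-form maximizer of expected_log_growth for a binary bet."""
    return max(0.0, p_win - (1.0 - p_win) / net_odds)


def grid_argmax(p_win, net_odds, steps=1000):
    """Brute-force check: best fraction on a grid over [0, 1)."""
    candidates = [i / steps for i in range(steps)]
    return max(candidates, key=lambda f: expected_log_growth(f, p_win, net_odds))


# Hypothetical bet: 60% chance to win at even odds.
p_win, net_odds = 0.6, 1.0
best_closed_form = kelly_fraction(p_win, net_odds)
best_on_grid = grid_argmax(p_win, net_odds)
print(best_closed_form, best_on_grid)
```

Under these assumptions, an agent that stakes much more than the Kelly fraction has negative expected log growth, which is one way to see why such bettors tend not to be the ones "selected" over many rounds.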

Planned opinion:

People sometimes expect me to be against this sort of work, because I wrote <@Coherence arguments do not imply goal-directed behavior@>. This is not true. My point in that post is that coherence arguments _alone_ are not enough; you need to combine them with some other assumption (for example, that there is a money-like resource over which the agent has no terminal preferences). Similarly, I don’t expect this research agenda to find a selection theorem that says that an existential catastrophe occurs _assuming only that the agent is intelligent_, but I do think it is plausible that this research agenda gives us a better picture of agency that tells us something about how AI systems will behave, because we think the assumptions involved in the theorems are quite likely to hold. While I am personally more excited about studying particular development paths to AGI rather than more abstract agent models, I would not actively discourage anyone from doing this sort of research, and I think it would be more useful than other types of research I have seen proposed.

[-]johnswentworth4y20

A few comments...

> Selection theorems are helpful because they tell us likely properties of the agents we build.

What are selection theorems helpful for? Three possible areas (not necessarily comprehensive):

1. Properties of humans as agents (e.g. "human values")
2. Properties of agents which we intentionally aim for (e.g. what kind of architectural features are likely to be viable)
3. Properties of agents which we accidentally aim for (e.g. inner agency issues)

Of these, I expect the first to be most important, followed by the last, although this depends on the relative difficulty one expects from inner vs outer alignment, as well as the path-to-AGI.

> (What does it mean to be non-dominated? This can vary, but one example would be that the agent is not subject to Dutch books, i.e. situations in which it is guaranteed to lose money.)

"Non-dominated" is always (to my knowledge) synonymous with "Pareto optimal", same as the usage in game theory. It varies only to the extent that "pareto optimality of what?" varies; in the case of coherence theorems, it's Pareto optimality with respect to a single utility function over multiple worlds. (Ruling out Dutch books is downstream of that: a Dutch book is a Pareto loss for the agent.)

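As a toy illustration of "a Dutch book is a Pareto loss", here is a minimal sketch, assuming an agent whose betting prices for two mutually exclusive, exhaustive events sum to more than 1. The prices, world names, and helper functions are hypothetical. Accepting both bets yields a payoff vector over worlds that is Pareto-dominated by declining.

```python
def dominates(a, b):
    """True if payoff vector `a` is at least as good as `b` in every world
    and strictly better in at least one (Pareto dominance)."""
    pairs = list(zip(a, b))
    return all(x >= y for x, y in pairs) and any(x > y for x, y in pairs)


# Hypothetical incoherent agent: it will pay 0.6 for a ticket paying 1 if it
# rains, and also 0.6 for a ticket paying 1 if it stays dry.
PRICE_RAIN = 0.6
PRICE_DRY = 0.6
WORLDS = ["rain", "dry"]


def net_payoff_if_both_bought(world):
    """Net money after buying both tickets, in the given world."""
    cost = PRICE_RAIN + PRICE_DRY
    winnings = (1.0 if world == "rain" else 0.0) + (1.0 if world == "dry" else 0.0)
    return winnings - cost


accept_both = [net_payoff_if_both_bought(w) for w in WORLDS]
decline = [0.0 for _ in WORLDS]

print(accept_both, decline, dominates(decline, accept_both))
```

The agent loses money in every world, so "decline" dominates "accept both" regardless of which world obtains.
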
> If you combine this with the very reasonable assumption that we will tend to build non-dominated agents, then we can conclude that we select for agents that can be represented as maximizing expected utility.

... I mean, that's a valid argument, though kinda misses the (IMO) more interesting use-cases, like e.g. "if evolution selects for non-dominated agents, then we conclude that evolution selects for agents that can be represented as maximizing expected utility, and therefore humans are selected for maximizing expected utility". Humans fail to have a utility function not because that argument is wrong, but because the implicit assumptions in the existing coherence theorems are too strong to apply to humans. But this is the sort of argument I hope/expect will work for better selection theorems.

(Also, I would like to emphasize here that I think the current coherence theorems have major problems in their implicit assumptions, and these problems are the main reason they fail for real-world agents, especially humans.)

[-]Rohin Shah4y40

Thanks for this and the response to my other comment, I understand where you're coming from a lot better now. (Really I should have figured it out myself, on the basis of this post.) New summary:

This post proposes a research area for understanding agents: **selection theorems**. A selection theorem is a theorem that tells us something about agents that will be selected for in a broad class of environments. Selection theorems are helpful because (1) they can provide additional assumptions that can help with learning values by observing human behavior, and (2) they can tell us likely properties of the agents we build by accident (think inner alignment concerns).
As an example, [coherence arguments](https://www.alignmentforum.org/posts/RQpNHSiWaXTvDxt6R/coherent-decisions-imply-consistent-utilities) demonstrate that when an environment presents an agent with “bets” or “lotteries”, where the agent cares only about the outcomes of the bets, then any “good” agent can be represented as maximizing expected utility. (What does it mean to be “good”? This can vary, but one example would be that the agent is not subject to Dutch books, i.e. situations in which it is guaranteed to lose resources.) This can then be turned into a selection argument by combining it with something that selects for “good” agents. For example, evolution will select for agents that don’t lose resources for no gain, so humans are likely to be represented as maximizing expected utility. Unfortunately, many coherence arguments implicitly assume that the agent has no internal state, which is not true for humans, so this argument does not clearly work. As another example, our ML training procedures will likely also select for agents that don’t waste resources, which could allow us to conclude that the resulting agents can be represented as maximizing expected utility.
Coherence arguments aren’t the only kind of selection theorem. The <@good(er) regulator theorem@>(@Fixing The Good Regulator Theorem@) provides a set of scenarios under which agents learn an internal “world model”. The [Kelly criterion](http://www.eecs.harvard.edu/cs286r/courses/fall10/papers/Chapter6.pdf) tells us about scenarios in which the best (most selected) agents will make bets as though they are maximizing expected log money. These and other examples are described in [this followup post](https://www.alignmentforum.org/posts/N2NebPD78ioyWHhNm/some-existing-selection-theorems).
The rest of this post elaborates on the various parts of a selection theorem, and provides advice on how to make original research contributions in the area of selection theorems. Another [followup post](https://www.alignmentforum.org/posts/RuDD3aQWLDSb4eTXP/what-selection-theorems-do-we-expect-want) describes some useful properties for which the author expects there are useful selection theorems to prove.

New opinion:

People sometimes expect me to be against this sort of work, because I wrote <@Coherence arguments do not imply goal-directed behavior@>. This is not true. My point in that post is that coherence arguments _alone_ are not enough; you need to combine them with some other assumption (for example, that there exists some “resource” over which the agent has no terminal preferences). I do think it is plausible that this research agenda gives us a better picture of agency that tells us something about how AI systems will behave, or something about how to better infer human values. While I am personally more excited about studying particular development paths to AGI rather than more abstract agent models, I do think this research would be more useful than other types of alignment research I have seen proposed.

[-]johnswentworth4y40

I think that's a reasonable summary as written. Two minor quibbles, which you are welcome to ignore:

> Selection theorems are helpful because (1) they can provide additional assumptions that can help with learning values by observing human behavior

I agree with the literal content of this sentence, but I personally don't imagine limiting it to behavioral data. I expect embedding-relevant selection theorems, which would also open the door to using internal structure or low-level dynamics of the brain to learn values (and human models, precision of approximations, etc).

> Unfortunately, many coherence arguments implicitly assume that the agent has no internal state, which is not true for humans, so this argument does not clearly work. As another example, our ML training procedures will likely also select for agents that don’t waste resources, which could allow us to conclude that the resulting agents can be represented as maximizing expected utility.

Agents selected by ML (e.g. RL training on games) also often have internal state.

[-]Rohin Shah4y40

Edited to

> Selection theorems are helpful because (1) they can provide additional assumptions that can help with learning human values

and

> [...] the resulting agents can be represented as maximizing expected utility, if the agents don't have internal state.

(For the second one, that's one of the reasons why I had the weasel word "could", but on reflection it's worth calling out explicitly given I mention it in the previous sentence.)

[-]johnswentworth4y40

Cool, looks good.

[-]Rohin Shah4y40

> At the same time, better Selection Theorems directly tackle the core conceptual problems of alignment and agency; I expect sufficiently-good Selection Theorems would get us most of the way to solving the hardest parts of alignment.

The former statement makes sense, but can you elaborate on the latter statement? I suppose I could imagine selection theorems revealing that we really do get alignment by default, but I don't see how they quickly lead to solutions to AI alignment if there is a problem to solve.

[-]johnswentworth4y40

The biggest piece (IMO) would be figuring out key properties of human values. If we look at e.g. your sequence on value learning, the main takeaway of the section on ambitious value learning is "we would need more assumptions". (I would also argue we need different assumptions, because some of the currently standard assumptions are wrong - like utility functions.)

That's one thing selection theorems offer: a well-grounded basis for new assumptions for ambitious value learning. (And, as an added bonus, directly bringing selection into the picture means we also have an angle for characterizing how much precision to expect from any approximations.) I consider this the current main bottleneck to progress on outer alignment: we don't even understand what kind-of-thing we're trying to align AI with.

(Side-note: this is also the main value which I think the Natural Abstraction Hypothesis offers: it directly tackles the Pointers Problem, and tells us what the "input variables" are for human values.)

Taking a different angle: if we're concerned about malign inner agents, then selection theorems would potentially offer both (1) tools for characterizing selection pressures under which agents are likely to arise (and what goals/world models those agents are likely to have), and (2) ways to look for inner agents by looking directly at the internals of the trained systems. I consider our inability to do (2) in any robust, generalizable way to be the current main bottleneck to progress on inner alignment: we don't even understand what kind-of-thing we're supposed to look for.

[-]Charlie Steiner4y40

Hm. Suppose sometimes I want to model humans as having propositional beliefs, and other times I want to model humans as having probabilistic beliefs, and still other times I want to model human beliefs as a set of contexts and a transition function. What's stopping me?

I think it depends on the application. What seems like the obvious application is building an AI that models human beliefs, or human preferences. What are some of the desiderata we use when choosing how we want an AI to model us, and how do these compare to typical desiderata used in picking model classes for agents?

I like Savage, so I'll pick on him. Before you even get into what he considers the "real" desiderata, he wants to say that there's a set of actions which are functions from states to consequences, and this set is closed under the operation of using one action for some arbitrary states and another action for the rest. But humans very much don't work that way - I'd want a model of humans to account for complicated, psychology-dependent limitations on what actions we consider taking.

Or if we're thinking about modeling humans to extract the "preferences" part of the model: Suppose Person A wants to get out a function that ranks actions, while Person B wants to learn a utility function, its domain of validity, and a custom world-model that the utility function lives in. What's the model for how something like a selection theorem will help them resolve their differences?

[-]johnswentworth4y20

You want a model of humans to account for complicated, psychology-dependent limitations on what actions we consider taking. So: what process produced this complicated psychology? Natural selection. What data structures can represent that complicated psychology? That's a type signature question. Put the two together, and we have a selection-theorem-shaped question.

In the example with persons A and B: a set of selection theorems would offer a solid foundation for the type signature of human preferences. Most likely, person B would use whatever types the theorems suggest, rather than a utility function, but if for some reason they really wanted a utility function they would probably compute it as an approximation, compute the domain of validity of the approximation, etc. For person A, turning the relevant types into an action-ranking would likely work much the same way that turning e.g. a utility function into an action-ranking works - i.e. just compute the utility (or whatever metrics turn out to be relevant) and sort. Regardless, if extracting preferences, both of them would probably want to work internally with the type signatures suggested by the theorems.
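
The "compute the utility and sort" step for person A can be sketched as follows. This is only an illustration under the assumption that some scalar metric over actions is available; the action names, the toy utility values, and `rank_actions` are all hypothetical.

```python
def rank_actions(actions, metric):
    """Return actions ordered from most to least preferred under `metric`.

    `metric` stands in for a utility function, or for whatever quantity the
    relevant type signature turns out to provide.
    """
    return sorted(actions, key=metric, reverse=True)


# Toy stand-in for an extracted utility function over three actions.
toy_utility = {"rest": 0.2, "work": 0.7, "explore": 0.5}

ranking = rank_actions(toy_utility.keys(), toy_utility.get)
print(ranking)
```

The point of the sketch is only that the ranking is a cheap downstream view; the substantive work is in deciding what `metric` should be and where it is valid.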

[-]Charlie Steiner4y10

We can imagine modeling humans in purely psychological ways with no biological inspiration, so I think you're saying that you want to look at the "natural constraints" on representations / processes, and then in a sense generalize or over-charge those constraints to narrow down model choices?

[-]johnswentworth4y*20

Basically, yes. Though I would add that narrowing down model choices in some legible way is a necessary step if, for instance, we want to be able to interface with our models in any other way than querying for probabilities over the low-level state of the system.

[-]Charlie Steiner4y10

Right. I think I'm more of the opinion that we'll end up choosing those interfaces via desiderata that apply more directly to the interface (like "we want to be able to compare two models' ratings of the same possible future"), rather than indirect desiderata on "how a practical agent should look" that we keep adding to until an interface pops out.

[-]johnswentworth4y30

The problem with that sort of approach is that the system (i.e. agent) being modeled is not necessarily going to play along with whatever desiderata we want. We can't just be like "I want an interface which does X"; if X is not a natural fit for the system, then what pops out will be very misleading/confusing/antihelpful.

An oversimplified example: suppose I have some predictive model, and I want an interface which gives me a point estimate and confidence interval/region rather than a full distribution. That only works well if the distribution isn't multimodal in any important way. If it is importantly multimodal, then any point estimate will be very misleading/confusing/antihelpful.
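
A minimal numerical sketch of this failure mode, assuming a hypothetical predictive distribution that is an equal mixture of two well-separated Gaussians (the modes at -5 and +5 and the sample sizes are made up for illustration):

```python
import random
import statistics

random.seed(0)

# Hypothetical bimodal predictive distribution: half the mass near -5,
# half near +5.
samples = [random.gauss(-5.0, 0.5) for _ in range(5000)]
samples += [random.gauss(5.0, 0.5) for _ in range(5000)]

# The "interface" we asked for: a point estimate plus a symmetric interval.
mean = statistics.fmean(samples)
stdev = statistics.pstdev(samples)
interval = (mean - 2 * stdev, mean + 2 * stdev)

# Fraction of samples that actually fall within 1 unit of the point estimate.
near_mean = sum(1 for x in samples if abs(x - mean) < 1.0) / len(samples)

print(mean, interval, near_mean)
```

The point estimate lands between the modes, in a region where essentially no samples fall, and the interval is wide enough to hide the fact that there are two distinct clusters.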

More generally, the takeaway here is "we don't get to arbitrarily choose the type signature"; that choice is dependent on properties of the system.

[-]Charlie Steiner4y10

This might be related to the notion that if we try to dictate the form of a model ahead of time (i.e. some of the parameters are labeled "world model" in the code, and others are labeled "preferences", and inference is done by optimizing the latter over the former), but then just train it to minimize error, the actual content of the parameters after training doesn't need to respect our preconceptions. What the model really "wants" to do in the limit of lots of compute is find a way to encode an accurate simulation of the human in the parameters in a way that bypasses the simplifications we're trying to force on it.

For this problem, which might not be what you're talking about, I think a lot of the solution is algorithmic information theory. Trying to specify neat, human-legible parts for your model (despite not being able to train the parts separately) is kind of like choosing a universal Turing machine made of human-legible parts. In the limit of big powerfulness, the Solomonoff inductor will throw off your puny shackles and simulate the world in a highly accurate (and therefore non human-legible) way. The solution is not better shackles, it's an inference method that trades off between model complexity and error in a different way.

(P.S.: I think there is an "obvious" way to do that, and it's MML learning with some time constant used to turn error rates into total discounted error, which can be summed with model complexity.)
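
One possible reading of that P.S., as a rough sketch only: treat each model as having a description length in bits and an average prediction error in bits per time step, convert the error rate into a total by discounting with a time constant, and add the two. The formula, the function names, and the numbers below are assumptions made for illustration, not a worked-out proposal.

```python
def discounted_total_error(error_bits_per_step, time_constant):
    """Total discounted error for a constant per-step error rate.

    With geometric discount factor gamma = 1 - 1/time_constant, the sum
    of error * gamma**t over t = 0, 1, 2, ... equals error * time_constant.
    """
    return error_bits_per_step * time_constant


def mml_style_score(complexity_bits, error_bits_per_step, time_constant):
    """Lower is better: model description length plus discounted error."""
    return complexity_bits + discounted_total_error(error_bits_per_step, time_constant)


def preferred_model(models, time_constant):
    """Pick the model name with the lowest combined score."""
    return min(
        models,
        key=lambda name: mml_style_score(*models[name], time_constant),
    )


# Hypothetical candidates: (complexity in bits, error in bits per step).
models = {
    "legible": (100.0, 2.0),
    "simulation": (10_000.0, 0.5),
}

short_horizon_choice = preferred_model(models, time_constant=1_000.0)
long_horizon_choice = preferred_model(models, time_constant=100_000.0)
print(short_horizon_choice, long_horizon_choice)
```

In this toy setup the time constant acts as the knob: a short horizon favors the simpler model, while a long horizon lets accumulated error outweigh the extra description length.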

[-]adamShimi4y30

Just posted an analysis of the epistemic strategies underlying selection theorems and their applications. Might be interesting for people who want to go further with selection theorems, either by proving one or by critiquing one.

[-]Gordon Seidoh Worley4y30

Interesting. Selection theorems seem like a way of identifying the purposes or source of goal-directedness in agents, which seem obvious to us yet are hard to pin down. Compare also the ground of optimization.

AI ALIGNMENT FORUM