What’s the type signature of an agent?
For instance, what kind-of-thing is a “goal”? What data structures can represent “goals”? Utility functions are a common choice among theorists, but they don’t seem quite right. And what are the inputs to “goals”? Even when using utility functions, different models use different inputs - Coherence Theorems imply that utilities take in predefined “bet outcomes”, whereas AI researchers often define utilities over “world states” or “world state trajectories”, and human goals seem to be over latent variables in humans’ world models.
And that’s just goals. What about “world models”? Or “agents” in general? What data structures can represent these things, how do they interface with each other and the world, and how do they embed in their low-level world? These are all questions about the type signatures of agents.
One general strategy for answering these sorts of questions is to look for what I’ll call Selection Theorems. Roughly speaking, a Selection Theorem tells us something about what agent type signatures will be selected for (by e.g. natural selection or ML training or economic profitability) in some broad class of environments. In inner/outer agency terms, it tells us what kind of inner agents will be selected by outer optimization processes.
We already have many Selection Theorems: Coherence and Dutch Book theorems, Good Regulator and Gooder Regulator, the Kelly Criterion, etc. These theorems generally seem to point in a similar direction - suggesting deep unifying principles exist - but they have various holes and don’t answer all the questions we want. We need better Selection Theorems if they are to be a foundation for understanding human values, inner agents, value drift, and other core issues of AI alignment.
The quest for better Selection Theorems has a lot of “surface area” - lots of different angles for different researchers to make progress, within a unified framework, but without redundancy. It also requires relatively little ramp-up; I don’t think someone needs to read the entire giant corpus of work on alignment to contribute useful new Selection Theorems. At the same time, better Selection Theorems directly tackle the core conceptual problems of alignment and agency; I expect sufficiently-good Selection Theorems would get us most of the way to solving the hardest parts of alignment. Overall, I think they’re a good angle for people who want to make useful progress on the theory of alignment and agency, and have strong theoretical/conceptual skills.
Outline of this post:
- More detail on what “type signatures” and “Selection Theorems” are
- Examples of existing Selection Theorems and what they prove (or assume) about agent type signatures
- Aspects which I expect/want from future Selection Theorems
- How to work on Selection Theorems
What’s A Type Signature Of An Agent?
We’ll view the “type signature of an agent” as an answer to three main questions:
- Representation: What “data structure” represents the agent - i.e. what are its high-level components, and how can they be represented?
- Interfaces: What are the “inputs” and “outputs” between the components - i.e. how do they interface with each other and with the environment?
- Embedding: How does the abstract “data structure” representation relate to the low-level system in which the agent is implemented?
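To make the three questions a bit more tangible, here is a minimal (and entirely hypothetical) rendering in Python - the field names and signatures are illustrative choices, not settled answers:

```python
from dataclasses import dataclass
from typing import Any, Callable

# Toy rendering of the three type-signature questions (names are my own
# illustrative picks, not claims about the right answer):
#  - Representation: the fields of the dataclass
#  - Interfaces: the input/output types of `goal` and `policy`
#  - Embedding: `extract`, the map from a low-level system to this structure

@dataclass
class AgentSig:
    world_model: Any                      # e.g. a probability distribution
    goal: Callable[[Any], float]          # e.g. a utility over outcomes
    policy: Callable[[Any], Any]          # observations -> actions

def extract(low_level_system: Any) -> AgentSig:
    """Embedding question: how does a physical system map to an AgentSig?
    Deliberately unimplemented - this is what Selection Theorems are about."""
    raise NotImplementedError
```

The point of the sketch is just that the three questions are answered by different parts of the code: the fields, the function types, and the (unimplemented) extraction map.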
A selection theorem typically assumes some parts of the type signature (often implicitly), and derives others.
For example, coherence theorems show that any non-dominated strategy is equivalent to maximization of Bayesian expected utility.
- Representation: utility function and probability distribution.
- Interfaces: both the utility function and distribution take in “bet outcomes”, assumed to be specified as part of the environment. The outputs of the agent are “actions” which maximize expected utility under the distribution; the inputs are “observations” which update the distribution via Bayes’ Rule.
- Embedding: “agent” must interact with “environment” only via the specified “bets”. Utility function and distribution relate to low-level agent implementation via behavioral equivalence.
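To make that type signature concrete, here is a toy behavioral sketch (setup and names mine, not from any particular coherence theorem): a distribution and a utility function over “bet outcomes”, with expected-utility-maximizing outputs and Bayesian-updating inputs.

```python
# Toy sketch of the type signature coherence theorems derive.
# dist: {outcome: probability}; an action maps outcomes to results.

def expected_utility(action, dist, utility):
    """Expected utility of one action under the current distribution."""
    return sum(p * utility(action(o)) for o, p in dist.items())

def choose(actions, dist, utility):
    """Output interface: pick the action maximizing expected utility."""
    return max(actions, key=lambda a: expected_utility(a, dist, utility))

def bayes_update(dist, likelihood):
    """Input interface: condition on an observation, given
    likelihood(outcome) = P(observation | outcome)."""
    posterior = {o: p * likelihood(o) for o, p in dist.items()}
    z = sum(posterior.values())
    return {o: p / z for o, p in posterior.items()}
```

Note that this represents the agent purely behaviorally: the coherence theorems say a non-dominated agent is equivalent to *some* such distribution/utility pair, not that these objects appear explicitly anywhere in its implementation.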
Coherence theorems fall short of what we ultimately want in a lot of ways: neither the assumptions nor the type signature is quite the right form for real-world agents. (More on that later.) But they’re a good illustration of what a selection theorem is, and how it tells us about the type signature of agents.
Here are some examples of “type signature” questions for specific aspects of agents:
- World models
- Does the agent have a world model or models?
- What data structure can represent an agent’s world model? (Probability distributions are the most common choice.)
- How does the agent’s world model correspond to the world? (For instance, which physical things do the random variables in a probability distribution correspond to, if any?)
- What’s the relationship between the abstract “world model” and the physical stuff from which the world model is built?
- Does the agent have a goal or goals?
- What data structure can represent an agent’s goal? (Utility functions are the most common choice.)
- How does the goal correspond to the world - especially if it’s evaluated within the world model?
- Does the agent have well-defined goals or world models or other components?
- Does the agent perform search/optimization within the world model, or in the world directly?
- What are the agent’s “inputs” and “outputs” - e.g. actions and observations?
- Does agent-like behavior imply agent-like internal architecture?
- What’s the relationship between the abstract “agent” and the physical stuff from which the agent is built?
What’s A Selection Theorem?
A Selection Theorem tells us something about what agent type signatures will be selected for in some broad class of environments. Two important points:
- The theorem need not directly talk about selection - e.g. it could state some general property of optima, of “broad” optima, of “most” optima, or of optima under a particular kind of selection pressure (like natural selection or financial profitability).
- Any given theorem need not address every question about agent type signatures; it just needs to tell us something about agent type signatures.
For instance, the subagents argument says that, when our “agents” have internal state in a coherence-theorem-like setup, the “goals” will be Pareto optimality over multiple utilities, rather than optimality of a single utility function. This says very little about embeddedness or world models or internal architecture; it addresses only one narrow aspect of agent type signatures. And, like the coherence theorems, it doesn’t directly talk about selection; it just says that any strategy which doesn’t fit the Pareto-optimal form is strictly dominated by some other strategy (and therefore we’d expect that other strategy to be selected, all else equal).
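Here is a minimal sketch of what “Pareto optimality over multiple utilities” cashes out to (toy code, my own framing - each strategy gets a tuple of utilities, one per subagent):

```python
# A strategy is non-dominated iff no other strategy is at least as good
# on every utility and strictly better on at least one.

def dominates(u_a, u_b):
    """Does utility-tuple u_a Pareto-dominate u_b?"""
    return all(a >= b for a, b in zip(u_a, u_b)) and \
           any(a > b for a, b in zip(u_a, u_b))

def pareto_frontier(strategies):
    """strategies: {name: (u1, u2, ...)}; keep the non-dominated ones."""
    return {
        name: utils
        for name, utils in strategies.items()
        if not any(dominates(other, utils)
                   for o_name, other in strategies.items() if o_name != name)
    }
```

With strategies `{"A": (3, 1), "B": (2, 2), "C": (1, 1)}`, the frontier keeps A and B; C is dominated. Note there is no single utility function that both A and B maximize - that’s the sense in which the “goal” is Pareto optimality rather than a lone utility.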
Most Selection Theorems, in the short-to-medium term, will probably be like that: they’ll each address just one particular aspect of agent type signatures. That’s fine. As long as the assumptions are general enough and realistic enough, we can use lots of theorems together to narrow down the space of possible types.
Eventually, I do expect that most of the core ideas of Selection Theorems will be unified into a small number of Fundamental Theorems of Agency - perhaps even a single theorem. But that’s not a necessary assumption for the usefulness of this program, and regardless, I expect a lot of progress on theorems addressing specific aspects of agent type signatures before then.
How To Work On Selection Theorems
The most open-ended way to work on the Selection Theorems program is, of course, to come up with new Selection Theorems.
If you’re relatively-new to this sort of work and wondering how one comes up with useful new theorems, here are some possible starting points:
- Study examples of evolved agents to see what kinds of type signatures they develop under what conditions. I recommend coming at this from as many different angles as possible - e.g. ML, economics, and biology - to build intuitions.
- Once you have some intuition or empirical fact from some specific examples or a particular field, try to expand it to more general agents and selection processes.
- Bio example: sessile (i.e. immobile) organisms don’t usually cephalize (i.e. develop brains). Can we turn this into a general theorem about agents?
- Pick a frame, and try to apply it. For example, I’ve been getting a surprising amount of mileage out of the comparative advantage frame lately; it turns out to give some neat variants of the Coherence Theorems.
- Start from agent type signatures - what type signature makes sense intuitively, based on how humans work? What selection processes would give rise to that type signature, and can you prove it?
- Start from selection processes. What type signatures seem intuitively likely to be selected? Can you prove it?
Also, take a look at What’s So Bad About Ad-Hoc Mathematical Definitions? to help build some useful aesthetic intuitions.
Another way to contribute is to start from one or more existing Selection Theorems and improve on them somehow.
Some starting points with examples where I’ve personally found them useful before:
- Take an existing selection theorem, try to apply it to some real-world agency system or some system under selection pressure, see what goes wrong, and fix it. For instance, the subagents idea started from trying (and failing) to apply Coherence Theorems to financial markets.
- Take some existing theorem and strengthen it. For instance, the original logical inductors piece showed the existence of a logical inductor implemented as a market; I extended that to show that any logical inductor is behaviorally equivalent to a market.
- Take some existing theorem with a giant gaping hole and fix the hole. Gooder Regulator was basically that.
A couple other approaches for which I don’t have a great example from my own work, but which I expect to be similarly fruitful:
- Take two existing selection theorems and unify them.
- Take a selection theorem mainly designed for a particular setting (e.g. financial/betting markets) and back out the exact requirements needed to apply it in more general settings.
- Empirical verification, i.e. check that an existing theorem works as expected on some real system. This is most useful when it fails, but success still helps us be sure our theorems aren’t missing anything, and the process of empirical testing forces us to better understand the theorems and their assumptions.
I currently have two follow-up posts planned:
- One post with some existing Selection Theorems, which is already written and should go up later this week. [Edit: post is up.]
- One post on agent type signatures for which I expect/want Selection Theorems - in other words, conjectures. This one is not yet written, and I expect it will go up early next week. [Edit: post is up.]
These are explicitly intended to help people come up with ways to contribute to the Selection Theorems program.
Have you seen Mark’s and my “Agents Over Cartesian World Models”? Though it doesn’t have any Selection Theorems in it, and it focuses only on the type signatures of goals, it does go into a lot of detail about possible type signatures for agents’ goals and what the implications of those type signatures would be, starting from the idea that a goal can be defined on any part of a Cartesian boundary.
Oh excellent, that's a perfect reference for one of the successor posts to this one. You guys do a much better job explaining what agent type signatures are and giving examples and classification, compared to my rather half-baked sketch here.
Thanks! I hope the post is helpful to you or anyone else trying to think about the type signatures of goals. It's definitely a topic I'm pretty interested in.
Not sure how useful this is, but I think this counts as a selection theorem.
(Paper by Caspar Oesterheld, Joar Skalse, James Bell and me)
We played around with taking learning algorithms designed for multi-armed bandit problems (where your action matters but not your policy) and placing them in Newcomblike environments (where both your actual action and your probability distribution over actions matter). Then we proved some stuff about their behaviour.
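As a toy illustration of the kind of environment we mean (this particular setup is invented for this comment, not taken from the paper): the reward depends on the learner’s current action *probabilities*, not just its sampled action.

```python
import random

def newcomblike_reward(action, p_action_1):
    """Toy Newcomblike payoff: a 'predictor' rewards being the kind of
    agent that rarely takes action 1, but action 1 adds a small bonus
    on top of whatever the predictor pays (like two-boxing)."""
    base = 10 * (1 - p_action_1)
    return base + 1 if action == 1 else base

def run_epsilon_greedy(steps=2000, eps=0.1, seed=0):
    """Drop a standard epsilon-greedy bandit learner (which assumes
    rewards depend only on the sampled action) into this environment,
    and return the empirical frequency of action 1."""
    rng = random.Random(seed)
    totals, counts = [0.0, 0.0], [1, 1]   # counts start at 1 to avoid /0
    for _ in range(steps):
        greedy = 0 if totals[0] / counts[0] >= totals[1] / counts[1] else 1
        p1 = eps / 2 + (1 - eps) * (greedy == 1)   # current P(action 1)
        a = rng.randrange(2) if rng.random() < eps else greedy
        totals[a] += newcomblike_reward(a, p1)
        counts[a] += 1
    return (counts[1] - 1) / steps
```

The interesting question (the kind of thing the paper proves results about) is which policies such learners settle into when their own policy feeds back into the rewards.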
That is definitely a selection theorem, and sounds like a really cool one! Well done.
I like this research agenda because it provides a rigorous framing for thinking about inductive biases for agency and gives detailed and actionable advice for making progress on this problem. I think this is one of the most useful research directions in alignment foundations since it is directly applicable to ML-based AI systems.
Planned summary for the Alignment Newsletter:
A few comments...
What are selection theorems helpful for? Three possible areas (not necessarily comprehensive):
Of these, I expect the first to be most important, followed by the last, although this depends on the relative difficulty one expects from inner vs outer alignment, as well as the path-to-AGI.
"Non-dominated" is always (to my knowledge) synonymous with "Pareto optimal", same as the usage in game theory. It varies only to the extent that "pareto optimality of what?" varies; in the case of coherence theorems, it's Pareto optimality with respect to a single utility function over multiple worlds. (Ruling out Dutch books is downstream of that: a Dutch book is a Pareto loss for the agent.)
... I mean, that's a valid argument, though kinda misses the (IMO) more interesting use-cases, like e.g. "if evolution selects for non-dominated agents, then we conclude that evolution selects for agents that can be represented as maximizing expected utility, and therefore humans are selected for maximizing expected utility". Humans fail to have a utility function not because that argument is wrong, but because the implicit assumptions in the existing coherence theorems are too strong to apply to humans. But this is the sort of argument I hope/expect will work for better selection theorems.
(Also, I would like to emphasize here that I think the current coherence theorems have major problems in their implicit assumptions, and these problems are the main reason they fail for real-world agents, especially humans.)
Thanks for this and the response to my other comment, I understand where you're coming from a lot better now. (Really I should have figured it out myself, on the basis of this post.) New summary:
I think that's a reasonable summary as written. Two minor quibbles, which you are welcome to ignore:
I agree with the literal content of this sentence, but I personally don't imagine limiting it to behavioral data. I expect embedding-relevant selection theorems, which would also open the door to using internal structure or low-level dynamics of the brain to learn values (and human models, precision of approximations, etc).
Agents selected by ML (e.g. RL training on games) also often have internal state.
(For the second one, that's one of the reasons why I had the weasel word "could", but on reflection it's worth calling out explicitly given I mention it in the previous sentence.)
Cool, looks good.
The former statement makes sense, but can you elaborate on the latter statement? I suppose I could imagine selection theorems revealing that we really do get alignment by default, but I don't see how they quickly lead to solutions to AI alignment if there is a problem to solve.
The biggest piece (IMO) would be figuring out key properties of human values. If we look at e.g. your sequence on value learning, the main takeaway of the section on ambitious value learning is "we would need more assumptions". (I would also argue we need different assumptions, because some of the currently standard assumptions are wrong - like utility functions.)
That's one thing selection theorems offer: a well-grounded basis for new assumptions for ambitious value learning. (And, as an added bonus, directly bringing selection into the picture means we also have an angle for characterizing how much precision to expect from any approximations.) I consider this the current main bottleneck to progress on outer alignment: we don't even understand what kind-of-thing we're trying to align AI with.
(Side-note: this is also the main value which I think the Natural Abstraction Hypothesis offers: it directly tackles the Pointers Problem, and tells us what the "input variables" are for human values.)
Taking a different angle: if we're concerned about malign inner agents, then selection theorems would potentially offer both (1) tools for characterizing selection pressures under which agents are likely to arise (and what goals/world models those agents are likely to have), and (2) ways to look for inner agents by looking directly at the internals of the trained systems. I consider our inability to do (2) in any robust, generalizable way to be the current main bottleneck to progress on inner alignment: we don't even understand what kind-of-thing we're supposed to look for.
Hm. Suppose sometimes I want to model humans as having propositional beliefs, and other times I want to model humans as having probabilistic beliefs, and still other times I want to model human beliefs as a set of contexts and a transition function. What's stopping me?
I think it depends on the application. What seems like the obvious application is building an AI that models human beliefs, or human preferences. What are some of the desiderata we use when choosing how we want an AI to model us, and how do these compare to typical desiderata used in picking model classes for agents?
I like Savage, so I’ll pick on him. Before you even get into what he considers the “real” desiderata, he wants to say that there’s a set of actions which are functions from states to consequences, and this set is closed under the operation of using one action for some arbitrary states and another action for the rest. But humans very much don’t work that way - I’d want a model of humans to account for complicated, psychology-dependent limitations on what actions we consider taking.
Or if we're thinking about modeling humans to extract the "preferences" part of the model: Suppose Person A wants to get out a function that ranks actions, while Person B wants to learn a utility function, its domain of validity, and a custom world-model that the utility function lives in. What's the model for how something like a selection theorem will help them resolve their differences?
You want a model of humans to account for complicated, psychology-dependent limitations on what actions we consider taking. So: what process produced this complicated psychology? Natural selection. What data structures can represent that complicated psychology? That's a type signature question. Put the two together, and we have a selection-theorem-shaped question.
In the example with persons A and B: a set of selection theorems would offer a solid foundation for the type signature of human preferences. Most likely, person B would use whatever types the theorems suggest, rather than a utility function, but if for some reason they really wanted a utility function they would probably compute it as an approximation, compute the domain of validity of the approximation, etc. For person A, turning the relevant types into an action-ranking would likely work much the same way that turning e.g. a utility function into an action-ranking works - i.e. just compute the utility (or whatever metrics turn out to be relevant) and sort. Regardless, if extracting preferences, both of them would probably want to work internally with the type signatures suggested by the theorems.
We can imagine modeling humans in purely psychological ways with no biological inspiration, so I think you're saying that you want to look at the "natural constraints" on representations / processes, and then in a sense generalize or over-charge those constraints to narrow down model choices?
Basically, yes. Though I would add that narrowing down model choices in some legible way is a necessary step if, for instance, we want to be able to interface with our models in any other way than querying for probabilities over the low-level state of the system.
Right. I think I'm more of the opinion that we'll end up choosing those interfaces via desiderata that apply more directly to the interface (like "we want to be able to compare two models' ratings of the same possible future"), rather than indirect desiderata on "how a practical agent should look" that we keep adding to until an interface pops out.
The problem with that sort of approach is that the system (i.e. agent) being modeled is not necessarily going to play along with whatever desiderata we want. We can't just be like "I want an interface which does X"; if X is not a natural fit for the system, then what pops out will be very misleading/confusing/antihelpful.
An oversimplified example: suppose I have some predictive model, and I want an interface which gives me a point estimate and confidence interval/region rather than a full distribution. That only works well if the distribution isn't multimodal in any important way. If it is importantly multimodal, then any point estimate will be very misleading/confusing/antihelpful.
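As a concrete toy version of that failure mode (example mine):

```python
import statistics

# A sharply bimodal "predictive distribution": half the mass at -10,
# half at +10. The mean sits at 0, in a region with essentially no mass.
samples = [-10.0] * 500 + [10.0] * 500

mean = statistics.mean(samples)   # the requested "point estimate"
# Fraction of the distribution anywhere near the point estimate:
near_mean = sum(abs(x - mean) < 1.0 for x in samples) / len(samples)
# mean == 0.0, near_mean == 0.0: the point estimate describes an
# outcome the model considers essentially impossible.
```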
More generally, the take away here is "we don't get to arbitrarily choose the type signature"; that choice is dependent on properties of the system.
This might be related to the notion that if we try to dictate the form of a model ahead of time (i.e. some of the parameters are labeled "world model" in the code, and others are labeled "preferences", and inference is done by optimizing the latter over the former), but then just train it to minimize error, the actual content of the parameters after training doesn't need to respect our preconceptions. What the model really "wants" to do in the limit of lots of compute is find a way to encode an accurate simulation of the human in the parameters in a way that bypasses the simplifications we're trying to force on it.
For this problem, which might not be what you're talking about, I think a lot of the solution is algorithmic information theory. Trying to specify neat, human-legible parts for your model (despite not being able to train the parts separately) is kind of like choosing a universal Turing machine made of human-legible parts. In the limit of big powerfulness, the Solomonoff inductor will throw off your puny shackles and simulate the world in a highly accurate (and therefore non human-legible) way. The solution is not better shackles, it's an inference method that trades off between model complexity and error in a different way.
(P.S.: I think there is an "obvious" way to do that, and it's MML learning with some time constant used to turn error rates into total discounted error, which can be summed with model complexity.)
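Spelling out that P.S. as a toy formula (my own formulation): with discount factor gamma, a constant per-step error rate e contributes e/(1-gamma) total discounted error, which trades off directly against description length.

```python
def mml_score(complexity_bits, error_rate, gamma=0.99):
    """Lower is better. gamma encodes the time constant: a constant
    per-step error_rate contributes error_rate / (1 - gamma) total
    discounted error, added to the model's description length."""
    return complexity_bits + error_rate / (1 - gamma)
```

With gamma = 0.99 the effective horizon is ~100 steps, so a 1000-bit legible model with 1 bit/step of error scores 1100 and beats a perfectly accurate 10^6-bit brute simulation - i.e. the trade-off, not better shackles, is what keeps the model legible.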
Just posted an analysis of the epistemic strategies underlying selection theorems and their applications. Might be interesting for people who want to go further with selection theorem, either by proving one or by critiquing one.
Interesting. Selection theorems seem like a way of identifying the purpose or source of the goal-directedness in agents - something that seems obvious to us yet is hard to pin down. Compare also the ground of optimization.