Review

Why Subagents?

24johnswentworth

17Scott Garrabrant

13Thomas Kwa

7Morgan Rogers

6Oliver Habryka

5Daniel Kokotajlo

9johnswentworth

5Kaj Sotala

4romeostevensit

5johnswentworth

3Raymond Arnold

5Ben Pace


The type signature of goals is the overarching topic to which this post contributes. It can manifest in a lot of different ways in specific applications:

- What's the type signature of human values?
- What structure types should systems biologists or microscope AI researchers look for in supposedly-goal-oriented biological or ML systems?
- Will AI be "goal-oriented", and what would be the type signature of its "goal"?

If we want to "align AI with human values", build ML interpretability tools, etc, then that's going to be pretty tough if we don't even know what-kind-of-thing we're looking for. When we don't know what-kind-of-things we're talking about, our analysis risks being completely confused - like trying to subtract 3 from "cookie", or measure the angular momentum of life satisfaction.

The traditional go-to answer for these type signatures is "expected utility over world-states": we have a "utility function" mapping world-states (inputs) to real numbers (outputs), and we average utility values over some distribution on world-states.

This go-to answer feels confused, in multiple ways, both in theory and in practice. On the theory side, this review outlines some problems. On the practice side, "Why Subagents?" mentions that markets are not expected utility maximizers, Kaj's sequence talks about many of the ways in which humans seem to be made of subagents rather than one monolithic utility maximizer, and of course there are also other problems which aren't about subagents.

What are the inputs to human values, and what are the outputs? That's a reasonable formulation of the type-signature question, in the context of human values.

I consider the "utility function on world-states" answer confused on both parts of the question - inputs and outputs. Subagents address half of that problem: the outputs half. "Why Subagents?" argues that the outputs should be, not one real number, but a set of real numbers, representing the utilities of subagents.

(The other half of the problem - inputs to values - I consider harder, and it's the inputs part which is involved in the most "conceptually difficult" problems of alignment. My current best answer is that the inputs to human values are latent variables in a human's world-model. This provides a clean and intuitive formalization of hairy conceptual/philosophical problems in alignment; see here for more on that.)

There are still a lot of dangling threads. On the theory side, "Why Subagents?" only talks about deterministic preferences, which is dramatically easier than the probabilistic version we really care about. Ideally, we'd like a coherence theorem. On the empirical side, we'd really like to know if there are subagents embedded in e.g. trained neural networks or bacteria. These empirical investigations will eventually be the real test, but we probably need more work on the theory side before we're ready for the empirical component. Subagents are only half of the type signature, and it's a coherence theorem (or something analogous) which would tell us *how* to look for these structures embedded in real-world systems.

Not sure if you've seen it, but this paper by Critch and Russell might be relevant when you start thinking about uncertainty.

Note that the particular form of "nonexistence of a representative agent" John mentions is an original result that's not too difficult to show informally, but hasn't really been written down formally either here or in the economics literature.

Ryan Kidd and I did an economics literature review a few weeks ago for representative agent stuff, and couldn't find any results general enough to be meaningful. We did find one paper that proved a market's utility function couldn't be of a certain restricted form, but nothing about proving the lack of a coherent utility function in general. A bounty also hasn't found any such papers.

The example you give has a pretty simple lattice of preferences, which lends itself to illustrations but which might create some misconceptions about how the subagent model should be formalized. For example, in your example you assume that the agents' preferences are orthogonal (one cares about pepperoni, the other about mushrooms, and each is indifferent to the opposite direction), the agents have equal weighting in the decision-making, the lattice is distributive... Compensating for these factors, there are many ways that a given 'weak utility' can be expressed in terms of subagents. I'm sure there are optimization questions that follow here, about the minimum number of subagents (dimensions) needed to embed a given weak-utility function (partially ordered set), and about when reasonable constraints such as orthogonality of subagents can be imposed. There are also composition questions: how does a committee of agents with subagents behave?

This post felt like it took a problem that I was thinking about from 3 different perspectives and combined them in a way that felt pretty coherent, though I am not fully sure how right it gets it. Concretely, the 3 domains I felt it touched on were:

- How much can you model human minds as consisting of subagents?
- How much can problems with coherence theorems be addressed by modeling things as subagents?
- How much will AI systems behave as if they consist of multiple subagents?

All three of these feel pretty important to me.

Thanks, this is really cool!

I'm a bit concerned about this sort of thing: "The subagents argument offers a theoretical basis for the idea that humans have lots of internal subagents, with competing wants and needs, all constantly negotiating with each other to decide on externally-visible behavior."

A worry I have about the standard representation theorems is that they prove too much; if everything can be represented as having a utility function, then maybe it's not so useful to talk about utility functions. Similarly now I worry: I thought when people talked about subagent theories of mind, they meant something substantial by this--not merely that the mind has incomplete (though still acyclic) preferences!

Not sure if you've ever taken a class on electricity & magnetism, but one of the central notions is the conservative vector field - electric fields being the standard example. You take an electron, and drag it around the electric field. Sometimes you'll have to push it against the field, sometimes the field will push it along for you. You add up all the energy spent pushing (or energy extracted when the field pushes it for you), and find an interesting result: the energy spent moving the electron from point A to point B is completely independent of the path taken. Any two paths from A to B will require exactly the same energy expenditure.

That's a pretty serious constraint on the field - the vast majority of possible vector fields are not conservative.

It's also exactly the same constraint as a utility function: a vector field is conservative if-and-only-if it is acyclic, in the sense of having zero circulation around any closed curve. Indeed, this means that conservative vector fields can be viewed as utility functions: the field itself is the gradient of a "utility function" (called the potential field), and it accepts any local "trade" which increases utility - i.e. moving an electron up the gradient of the utility function. Conversely, if we have preferences represented by local preferences in a (finite-dimensional) vector space, then we can summarize those preferences with a utility function if-and-only-if the field is conservative.
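In symbols, as a rough sketch in standard vector-calculus notation (the symbols $\mathbf{F}$ for the field, $U$ for the "utility" potential, and $C$ for a closed curve are my own labels, and the sign convention follows the "field is the gradient of utility" reading above rather than the usual physics convention of a negative gradient):

```latex
\oint_C \mathbf{F} \cdot d\mathbf{r} = 0 \ \text{ for every closed curve } C
\quad\Longleftrightarrow\quad
\mathbf{F} = \nabla U \ \text{ for some scalar function } U,
\qquad\text{in which case}\qquad
\int_A^B \mathbf{F} \cdot d\mathbf{r} = U(B) - U(A).
```

The last equality is the path-independence statement: the "energy" of moving from A to B depends only on the endpoints.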

My point is: acyclicity is a major constraint on a system's behavior. It is definitely not the case that "everything can be represented as having a utility function".

Now, there is a separate piece to your concern: when people talk about subagent theories of mind, they think that the brain is actually *implemented* using subagents, not merely behaving in a manner equivalent to having subagents. It's a variant of the behavior vs architecture question. In this case, we can partially answer the question: subagent architectures have a relative advantage over *most* non-subagent architectures in that the subagent architectures won't throw away resources via cyclic preferences, whereas *most* of the non-subagent architectures will. The only non-subagent architectures which don't throw away resources are those whose behavior just so happens to be equivalent to subagents.

If a system with a subagent architecture is evolving, then it will mostly be exploring different configurations of subagents - so any configuration it explores will at least not throw away resources. On the other hand, with a non-subagent architecture, we'd expect that there's some surface in configuration space which happens to implement agent-like behavior, and any changes which move off that surface will throw away at least some resources - and any single-nucleotide change is likely to move off the surface. In other words, a subagent architecture is more likely to have a nice evolutionary path from wherever it starts to the maximum-fitness design, whereas a non-subagent architecture may not have such a smooth path. As an evolutionary analogue to the behavior vs architecture question, I'd conjecture: subagent-like behavior generally won't evolve without subagent-like architecture, because it's so much easier to explore efficient designs within a subagent architecture.

The "many decisions can be thought of as a committee requiring unanimous agreement" model felt intuitively right to me, and afterwards I've observed myself behaving in ways which seem compatible with it, and thought of this post.

One thing I don't understand about cycles is that they seem fine as long as you have a generalized cycle detector, and a single instance of a cycle getting generated is fine because the losses from one (or a few) rounds are small. I guess people normally think of utility functions as fixed, but this sort of rolls fixed point/convergence intuitions into the problem formulation.

One frame is that utility functions as a formalism are just an extension of the great rationality debate.

If we have a cycle detector which prevents cycling, then we don't have true cycles. Indeed, that would be an example of a system with internal state: the externally-visible state looks like it cycles, but the full state never does - the state of the cycle detector changes.

So this post, as applied to cycle-detectors, says: any system which detects cycles and prevents further cycling can be represented by a committee of utility-maximizing agents.

The justification for modelling real-world systems as “agents” - i.e. choosing actions to maximize some utility function - usually rests on various coherence theorems. They say things like “either the system’s behavior maximizes some utility function, or it is throwing away resources” or “either the system’s behavior maximizes some utility function, or it can be exploited” or things like that. Different theorems use slightly different assumptions and prove slightly different things, e.g. deterministic vs probabilistic utility function, unique vs non-unique utility function, whether the agent can ignore a possible action, etc.

One theme in these theorems is how they handle “incomplete preferences”: situations where an agent does not prefer one world-state over another. For instance, imagine an agent which prefers pepperoni over mushroom pizza when it has pepperoni, but mushroom over pepperoni when it has mushroom; it’s simply never willing to trade in either direction. There’s nothing inherently “wrong” with this; the agent is not necessarily executing a dominated strategy, cannot necessarily be exploited, or any of the other bad things we associate with inconsistent preferences. But the preferences can’t be described by a utility function over pizza toppings.

In this post, we’ll see that these kinds of preferences are very naturally described using subagents. In particular, when preferences are allowed to be path-dependent, subagents are important for representing consistent preferences. This gives a theoretical grounding for multi-agent models of human cognition.

## Preference Representation and Weak Utility

Let’s expand our pizza example. We’ll consider an agent who:

We can represent this using a directed graph:

The arrows show preference: our agent prefers B over A if (and only if) there is a directed path from A to B along the arrows. There is no path from pepperoni to mushroom or from mushroom to pepperoni, so the agent has no preference between them. In this case, we’re interpreting “no preference” as “agent prefers to keep whatever they have already”. Note that this is NOT the same as “the agent is indifferent”, in which case the agent is willing to switch back and forth between the two options as long as the switch doesn’t cost anything.

Key point: there is no cycle in this graph. If the agent’s preferences are cyclic, that’s when they provably throw away resources, paying to go in circles. As long as the preferences are acyclic, we call them “consistent”.
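To make "paying to go in circles" concrete, here is a minimal sketch of a money pump. Everything in it is an illustrative assumption rather than part of the original argument: the items `A`, `B`, `C`, the flat `fee` per trade, and the helper name `run_money_pump`. The agent accepts any offered swap to an item it strictly prefers and pays the fee each time.

```python
def run_money_pump(prefers, start, offers, fee=1):
    """Simulate an agent that pays `fee` for every swap it accepts.

    prefers: set of (worse, better) pairs; the agent accepts a swap
             from `worse` to `better` whenever that pair is in the set.
    start:   the item the agent initially holds.
    offers:  sequence of items offered to the agent, one at a time.
    Returns (final_holding, total_paid).
    """
    holding, paid = start, 0
    for offered in offers:
        if (holding, offered) in prefers:
            holding = offered
            paid += fee
    return holding, paid


# Hypothetical cyclic preferences: A < B < C < A.
cyclic = {("A", "B"), ("B", "C"), ("C", "A")}
# Hypothetical acyclic preferences: A < B < C.
acyclic = {("A", "B"), ("B", "C"), ("A", "C")}

offers = ["B", "C", "A"] * 2  # the same offers, presented twice

cyclic_result = run_money_pump(cyclic, "A", offers)
acyclic_result = run_money_pump(acyclic, "A", offers)
```

Under these assumptions the cyclic agent ends up holding the item it started with after paying for every single offer, while the acyclic agent stops paying once it reaches its most-preferred item.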

Now, at this point we can still define a “weak” utility function by ignoring the “missing” preference between pepperoni and mushroom. Here’s the idea: a normal utility function says “the agent always prefers the option with higher utility”. A weak utility function says: “*if* the agent has a preference, then they always prefer the option with higher utility”. The missing preference means we can’t build a normal utility function, but we can still build a weak utility function. Here’s how: since our graph has no cycles, we can always order the nodes so that the arrows only go forward along the sorted nodes - a technique called topological sorting. Each node’s position in the topological sort order is its utility. A small tweak to this method also handles indifference.

(Note: I’m using the term “weak utility” here because it seems natural; I don’t know of any standard term for this in the literature. Most people don’t distinguish between these two interpretations of utility.)
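The topological-sort construction can be sketched in a few lines of Python using the standard-library `graphlib` module (Python 3.9+). The four-node graph below (`plain`, `pepperoni`, `mushroom`, `both`) is my guess at a reasonable pizza example, not necessarily the exact graph in the figure, and `weak_utility` is a hypothetical helper name.

```python
from graphlib import TopologicalSorter

# Assumed preference arrows, written as (worse, better).
EDGES = [
    ("plain", "pepperoni"),
    ("plain", "mushroom"),
    ("pepperoni", "both"),
    ("mushroom", "both"),
]


def weak_utility(edges):
    """Assign each node its position in some topological order.

    Raises graphlib.CycleError if the preferences are cyclic,
    i.e. if no weak utility function exists.
    """
    # TopologicalSorter expects a mapping: node -> set of predecessors.
    predecessors = {}
    for worse, better in edges:
        predecessors.setdefault(worse, set())
        predecessors.setdefault(better, set()).add(worse)
    order = TopologicalSorter(predecessors).static_order()
    return {node: rank for rank, node in enumerate(order)}


utility = weak_utility(EDGES)
```

Every arrow points from lower to higher utility, but which of `pepperoni` and `mushroom` gets the larger number is an artifact of the sort order, not a fact about the agent.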

When preferences are incomplete, there are multiple possible weak utility functions. For instance, in our example, the topological sort order shown above gives pepperoni utility 1 and mushroom utility 2. But we could just as easily swap them!

## Preference By Committee

The problem with the weak utility approach is that it treats the preference between pepperoni and mushroom as unknown - depending on which possible utility we pick, it could go either way. It’s pretending that there’s some hidden preference there which we simply don’t know. But there are real systems where the preference is not merely unknown, but a real preference to stay in the current state.

For example, maybe our pizza-agent is actually a committee which must unanimously agree to any proposed change. One member prefers pepperoni to no pepperoni, regardless of mushrooms; the other prefers mushrooms to no mushrooms, regardless of pepperoni. This committee is not exploitable and does not throw away resources, nor does it have any hidden preference between pepperoni and mushrooms. Viewed as a black box, its “true” preference between pepperoni and mushrooms is to keep whichever it currently has.

In fact, it turns out that we can represent *any* consistent preferences by a committee requiring unanimous agreement.

The key idea here is called order dimension. We want to take our directed acyclic graph of preferences, and stick it into a multidimensional space so that there is an arrow from A to B if-and-only-if B is higher along *all* dimensions. Each dimension represents the utility of one subagent on the committee; that subagent approves a change only if the change does not decrease the subagent’s utility. In order for the whole committee to approve a change, the trade must increase (or leave unchanged) the utilities of all subagents. The minimum number of agents required to make this work - the minimum number of dimensions required - is the order dimension of the graph.

For instance, our pizza example has order dimension 2. We can draw it in a 2-dimensional space like this:

Note that, if there are infinitely many possibilities, then the order dimension can be infinite - we may need infinitely many agents to represent some preferences. But as long as the possibilities are finite, the order dimension will be as well.
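As a rough illustration of the committee representation, the sketch below embeds an assumed four-option pizza example in two dimensions, one per subagent. The coordinates in `SCORES` and the function name `committee_accepts` are illustrative choices of mine. I also assume a change must strictly benefit at least one member, so that a swap leaving everyone exactly as well off is not counted as a preference.

```python
# Each option is scored by two hypothetical subagents:
# (pepperoni-lover's utility, mushroom-lover's utility).
SCORES = {
    "plain": (0, 0),
    "pepperoni": (1, 0),
    "mushroom": (0, 1),
    "both": (1, 1),
}


def committee_accepts(current, proposed, scores=SCORES):
    """Unanimous-agreement rule: no member loses, and at least one gains."""
    old, new = scores[current], scores[proposed]
    nobody_loses = all(n >= o for o, n in zip(old, new))
    somebody_gains = any(n > o for o, n in zip(old, new))
    return nobody_loses and somebody_gains
```

With these scores the committee moves from `plain` to either topping, and from either topping to `both`, but refuses to swap `pepperoni` for `mushroom` in either direction: it keeps whichever it currently has.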

## Path-Dependence

So far, we’ve interpreted “missing” preferences as “agent prefers to stay in current state”. One important reason for that interpretation is that it’s exactly what we need in order to handle path-dependent preferences.

In practice, path-dependent preferences mostly matter for systems with “hidden state”: internal variables which can change in response to the system’s choices. A great example of this is financial markets: they’re the ur-example of efficiency and inexploitability, yet it turns out that a market does not have a utility function in general (economists call this “nonexistence of a representative agent”). The reason is that the distribution of wealth across the market’s agents functions as an internal hidden variable. Depending on what path the market follows, different internal agents end up with different amounts of wealth, and the market as a whole will hold different portfolios as a result - even if the externally-visible variables, i.e. prices, end up the same.

Most path-dependence results from some hidden state directly, but even if we don’t know the hidden state, we can always add hidden state in order to model path-dependence. Whenever future preferences differ based on how the system reached the current state, we just split the state into two states - one for each possibility. Then we repeat, until we have a full set of states with path-independent preferences between them. These new states are “full” states of the system; from outside, some of them look the same.

An example: suppose I prefer New York to Boston if I just came from DC, but Boston to New York if I just came from Philadelphia.

We can represent that with hidden state:

We now have two separate hidden internal nodes, which both correspond to the same externally-visible state “New York”.
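One way to picture the state-splitting is to make the hidden variable an explicit field. In this sketch the type `FullState`, the table `PREFERRED_CITY`, and the function `accepts_move` are all hypothetical names, and I assume the only hidden state is the city the traveller just came from.

```python
from typing import NamedTuple


class FullState(NamedTuple):
    city: str       # externally visible state
    came_from: str  # hidden state


# Assumed encoding of the example: which city the traveller prefers,
# given where they just came from.
PREFERRED_CITY = {"DC": "New York", "Philadelphia": "Boston"}


def accepts_move(state: FullState, destination: str) -> bool:
    """Accept a move only if it leads to the city preferred under the hidden state."""
    return (
        destination != state.city
        and PREFERRED_CITY[state.came_from] == destination
    )


ny_from_dc = FullState("New York", "DC")
ny_from_philly = FullState("New York", "Philadelphia")
```

The two full states look identical from outside (`city == "New York"`), yet only the one reached from Philadelphia accepts a move to Boston.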

Now the key piece: there is no way to get to the “New York (from Philly)” node directly from the “New York (from DC)” node. The agent does not, and cannot, have a preference between these two nodes. Analogously, a market cannot have a preference between two different wealth distributions - the subagents who comprise a market will never spontaneously decide to redistribute their wealth amongst themselves. They always “prefer” (or “decide”) to stay in whatever state they’re currently in.

This is why we need to understand incomplete preferences in order to handle path-dependent preferences: hidden state creates situations where the agent “prefers” to stay in whatever state they’re in.

Now we can easily model the system using subagents exactly as we did for incomplete preferences. We have a directed preference graph between full states (including hidden state), it needs to be acyclic to avoid throwing away resources, so we can find a set of subagents to represent the preferences. In the case of a market, this is just the subagents which comprise the market: they’ll take a trade if it does not decrease the utility of any subagent. (Note, however, that the same externally-visible trade can correspond to multiple possible internal state changes; the subagents will take the trade if any of the possible internal state changes are non-utility-decreasing for all of them. For a market, this means they can trade amongst themselves in response to the external trade in order to make everyone happy.)
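The parenthetical about multiple internal state changes can be sketched as follows. This is a toy example under my own assumptions: two traders named Alice and Bob, two goods `a` and `b`, linear utilities, and a hypothetical helper `first_acceptable`. The external trade is "the market gives up one unit of `a` and receives one unit of `b`"; internally, either trader could be the one who makes the swap.

```python
def first_acceptable(current, candidates, utilities):
    """Return the first candidate internal state that no subagent objects to.

    A subagent objects if its utility would decrease. Returns None if
    every candidate is vetoed, i.e. the external trade is refused.
    """
    for candidate in candidates:
        if all(u(candidate) >= u(current) for u in utilities):
            return candidate
    return None


# Internal state: each trader's holdings as (units of a, units of b).
def alice_utility(state):
    a, b = state["alice"]
    return 1 * a + 3 * b  # assumed: Alice values b more than a


def bob_utility(state):
    a, b = state["bob"]
    return 3 * a + 1 * b  # assumed: Bob values a more than b


current = {"alice": (1, 0), "bob": (1, 0)}
bob_trades = {"alice": (1, 0), "bob": (0, 1)}    # Bob would lose utility
alice_trades = {"alice": (0, 1), "bob": (1, 0)}  # Alice gains, Bob unaffected

committee = [alice_utility, bob_utility]
chosen = first_acceptable(current, [bob_trades, alice_trades], committee)
```

Here the market accepts the external trade by routing it through Alice; if Bob were the only possible counterparty, the same external trade would be refused.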

## Applications & Speculations

We’ve just argued that a system with consistent preferences can be modelled as a committee of utility-maximizing agents. How does this change our interpretation and predictions of the world?

First and foremost: the subagents argument is a generalization of the standard acyclic preferences argument. Anytime we might want to use the acyclic preferences argument, but there’s no reason for the system to be path-independent, we can apply the subagents argument instead. In practice, we usually expect systems to be efficient/inexploitable because of some selection pressure (evolution, market competition, etc) - and that selection pressure usually doesn’t care about path dependence in and of itself.

Main takeaway: pretty much anywhere we’d use an agent with a utility function to model something, we can apply the subagents argument and use a committee of agents with utility functions instead. In particular, this is a good replacement for "weak" utility functions.

Humans are a particularly interesting example. We’d normally use the acyclic preferences argument (among other arguments) to argue that humans approximate utility-maximizers in most situations. But there’s no particular reason to assume path-independence; indeed, human behavior looks highly path-dependent. So, apply the subagents argument. Hypothesis: human behavior approximates the choices of a committee of utility-maximizing agents in most situations.

Sound familiar? The subagents argument offers a theoretical basis for the idea that humans have lots of internal subagents, with competing wants and needs, all constantly negotiating with each other to decide on externally-visible behavior.

In principle, we could test this hypothesis more rigorously. Lots of people think of AI “learning what humans want” by asking questions or offering choices or running simulations. Personally, I picture an AI taking in a scan of a full human connectome, then directly calculating the embedded preferences. Someday, this will be possible. When the AI solves those equations, do we expect it to find a single generic optimizer embedded in the system, approximately optimizing some “utility”? Or do we expect to find a bunch of separate generic optimizers, approximately optimizing several different “utilities”, and negotiating with each other? Probably neither picture is complete yet, but I’d bet the second is much closer to reality.

## Conclusion

Let’s recap:

*any* system with deterministic, efficient/inexploitable preferences can be represented by a committee of utility-maximizing agents - even if the system has path-dependent or incomplete preferences.

One big piece which we haven’t touched at all is uncertainty. An obvious generalization of the subagents argument is that, once we add uncertainty (and a notion of efficiency/inexploitability which accounts for it), an efficient/inexploitable path-dependent system can be represented by a committee of Bayesian utility maximizers. I haven’t even started to tackle that conjecture yet; it’s a wide-open problem.