Two things I'd especially like to highlight in this post:
Fundamentally, structural constraints give us back some of the guarantees of the main epistemic strategies of Science and Engineering that get lost in alignment: we don’t have the technology yet, but we have some ideas of how it will work.
This is possibly the best one-sentence summary I've seen of how these sorts of theorems would be useful.
One corollary of recovering (some of) the usual science-and-engineering strategies is that selection theorems would open the door to a lot of empirical work on alignment and agency. Thus the importance of this section:
Proving a structural selection theorem:
1. Choose a selection mechanism to investigate.
2. Find a structural constraint that should be favored by the mechanism.
3. Prove the theorem, via one of:
   - Show that agents with these structural constraints are easier to find.
     - To refute: show that many agents without the structural constraints can be easily found by the selection pressure.
   - Show that agents with structural constraints are a majority.
     - To refute: show that there isn't a majority of selected-for agents with structural constraints.
   - Show that agents with structural constraints are easier to sample.
     - To refute: argue that the set of selected-for agents is different than the one used in the work, and that for the actual set, sampling agents without structural constraints becomes simpler.
   - Propose a sampling of agents and show it results in structural constraints with high probability.
     - To refute: show that the proposed sampling disagrees with what the selection pressure actually finds (showing that the probabilities are different, or that one can sample agents that the other can't).

Checking that the selection theorem applies:
1. Check that the selection exists.
2. For a mechanism, check that it fits with how selection happens.
   - To refute: show that the actual selection works differently than the mechanism described, and that these differences massively influence what is selected in the end.
These are all potential ways to empirically test various kinds of selection theorems.
Cool, looks good.
I think that's a reasonable summary as written. Two minor quibbles, which you are welcome to ignore:
Selection theorems are helpful because (1) they can provide additional assumptions that can help with learning values by observing human behavior
I agree with the literal content of this sentence, but I personally don't imagine limiting it to behavioral data. I expect embedding-relevant selection theorems, which would also open the door to using internal structure or low-level dynamics of the brain to learn values (and human models, precision of approximations, etc).
Unfortunately, many coherence arguments implicitly assume that the agent has no internal state, which is not true for humans, so this argument does not clearly work. As another example, our ML training procedures will likely also select for agents that don’t waste resources, which could allow us to conclude that the resulting agents can be represented as maximizing expected utility.
Agents selected by ML (e.g. RL training on games) also often have internal state.
The biggest piece (IMO) would be figuring out key properties of human values. If we look at e.g. your sequence on value learning, the main takeaway of the section on ambitious value learning is "we would need more assumptions". (I would also argue we need different assumptions, because some of the currently standard assumptions are wrong - like utility functions.)
That's one thing selection theorems offer: a well-grounded basis for new assumptions for ambitious value learning. (And, as an added bonus, directly bringing selection into the picture means we also have an angle for characterizing how much precision to expect from any approximations.) I consider this the current main bottleneck to progress on outer alignment: we don't even understand what kind-of-thing we're trying to align AI with.
(Side-note: this is also the main value which I think the Natural Abstraction Hypothesis offers: it directly tackles the Pointers Problem, and tells us what the "input variables" are for human values.)
Taking a different angle: if we're concerned about malign inner agents, then selection theorems would potentially offer both (1) tools for characterizing selection pressures under which agents are likely to arise (and what goals/world models those agents are likely to have), and (2) ways to look for inner agents by looking directly at the internals of the trained systems. I consider our inability to do (2) in any robust, generalizable way to be the current main bottleneck to progress on inner alignment: we don't even understand what kind-of-thing we're supposed to look for.
A few comments...
Selection theorems are helpful because they tell us likely properties of the agents we build.
What are selection theorems helpful for? Three possible areas (not necessarily comprehensive):
Of these, I expect the first to be most important, followed by the last, although this depends on the relative difficulty one expects from inner vs outer alignment, as well as the path-to-AGI.
(What does it mean to be non-dominated? This can vary, but one example would be that the agent is not subject to Dutch books, i.e. situations in which it is guaranteed to lose money.)
"Non-dominated" is always (to my knowledge) synonymous with "Pareto optimal", same as the usage in game theory. It varies only to the extent that "pareto optimality of what?" varies; in the case of coherence theorems, it's Pareto optimality with respect to a single utility function over multiple worlds. (Ruling out Dutch books is downstream of that: a Dutch book is a Pareto loss for the agent.)
If you combine this with the very reasonable assumption that we will tend to build non-dominated agents, then we can conclude that we select for agents that can be represented as maximizing expected utility.
... I mean, that's a valid argument, though kinda misses the (IMO) more interesting use-cases, like e.g. "if evolution selects for non-dominated agents, then we conclude that evolution selects for agents that can be represented as maximizing expected utility, and therefore humans are selected for maximizing expected utility". Humans fail to have a utility function not because that argument is wrong, but because the implicit assumptions in the existing coherence theorems are too strong to apply to humans. But this is the sort of argument I hope/expect will work for better selection theorems.
(Also, I would like to emphasize here that I think the current coherence theorems have major problems in their implicit assumptions, and these problems are the main reason they fail for real-world agents, especially humans.)
The problem with that sort of approach is that the system (i.e. agent) being modeled is not necessarily going to play along with whatever desiderata we want. We can't just be like "I want an interface which does X"; if X is not a natural fit for the system, then what pops out will be very misleading/confusing/antihelpful.
An oversimplified example: suppose I have some predictive model, and I want an interface which gives me a point estimate and confidence interval/region rather than a full distribution. That only works well if the distribution isn't multimodal in any important way. If it is importantly multimodal, then any point estimate will be very misleading/confusing/antihelpful.
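A minimal numerical sketch of that failure mode (my construction, with made-up parameters): for a bimodal predictive distribution with modes at -3 and +3, the natural point estimate (the mean) lands near 0, a region carrying almost none of the probability mass.

```python
import numpy as np

rng = np.random.default_rng(0)

# Bimodal predictive distribution: half the mass near -3, half near +3.
samples = np.concatenate([
    rng.normal(-3.0, 0.5, 10_000),
    rng.normal(+3.0, 0.5, 10_000),
])

point_estimate = samples.mean()  # lands near 0, between the modes
# Fraction of the distribution's mass within +/-1 of the point estimate:
mass_near_estimate = np.mean(np.abs(samples - point_estimate) < 1.0)

print(round(point_estimate, 2))     # roughly 0
print(round(mass_near_estimate, 4)) # tiny: the summary misses both modes
```

Any symmetric confidence interval around that estimate is equally misleading: to cover the actual mass it must span both modes, including the near-empty region between them.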
More generally, the takeaway here is "we don't get to arbitrarily choose the type signature"; that choice is dependent on properties of the system.
Oh excellent, that's a perfect reference for one of the successor posts to this one. You guys do a much better job explaining what agent type signatures are and giving examples and classification, compared to my rather half-baked sketch here.
Basically, yes. Though I would add that narrowing down model choices in some legible way is a necessary step if, for instance, we want to be able to interface with our models in any other way than querying for probabilities over the low-level state of the system.
You want a model of humans to account for complicated, psychology-dependent limitations on what actions we consider taking. So: what process produced this complicated psychology? Natural selection. What data structures can represent that complicated psychology? That's a type signature question. Put the two together, and we have a selection-theorem-shaped question.
In the example with persons A and B: a set of selection theorems would offer a solid foundation for the type signature of human preferences. Most likely, person B would use whatever types the theorems suggest, rather than a utility function, but if for some reason they really wanted a utility function they would probably compute it as an approximation, compute the domain of validity of the approximation, etc. For person A, turning the relevant types into an action-ranking would likely work much the same way that turning e.g. a utility function into an action-ranking works - i.e. just compute the utility (or whatever metrics turn out to be relevant) and sort. Regardless, if extracting preferences, both of them would probably want to work internally with the type signatures suggested by the theorems.
The #P-complete problem is to calculate the distribution of some variables in a Bayes net given some other variables in the Bayes net, without any particular restrictions on the net or on the variables chosen.
Formal statement of the Telephone Theorem: We have a sequence of Markov blankets forming a Markov chain M1→M2→.... Then in the limit n→∞, fn(Mn) mediates the interaction between M1 and Mn (i.e. the distribution factors according to M1→fn(Mn)→Mn), for some fn satisfying

fn+1(Mn+1) = fn(Mn)

with probability 1 in the limit.
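A toy numerical illustration of the flavor of this result (my own construction, not the theorem or its proof): pass a 2-bit message down a chain where bit A goes through a noisy channel at each step while bit B is copied perfectly. The mutual information carried by bit A washes out as the chain gets long; only the exactly-conserved quantity (bit B, playing the role of fn(Mn)) survives in the limit.

```python
import math

def h2(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def mutual_info_after(n_steps, flip=0.1):
    """I(A_0; A_n) for a uniform bit passed through n composed binary
    symmetric channels, each flipping the bit with probability `flip`."""
    # Effective flip probability after n independent noisy steps:
    p_n = (1 - (1 - 2 * flip) ** n_steps) / 2
    return 1 - h2(p_n)

for n in (1, 10, 100):
    print(n, mutual_info_after(n))
# Bit A's information decays toward 0 with chain length, while the
# perfectly-copied bit B carries exactly 1 bit at every step.
```

The theorem's claim is much stronger than this example (it says essentially all long-range information takes the conserved-quantity form), but the example shows the basic mechanism: noisy channels destroy everything except what is carried deterministically.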