This post examines the epistemic strategies of John Wentworth’s selection theorem posts.
(If you want to skim this post, just read the Summary subsections that display the different epistemic strategies as design patterns.)
I introduced the concept in a recent post, but didn’t define it except as the “ways of producing” knowledge that are used in a piece of research. If we consider a post or paper as a computer program outputting (producing) knowledge about alignment, epistemic strategies are the underlying algorithm or, even more abstractly, the design patterns.
An example of epistemic strategy, common in natural sciences (and beyond), is
More than just laying out some abstract recipe, analysis serves to understand how each step is done, whether that makes sense, and how each step (and the whole strategy) might fail. Just like a design pattern or an algorithm, it matters tremendously to know when to apply it and when to avoid it as well as subtleties to be aware of.
Laying this underlying structure bare matters in three ways:
Now, before starting, I need to point out that the selection theorems posts don’t present research results; they present epistemic strategies (the eponymous theorems). Does that mean my job has already been done? Not exactly: John’s posts do present that epistemic strategy, but not in all the ways I want to stress. John is also trying to fill in a lot of concrete details and to convince people that selection theorems are a nice thing to research, which I don’t have to do. Instead, you can see this post as distilling the structure of selection theorems and interrogating them further as ways of producing knowledge.
(I use the word “agent” to stay coherent with John, but nothing in the epistemic strategy itself requires agency, and so finding the idea of agents confusing shouldn’t be an issue for reading this post)
Thanks to John Wentworth for feedback on a draft of this post.
Selection theorems are theorems. Obviously. But what sort of theorems? What are they trying to find about the world?
John summarizes the whole class of results in the following way:
Roughly speaking, a Selection Theorem tells us something about what agent type signatures will be selected for (by e.g. natural selection or ML training or economic profitability) in some broad class of environments.
This gives us three components of a selection theorem: the selection pressure, the class of environments considered and the constraint on the agent (what John calls the “type signatures”). Let’s get into each, looking for what can fill the corresponding hole in the general selection theorem.
A selection theorem is first and foremost about selection. Not just selection mechanisms (low-level processes like natural selection) but also selection criteria (abstract conditions like no Dutch-booking). The former state how selection happens, whereas the latter just characterize the sort of things that will be selected.
One of the differences is that a selection mechanism implies a selection criterion, either implicitly (natural selection) or explicitly (ML training with an actual loss function); whereas a selection criterion doesn’t necessarily come with a mechanism.
Still, both mechanisms and criteria come in a wide variety -- what makes a good one for a selection theorem? Making the selection theorem applicable to the real world situation we care about. The next section focuses on this topic of application, but in summary: mechanisms must fit actual selection processes in the situation, whereas criteria must come with an explanation of why they would be instantiated (possibly a corresponding selection mechanism, but not necessarily).
It’s also less obvious what makes a “good” criterion, because of the risk of assuming, in the selection criterion itself, the very constraint we want to show.
I find John’s formulation above unfortunate, because it doesn’t stress enough how the “broad class of environments” is part of the hypothesis of a selection theorem, not the conclusion. The intuition here is that we need enough variety to instantiate the selection pressure or criterion. Selection lets you force the agent’s hand, but only if you can instantiate the situations you need.
For a selection mechanism, this amounts to containing the sort of situations where the mechanism will push in the right direction and be strong enough (for example predation pushing natural selection forward). For a selection criterion, it is about including the situations that take advantage of every suboptimality in the agent (like the exploitative bets punishing suboptimality in no Dutch-booking).
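To make "situations that take advantage of every suboptimality" concrete, here is a minimal sketch of a money pump, a simpler cousin of the Dutch-book arguments behind coherence theorems. It is an illustration under my own assumptions, not a construction taken from John's posts: the names `run_money_pump` and `make_preference`, the three items, and the fee are all hypothetical.

```python
def run_money_pump(prefers, items, start_item, fee, rounds):
    """Repeatedly offer the agent a swap it strictly prefers, charging `fee` per swap.

    Returns the final holding and the agent's net wealth change.
    """
    holding, wealth = start_item, 0
    for _ in range(rounds):
        # Look for any item the agent strictly prefers to what it holds now.
        offer = next((x for x in items if prefers(x, holding)), None)
        if offer is None:
            break  # no exploitable trade exists from this holding
        holding, wealth = offer, wealth - fee
    return holding, wealth


def make_preference(strict_pairs):
    """Build a `prefers(x, y)` predicate from a set of (better, worse) pairs."""
    return lambda x, y: (x, y) in strict_pairs


ITEMS = ["A", "B", "C"]
# Cyclic (incoherent) preferences: A > B, B > C, C > A.
cyclic = make_preference({("A", "B"), ("B", "C"), ("C", "A")})
# Transitive (coherent) preferences: A > B > C.
transitive = make_preference({("A", "B"), ("B", "C"), ("A", "C")})

cyclic_result = run_money_pump(cyclic, ITEMS, "A", fee=1, rounds=3)
transitive_result = run_money_pump(transitive, ITEMS, "A", fee=1, rounds=3)
print(cyclic_result)      # ('A', -3): same item as at the start, three fees poorer
print(transitive_result)  # ('A', 0): no swap is attractive, so nothing is lost
```

The exploit only works because the environment actually offers the three swaps that close the cycle. Remove one of them from the class of environments and the incoherent agent is never punished, which is the sense in which the environments are part of the hypothesis.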
Note though that while a broader class of environments might be necessary for proving the theorems, it makes applying them more difficult by putting more conditions on the environments in the real world setting. There is thus a trade-off between making it possible to prove the theorem (more environments) and making it possible to apply it (fewer environments). We thus want as small a set of environments as possible while still being large enough to leverage the selection.
In the original post, John takes pains to split agents’ type signatures into different components and to explain how they interact with each other. At the level of abstraction used here though, we only need to understand that type signatures are necessary conditions on the agents coming from the selection: if an agent is to be selected, it must do X (or do X with high probability).
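One way to write this shape down, as a rough schema rather than a statement of any particular theorem: the symbols below are my own notation, with \(\mathcal{A}\) a space of agents, \(\mathcal{E}\) the class of environments, \(\mathrm{Sel}(a, e)\) meaning agent \(a\) passes the selection in environment \(e\), and \(C\) the constraint (type signature). Quantifying over every environment is one possible reading, chosen to show the environment class sitting in the hypothesis.

```latex
% Deterministic form: passing selection across the environment class implies the constraint.
\[
  \forall a \in \mathcal{A}:\quad
  \bigl(\forall e \in \mathcal{E},\ \mathrm{Sel}(a, e)\bigr)
  \;\Longrightarrow\; C(a)
\]

% Probabilistic form: among selected agents, the constraint holds with high probability.
\[
  \Pr_{a}\Bigl[\, C(a) \;\Big|\; \forall e \in \mathcal{E},\ \mathrm{Sel}(a, e) \Bigr]
  \;\ge\; 1 - \varepsilon
\]
```

In both forms, shrinking \(\mathcal{E}\) weakens the hypothesis, so the theorem becomes easier to apply but harder to prove.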
What sort of conditions do selection theorems show? Here we have a discrepancy between what selection theorems historically prove and what John wants to get out of them. Existing selection theorems only prove behavioral necessary conditions: you must act like this (as in coherence theorems) or you must be able to do that (as in the Gooder Regulator theorem). On the other hand, what we truly want are structural necessary conditions — for example “you must have a separate world model with this interface and these properties”. John’s third post on selection theorems is all about how he wants that.
Indeed, structural constraints tell you not only that the system must solve the problem, but how it will do so. Alignment just becomes easier if we have knowledge of the internal structure of the system: we can make more pointed predictions about how it might be unaligned; we might use this structure for more concrete alignment schemes. Fundamentally, structural constraints give us back some of the guarantees of the main epistemic strategies of Science and Engineering that get lost in alignment: we don’t have the technology yet, but we have some ideas of how it will work.
I’ll go into more detail about proving structural constraints in the next section, but for the moment just note that this is the sort of thing we want.
Selection theorems thus have the following structure:
Existing selection theorems only prove behavioral constraints; that is, they only show that the agents must be behaviorally equivalent to a specific class (like EU maximizers in coherence theorems) or that they must be able to solve a specific problem (remembering all relevant data in the Gooder Regulator theorem).
How to prove selection theorems for behavioral constraints? Looking at the existing theorems, the first thing to notice is that they tend to use selection criteria. It makes sense, as they tend to be proved backwards: looking at the necessary condition on agents, what criterion selects only agents behaving like this?
It doesn’t mean such theorems are trivial or useless, just that they tell us which criterion selects for the necessary condition, not what is selected by some selection pressure.
Here instead of criteria, mechanisms are favored. This is mostly because we want to show that some process (natural selection and/or ML training) leads to structural constraints, not find criteria for structural constraints.
Note that we should expect any mechanism to find some good ad-hoc agent without the structural properties; selection theorems for structural constraints can thus only give probabilistic guarantees. They say “out of the agents favored by this selection mechanism, most/almost all will have these structural properties”.
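The quantity such a probabilistic guarantee talks about is a conditional fraction: among the agents that pass selection, how many have the structural property? Below is a minimal sketch of that bookkeeping. Everything in it is hypothetical: the `Agent` record, the `score` threshold standing in for selection, the `has_world_model` flag standing in for an inspectable structural property, and the made-up population.

```python
from collections import namedtuple

# Hypothetical agent record: `score` stands in for performance under selection,
# `has_world_model` for a structural property we can somehow inspect.
Agent = namedtuple("Agent", ["score", "has_world_model"])


def structural_fraction(agents, is_selected, has_structure):
    """Fraction of selected agents that also satisfy the structural constraint."""
    selected = [a for a in agents if is_selected(a)]
    if not selected:
        raise ValueError("no agent passed selection")
    return sum(1 for a in selected if has_structure(a)) / len(selected)


# Made-up population, purely to exercise the function.
population = [
    Agent(0.95, True),
    Agent(0.92, True),
    Agent(0.91, False),  # a good ad-hoc agent without the structure
    Agent(0.97, True),
    Agent(0.40, False),  # not selected
]

fraction = structural_fraction(
    population,
    is_selected=lambda a: a.score >= 0.9,
    has_structure=lambda a: a.has_world_model,
)
print(fraction)  # 0.75
```

A structural selection theorem would claim that this fraction is close to 1 for the real selection process. The hard parts, which the sketch ignores, are sampling agents the way the mechanism actually finds them and checking the structural property at all.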
Here are some epistemic strategies to argue that the typical agents selected by a selection theorem on behavior alone should in expectation have additional structural constraints. The list isn’t exhaustive, and I expect these strategies to be combined when actually arguing for such structural constraints.
Proving selection theorems uses the following epistemic strategies:
Even pure mathematicians don’t prove theorems only for the joy of the proof: the value of a theorem often comes from what it shows and where it can be applied. The same holds in alignment, with the additional difficulty that we want to apply it to real world systems and situations, not only to other abstractions. This means we need to understand when we can apply selection theorems and what we can learn from that application.
First things first: selection theorems require the existence of selection. Once again quite obvious, but it becomes more interesting if we dig into the subtleties.
How to argue for the existence of selection depends on whether the theorem uses a mechanism or a criterion.
The other requirement concerns environments. Not only do we need the variety of environments over which selection is taking place, but environments also need to fit the mold assumed in selection theorems. Coherence theorems for example require that bets can be defined in the environments with the required properties, and that the space of bets considered contains the Dutch-booking strategies for any suboptimal policy. The Gooder Regulator Theorem has more concrete requirements in terms of the underlying causal structure, and the same sort of variety constraint on the “tasks” that the agent has to solve.
Once we are confident the selection theorem applies in our concrete setting, we can reap its fruits. But what are those fruits? At first glance, they’re obvious: the necessary conditions stated in the theorem! Yet anyone who has ever applied a theorem to a real world setting knows how perilous that task is.
How do you make sense of the necessary conditions in your setting for example? You need to find a way of grounding the constraints on agents you get out of the theorem.
This is where most applications of theorems to real world settings go wrong, in my opinion. Yet this is also the part I have the least to say about, because I just don’t have some nice epistemic strategy to check that some conclusions taken from applying a theorem to situation S actually make sense. I’ve seen people do that move, I’ve made it myself, but I don’t have a nice description of the underlying algorithm. So let’s flag that as an open problem for the time being.
Last but not least, analyzing an epistemic strategy tells us where it can go wrong. The analogy to keep in mind here is falsification: a standard and strong way of trying to break a scientific model. What does that look like for selection theorems?
Let’s use the summary design patterns of the previous section and, for each one, find issues/criticisms/ways of breaking that step.
Lastly, in addition to criticizing a specific application of the theorem, we might argue that the theorem cannot be applied to the wanted setting, or that it doesn’t make sense to conclude what is wanted from it. This amounts to the points above, with the twist of arguing that it’s impossible instead of just breaking the argument at some joint.
This obviously fails to list all possible ways of critiquing a selection theorem and its application. You might have noted that I didn’t say anything about interpreting the necessary condition once the theorem is applied; indeed, without understanding the epistemic strategy involved, it’s harder to get to the core.
Still, any criticism and feedback along these lines would be directly useful to the researcher (John or someone else) proposing a new selection theorem and/or applying one. My claim is that using the design pattern above helps in providing feedback, by drawing attention to the most important parts of the epistemic strategies involved.
Two things I'd especially like to highlight in this post:
Fundamentally, structural constraints give us back some of the guarantees of the main epistemic strategies of Science and Engineering that get lost in alignment: we don’t have the technology yet, but we have some ideas of how it will work.
This is possibly the best one-sentence summary I've seen of how these sorts of theorems would be useful.
One corollary of recovering (some of) the usual science-and-engineering strategies is that selection theorems would open the door to a lot of empirical work on alignment and agency. Thus the importance of this section:
Proving structural selection theorem
Choose a selection mechanism to investigate.
Find a structural constraint that should be favored by the mechanism.
Prove the theorem.
Show that agents with these structural constraints are easier to find.
Show that many agents without the structural constraints can be easily found by the selection pressure.
Show that agents with structural constraints are a majority.
Show that there isn’t a majority of selected-for agents with structural constraints.
Show that agents with structural constraints are easier to sample.
Argue that the set of selected-for agents is different from the one used in the work, and that for the actual set, sampling agents without structural constraints becomes simpler.
Propose a sampling of agents and show it results in structural constraints with high probability.
Show that the proposed sampling disagrees with what the selection pressure actually finds (showing that the probabilities are different, or that one can sample agents that the other can’t).

Checking that the selection theorem applies
Check that the selection exists.
For a mechanism, check that it fits with how selection happens.
Show that the actual selection works differently than the mechanism described, and that these differences influence massively what is selected in the end.
These are all potential ways to empirically test various kinds of selection theorems.