This post summarizes research conducted under the mentorship of Evan Hubinger, and was assisted by collaboration with Pranav Gade, discussions with Adam Jermyn, and draft feedback from Yonadav Shavit.
Deceptive alignment seems like a real big problem. Especially because we can’t use behavioral incentives to prevent it. One alternative to behavioral incentives is regularization, AKA mechanistic priors - we look at the structure of the model to try and figure out if it’s deceptive or not, then penalize models accordingly. In particular, there are some hopes that a speed prior might be anti-deceptive. This is because in the extreme case, the fastest way to do a particular task never involves deception.
This is because it’s just extra steps: spending the extra compute to model your current situation, understand that you are being trained, that you need to protect your goal, and figuring out that you should comply with the training objective for now. All that takes more computation just being inner-aligned and completing the training objective because you want to. The deceptive agent saves on complexity by having a simple value function and reasoning its way to the training objective - the non-deceptive agent does the exact opposite and saves on computation time by just storing the training objective internally.
So what if we formalize “speed” as boolean circuit size, and pick the smallest circuit that performs well on our task? Do we get a guarantee that it’s not deceptive?
Well, no. In short, this is because the fastest search over algorithms does not necessarily find the fastest algorithm. For example, imagine you’re tasked to solve a problem as fast as possible. You take a moment to think about the fastest way out, conducting an inner search over object-level algorithms. Would you sit around and think until you’ve found the fastest possible object-level solution? No! You think a little bit, and basically do the first thing that seems like it’ll work and be pretty fast. If I want to leave the house in a hurry I get moving immediately, I don’t plan out my exact route and try to turn every corner perfectly. The fastest way to search over exit strategies does not yield the fastest exit strategy.
But not all hope is lost! It turns out that speed partially incentivizes simplicity, but it’s probably also the case that simplicity partially incentivizes speed. This is because simple searches will tend to find fast programs, since it’s simpler to run a bunch of programs in parallel rather than memorizing which ones halt in order to do a strictly serial search by description length. Each of these relationships work by a different mechanism, so it’s a serious gloss to frame it all in terms of “incentivizing”, but there certainly is a tangle of relationships between speed and simplicity that seem to go both ways:
So instead of just approaching pure simplicity, we may be able to find a mixed speed/simplicity fixed point somewhere. This gets us the problem statement for the project: find a fixed-point P*, such that P* is a prior where the search that scores best on P* will use P* for its internal search process, instead of something else. Or at least that’s the ideal - some kind of attractor basin would probably also work, even if the fixed point itself is never attained for some reason. Limiting behavior, bounds, guarantees, something like that is what we’re going for here.
Special thanks to Pranav Gade in this section for cranking through this stuff with me <3
As a first attempt to get a handle on the problem, we assumed that all internal searches over algorithms would work by dovetailing. Dovetailing is a very simple way to search over the countably infinite set of Turing machine computations, with pseudocode as follows:
def dovetail(loss, tm_generator):
tm_list = 
for tm in tm_list:
if state.terminated and loss(state) == 0:
This will eventually run any given TM for any given amount of time. A helpful way to visualize it is on a grid:
The x-axis usually lists TM’s in order of increasing description length. In other words, this keeps on iterating through Turing machine computation-space in this triangle expanding out from the origin, considering longer runtimes and more complex programs, until it reaches the right TM-step coordinate to get a TM that returns and meets some search criteria (in our case, zero loss on whatever the training set is).
This demonstrates the core dynamic that seems to make simple search algorithms produce a bias towards both speed and simplicity. The simplest way to enumerate programs is in increasing order by complexity (and indeed, any possible enumeration of programs must eventually trend towards increasing complexity). And the simplest way to evaluate programs is to run them in parallel, since running them in series would require memorizing which ones don’t halt. Note that this argument is mostly illustrative of a particular intuition, and not intended to make any sort of rigorous claim about search algorithms in general.
But we can be more specific than “mixed speed/simplicity bias”. This grid visualization is mostly helpful to us because it makes it apparent that dovetailing is really doing a search according to isoline rank-order, according to a mix of TM description length and runtime. If we assume that our dovetailer accepts the first computation that meets its criteria, then it’s a bit fuzzy to talk about the “prior” that it implements. We could insert randomness in various ways if we wanted. But note that in any application of a forwarding prior, we’re going to want the prior to be heavily weighted towards the algorithms that score best, so we maintain a high probability of keeping our guarantees. So the behavior of dovetailing is similar to how we would want our forwarding prior to behave anyway. Beware - throughout this post I will be flip-flopping back and forth between the speed/simplicity prior formulation, and the runtime/program-size cost formulation, as it suits me. But they’re totally equivalent.
Each “batch” of TM-steps that it considers has equal runtime+description length rank. I say “rank” here because the number a TM gets assigned along the x-axis is not equal to its description length - there are exponentially many possible TM’s of length n, so the ranking of any given TM grows about exponentially with its description length. Ordering of TM’s with the same description length seems pretty arbitrary, and I basically pay no attention to it.
Point is, this is approximately the prior implemented by a dovetail search: it always selects the TM that has the minimum runtime+exp(description length). In this particular implementation it uses description length as the tiebreaker, but it could easily traverse each iso-line the opposite way and use runtime as the tiebreaker.
Finally, we can tune this prior by adjusting the slope of the iso-lines, which we do by adjusting the ratio of how many TM’s the dovetailer adds to its list each iteration, versus how many steps of each TM the dovetailer executes. For example:
So every time we build a dovetailer, it’s going to select the model that scores best according to some prior of the form a*runtime + b*exp(description length).
Now we need to deal with forwarding stuff, which means two levels of search, which means being very careful not to get confused.
To make this analysis tractable, we’ll restrict the model class under consideration to just dovetailers that follow the pseudocode above, allowing variation only in the slope-controlling values a and b. Our goal is to find a prior P* such that the dovetailer that scores best on P* uses P* internally. In the previous section we already identified the prior implemented by a dovetailer. Which means that all we have to do is assess the runtime and description length costs of a dovetailer, and see if we can find a fixed point. Well, how much does a dovetailer cost?
Since we only allow the values of a and b as free variables in our code, then the only difference between the description lengths of various dovetailers is going to be the number of bits to store two integers. But not quite - what we really care about is the ratio of those integers, since the slope is what sets the internal prior. A 2/2 search is practically identical to a 1/1 search, just rougher at the boundary and more expensive to program. So how many bits do you need to store a fraction?
log(a) bits to store a, log(b) bits to store b, and then you save 2*log(gcd) bits by factoring the greatest common divisor of a and b out of both of them, reducing the fraction to simplest form. What does this thing look like?
Pretty gnarly. The plot on the left is the complexity of a/b as a function of a and b, the plot on the right is the complexity as a function of log2(a/b). Note that the plot on the right only includes fractions from 1/1 to 100/100, the same set as in the plot on the left. The roughness of this function is going to make it pretty hard to find anything like a basin of attraction, but at least it’s something we can explicitly write down.
How about the runtime of a dovetailer? How will that vary with a and b?
If the runtime of the dovetailer is dominated by running the TMs under consideration, then the runtime of the whole process is (approximately) just simple geometry!
But here we encounter problems too, because the point where the search terminates isn’t unique. There are an infinite number of terminating points, and any extreme point will sometimes be optimal.
I don’t know how to go from here to a general verdict about dovetailer forwarding. We know how to write down the program size + runtime costs of a given dovetailer slope, and we (kind of) know what kind of results a dovetailer of given slope. But there are two important reasons to think that the forwarding behavior between levels will be very jumpy and unpredictable, rather than any sort of well-behaved basin of attraction. First, the available set of programs that can be found by dovetailing is limited, meaning we cannot support any desired mix of speed and simplicity - and furthermore, the available points will likely change at each level of sub-search.
Of course, this entire analysis is still very toy, and highly dependent on strong assumptions about dovetailing. Further exploration might find that those problems go away when we relax those assumptions.
The next approach I investigated was using priors that enforce some property not just over a program, but also over the sub-programs within it. This tries to attack the thing that makes the Min Circuits proof work - the fact that even if a circuit is minimal, the various sub-circuits within it don’t have to be the minimal circuits “for that task”. If we lived in a world where the fastest way to do X and Y and Z was to take the fastest way to do X, and the fastest way to do Y, and the fastest way to do Z, and paste them together, then the Min Circuits argument wouldn’t be a problem.
So, can we think of any prior with that property? Well, what we’re essentially asking for is a prior that can be maximized greedily, such that composing the ideal solutions for sub-tasks will give you an ideal solution to the whole composition of tasks. And when I think of greedy algorithms, my mind immediately goes to trees. What if we forced our circuit to be a tree? Concretely, this would mean using the same boolean circuit representation as the Min Circuits post, but with the requirement that no boolean gate can have an output that gets sent more than one place (input wires are allowed to be copied as many times as you want though).
If we take the minimum circuit under this tree constraint, then it seems like it has exactly the nice property desired. I don’t know how to prove this very formally, but I can sketch it. Let’s posit that we use the same sort of composition of tasks as in Min Circuits. Then the overall solution circuit-tree will be forced to contain a sub-circuit-tree which computes a solution to each of the tasks in the task set - it can’t do it any other way, since there’s no way to share components across tasks without wiring one output to multiple places. And if any one of these sub-circuit-trees is too large (i.e. not the minimal circuit for getting zero loss on that particular task), then the overall circuit-tree is too large too. This is a contradiction, so enforcing this prior on a task set forwards the prior to all the tasks in that set.
However, this prior also seems like it will have really serious competitiveness problems. First of all, forcing the circuit to be a tree is extremely restrictive - it essentially amounts to implementing your entire program without using any variables, and re-computing every time you want to re-use something. This is the same as asking for your program to be a Boolean expression, and then regularizing on the length of that expression. It may not be quite as prohibitive as that makes it sound, since it also seems quite similar to functional programming (albeit without helpful tools like map and fold). Perhaps in implementation we could use the same measures functional programming systems use to be competitive (clever memoization and caching, I assume?), while still applying a speed prior as if we hadn’t used the speedup? Doing that on a neural network instead of a legible boolean circuit is also a major obstacle, but we’re deep in theory-land right now and I’m mostly concerned with whether this strategy can be made competitive in principle.
A second, potentially more serious difficulty is the effect of this prior on generalization. If we imagine these Boolean circuits as working by evaluating a set of hypothesis programs in parallel, then observing the environment and picking the correct hypotheses output, then a larger hypothesis class is always going to make the tree larger. This is mostly the same tendency to produce lookup tables which comes up for speed-like priors in general. However, the tree restriction may make the lookup table problem more difficult to work around, since it seems to revolve around forbidding any sort of clever programming synergies. After all, even extremely simple Boolean circuits like one-bit adders involve wiring the output of a gate to two inputs.
Next, I tried a different framing of the problem. Why is it that the fastest search process doesn’t return the fastest algorithm? Well, if we sketch a simple breakdown of the runtime of a search process, we get something like this:
The problem is that f* is such a negligible component of r. This seems to happen mainly because f is approximately as large as f*, and n is probably very large. If we can figure out a way to give discounts to the other steps of the process so that f* dominates, then we can just use a regular speed prior and it will forward properly. But how do we do that? The code run by an inner search over algorithms is never explicitly written down anywhere, it’s all computed implicitly. We won’t be able to identify which parts of the computation to count and which ones to discount when we compute a prior, unless we have perfect interpretability.
Well, what if we give the AI safe cheap tools? In particular, the f and l steps seem quite easy to automate, without obvious room for galaxy-brain algorithm speedups. What if we gave our AI access to a box that would take in a program, evaluate that program on a training set that we determine, and return the batched loss? It seems plausible that even a powerful AI model would be incentivized to use this box, because it saves it the cost of having an internal testing set, and because we could apply a discount to all computations inside the box (for the purposes of our prior). Note that it’s crucial that the box only return batched loss on a large training set, so that it can be used to cheaply evaluate the f+l steps, but not the f* step. If all that holds, then we could use this as a speed prior where f* is a much larger portion of r, and encourage forwarding.
Note that to make this robust to multiple levels of internal search, we can’t just make the box free. Otherwise, if the policy passed into the box to be evaluated contains another level of search, it will be conducted with zero penalty for runtime - exactly the opposite of what we want. So to produce consistent properties across arbitrary levels of search, we apply a multiplicative discount at each level. Ultimately, this would hopefully produce a tree of recursive box calls, where at each level the discount applied to the boxed parts of the algorithm means that the overall cost of the algorithm is dominated by f* - the runtime of the object-level algorithm getting executed. In other words, forwarding of a speed prior across arbitrarily many levels of search!
The core thing that the last proposal was trying to do was to get f* to be most of r. In other words, to get the runtime of the level-n+1 program to be most of the cost of the level-n search. But if we have multiple levels of search, then the cost of the level-n+1 program will also be mostly driven by the cost of the level-n+2 program, and so on until the chain of inner searches stops. Which means that ultimately, the cost of the whole program is getting driven by the runtime of the algorithm all the way at the bottom of the chain of searches. This is another way of framing what we want - the real goal of forwarding is ultimately just to put a speed prior on object-level algorithms.
So why don’t we do that directly? What if we found a way to incentivize the search over algorithms to be done explicitly instead of implicitly? That way, we could see the whole tree of sub-programs, and apply our speed prior to the object-level algorithms at the bottom of the tree when evaluating the whole thing.
We train a two-headed model. Head 1 just gives a binary output - this is supposed to let the model decide if it wants to do meta-learning, or if it just wants to take an object-level stab at the problem. If Head 1 outputs a 0, then the output of Head 2 is interpreted as an object-level answer, and the runtime of the program determines the cost. If Head 1 outputs a 1, then we run a second pass of the model with some kind of semi-informative input, and the output of Head 2 from that pass is interpreted as a learning algorithm. Then we feed it the original input, and give it the same meta-or-object choice.
The hope is that this results in a chain of searches outputting code for further searches, until eventually one has M==0 and produces an object level algorithm, whose runtime is then used to evaluate the cost for all of its “ancestors”.
Even if the code being passed from level to level isn’t human-interpretable, it should at least represent real learning algorithms. This is because the cost of doing search over algorithms is n(m+f+l)+f* in runtime, and generator + evaluator in program complexity. When using the box, however, the runtime penalty is reduced to f*, and the complexity penalty is reduced to complexity(f*). The speed component is strictly better, and the complexity component seems like it will usually be better - hence programs that would want to spin up search processes will be incentivized to use the box instead (assuming formatting the program is cheap, which seems safe for hard problems that require big programs). This is especially true if the model is penalized according to a speed-heavy prior.
You will probably notice a glaring problem in the above section - the use of “???” standing in for “some semi-informative input” where I don’t know what input should be fed to the lower-level meta-learners. One approach might be to simply feed in the original input again. However, this produces a situation where if our model chooses to meta-learn, then it can spend arbitrary amounts of time or complexity generating a program for the next level, AND it knows exactly what input that program will get. Hence, an obviously optimal strategy is to output a 0 on Head 1, and then give us an algorithm like “return x” on Head 2, where x is a pre-calculated solution.
Another approach might be to run a second pass of our model once it has chosen to meta-learn, with an uninformative input that has nothing to do with x. This way, if the model chooses to do meta-learning, then it has to design its next-level learning algorithm without foreknowledge of what that algorithm will be fed. However, with no information whatsoever about the input, it hardly makes sense to search over algorithms to begin with. Remember that we’re assuming the way inner search over algorithms arises under a speed prior is because of cluster-like tasks, where it’s faster to spin up a search process specially suited to the specific problem at hand. Ideally, I would like to figure out some way to allow our model to develop a task-specific next-level algorithm, without going so task-specific as to literally write “return x”.
Another strange thing about this approach is that it completely ignores the mechanistic properties of the higher-level algorithms, putting all of the prior’s weight on the object-level. On the one hand, this is exactly what we want. On the other hand, this means that the optimal strategy under this prior likely involves unbounded amounts of program complexity and runtime, which is a bit of a competitiveness problem.
We might attempt to fix this by putting a weak prior on the higher-level algorithms, and just leaving most of the weight on the object-level. However, this runs into a problem too - hypothesis class restrictions. Both the speed and simplicity priors (but especially the speed prior) like hypothesis class restrictions in their models. And if we’re going through multiple levels of meta-learning, then we somehow need to make sure that the top-level program M0 is going to give us an M1 which is going to give us an M2 which … will have a hypothesis class containing high-speed programs. After all, there’s not much use in a simplicity prior over object-level algorithms if the hypothesis class is full of only slow options to choose from.
If the first difficulty can be resolved, this second difficulty might become an interesting relaxation of the original forwarding problem. Instead of finding a forwarding prior that places the fastest algorithms at the top of its preference ordering, we just have to find a chain of models that will not eliminate the fastest algorithms from the hypothesis class of the next level down. And the higher-level models themselves don’t have to be selected according to any speed prior - they just have to be constrained somehow so they’re not literally unbounded. Then we can use the tuple-box approach to just apply a heavy speed prior to the bottom layer ourselves. This is definitely less clean than the fixed-point formulation, but seems potentially easier.
I expect this post to be really only useful to folks interested in thinking further about this specific sub-problem. So I figured to conclude, I would write down some of the intuitions while working on this and other maybe-helpful stuff that didn’t obviously fit in with any of the particular proposals above.
Another framing for what exactly goes wrong with the circuit size prior is just that it only gives us an average-case speed guarantee. Problem is, we really want something that will prevent any small part of the circuit from hosting deception. An average-case guarantee doesn’t save us from small local deviations, and even one of those is enough to open up the possibility of unrecoverable errors.
Another type of speed prior, the circuit depth prior (made integrable by just ignoring gates not connected to any output) has a problem with leaving most of any given circuit unconstrained. If I choose a task set where one task is highly serially expensive to answer, and the others are all easy, then a circuit depth prior will constrain the sub-circuit for the difficult task to be as short as possible. But that’s all - it gives a uniform distribution over all the other sub-circuits which solve their respective task without exceeding the depth limit set by the hard task.
This is a lot like the problem with the circuit size prior, but worse. An average-case guarantee puts limits on every sub-circuit, but leaves too much wiggle room within those limits. But depth gets us a worst-case guarantee, which focuses all its attention on the “critical path” and leaves the rest of the circuit completely unconstrained.
When framed as a Turing machine, the program-successorship framing of the forwarding problem suggests an approach based on fixed-points or attractors. But when framed as a circuit, “successor” programs are just sub-circuits, and the forwarding problem becomes about getting guarantees that apply across all sub-circuits. Still inductive, since circuits are composed of circuits, but an interestingly different framing that I like.
Besides the intuition that a fixed-point prior should exist, can we actually prove it? Let’s try some math.
A prior over algorithms can be represented as a point in some kind of space. The forwarding problem is trying to find a fixed point when mapping this space to itself. There are a bunch of fixed-point theorems out there, which may or may not be applicable to this particular sort of space and mapping. So let’s try and pin down some properties of the space and the mapping, and see if we can find any applicable theorems.
The space of priors over algorithms is the part of Hilbert space where the sum of all the coordinates is 1. This means it’s at least a complete metric space, but it’s not a proper vector space since it doesn't include the origin.
The mapping in question is from the outer prior we apply to the inner prior of the optimally-performing search algorithm on that outer prior. Something like this:
This requires us to come up with a set S of all algorithms that perform search over algorithms, which is a bit of a leaky category, but that doesn’t really seem like a very central problem. The main thing here that I’m worried about is that argmax. I expect this map to be highly discontinuous, as small changes in p can produce a very different argmax(p(m)) if the global maximum switches from one local maximum to another. Without a continuous map, it’s going to be hard to prove that a fixed point exists.
There are a lot of open threads in this project, in the various problems with the proposals I outlined above. But there are also some other open questions that seem maybe fruitful to investigate.
Nice work! And I endorse the objections to handing the AI tools... that doesn't seem to forward well.