This post was written under the mentorship of Evan Hubinger, as a part of the Stanford Existential Risks Institute ML Alignment Theory Scholars (MATS) program.
Thanks to Evan Hubinger for his mentorship under the SERI MATS program, and to Arun Jose, Rob Ghilduta and Martín Soto for providing prior references and some feedback. Also thanks to James Lucassen for reviewing a draft with me and providing extensive feedback.
Epistemic status: Somewhat uncertain. A lot of the arguments related to usage of the speed prior employ high-level reasoning and lack substantial formalizable proofs. Acceptance of any of the arguments in this post should be conditional upon comfort that they can be reduced to very precise provable statements about properties produced by inductive biases subject to a speed prior as in the described proposals and implementations. It is also not clear how directly mappable these notions of speed priors are to NNs relative to their treatment in generic models of computation (like TMs or circuits).
Introduction
This post examines the current literature on the speed prior before later diving into some new ideas on how to forward the speed prior in instances where mesa-optimization can occur.
The first 2 major sections of this post do not contain original content. In these sections, my contributions consist primarily of distilling and re-examining existing work on the speed prior. In the last 3 sections, I consider some additional new proposals for forwarding the speed prior. I finish with a speculative appendix section on possible ways of employing speed priors for run-time detection of deception, and on using a run-time form of the speed prior to rule out cryptographic cognition within deceptive models. I also include an appendix on a failed approach to forwarding the speed prior.
What is the speed prior and why do we care about it?
The speed prior is a potential technique for combating formation of deceptive alignment. This is achieved by imposing an inductive bias within the training process that makes deceptive formation exceedingly unlikely. This technique rests on the following conjecture:
Conjecture. For any given set of tasks T that do not require deception to solve, a program P that solves the task set with a given performance level L is not deceptive if it is minimal with respect to all programs Pi that solve the task set with performance level L across any of the following family of speed measures:
- Considered as a Turing Machine TP, take the (a) average or (b) maximum number of steps across all inputs {0,1}<ω on which the TM halts.
- Considered as an arithmetic or boolean circuit CP, take the (a) circuit-size in terms of total number of gates or (b) circuit-depth in terms of the largest number of edges in a path from an input to an output gate.
In particular, this conjecture is based on the intuition that “planning how to be deceptive is one step removed from planning how to accomplish the intended goal” of the model. In other words, if we can train models that (with probability close to one) find the minimal viable computation for any given task where minimality is computed according to a speed measure, then we can guarantee they are not deceptive.
We provide a quick analogy for those not familiar with this intuition. Let’s say I need to do some task as per request from my boss. The most aligned possible worker (relative to my boss) would simply want to do the task, and have no thoughts about how the task helps them be evaluated by their boss. On the other hand, if I am thinking about performing the task because my boss asked me to, I can potentially be thinking about questions not just of what to do, but why I am doing it, including the merits of evaluation. In particular, pursuing this idle background thought could lead me to conclusions like: if I do the task for my boss well and get the attention of my boss’s boss, I could eventually be promoted and supplant my boss at his position. If my thinking always focused on the object-level task at hand, that would preclude precursors to actions incompatible with the goals of my boss. This scenario is a rudimentary example of using a speed prior to avert deceptive alignment.
How does the earlier corollary translate to imposing a “speed prior” in practice? In particular, we may take the form of penalties to a loss function such as:
- The total network size of the model in a dynamically growing architecture.
- Some measure of the average number of computations performed by the model in the context of a task, such as the total number of activations in a forward pass across a variety of task instantiations divided by the provided task count.
- A direct estimate of the model’s depth as a circuit.
We briefly examine the literature of existing approaches to characterizing and operationalizing a speed prior before proceeding with some new approaches. Readers that are already familiar with past thinking about the speed prior can skip ahead to the section on “Maze games as toy task models for deceptive alignment.”
Existing approaches to the speed prior
We start by quickly summarizing a variety of reasons originally due to Evan on why the speed prior may be effective or ineffective at addressing deceptive alignment concerns. The interested reader can consult the original post on a full expansion of these ideas.
Reasons to be Optimistic about the Speed Prior
- Claim: In the context of the ELK setup, assume there is a very large model that has learned all of the relevant aspects of a human’s world model and cognition, with the model representable as a large Bayes net. Assume further that the human portion is also represented internally as a human Bayes subnet of the large model net. If we are imposing a speed prior in the context of optimizing a reporter for ELK, then just re-using the model’s net (sans human net) should always be preferred than trying to map from the model net to the human net which will require explicit computation.
- Rebuttal: This claim is contingent on some subtle properties of the speed prior, namely that it is the depth of the circuit that is what is penalized (or TM time complexity when using a Turing Machine model of computation). This means the prior will actually favor a reporter that has the best loss per computation relative to its descriptive complexity, which (depending on the relative complexity between the director translator and the human imitator) is a race that could be won by the human imitator without very careful adjustment, as the direct translator is potentially unbounded.
- Claim: Deceptive alignment requires deceptive cognition in the sense of modeling the intended goal for optimizing the base objective in addition to the proxy goal. Because the latter computation has to run in conjunction with task execution, imposing a sharp speed prior might “squeeze out” the auxiliary cognition necessary for deception.
- Rebuttal: As detailed in “Are minimal circuits deceptive?”, there are strong reasons to believe that at sufficient scales of task variety and model size, meta-learners are selected by both simplicity and speed priors. However, fast meta-learners are predisposed to solving problems by yielding on net the simplest solutions, which is essentially a reduction to a simplicity prior that can in turn restore the risk of deceptive formation.
- Claim: Speed priors might most closely match priors imposed by natural selection via inclusive genetic fitness due to the well-known co-location of local memory with compute in brains and slow serial computation. Because we are solving for base objectives put forth by humans, searching for models with a similar prior to the one that selected humans might yield similar proxy objectives which happen to align closely with the base objective.
- Rebuttal: While there may be a heavy speed prior inherent in selection via inclusive genetic fitness, there might be a ton more imposed by all the bits of information recording other genetic and cultural artifacts that culminate in our value specification in such a way that the right base objectives are still very sparse in the space of all proxy objectives even conditioning upon optimization processes that apply a speed prior.
Reasons to be Pessimistic about the Speed Prior
- Claim: Speed priors incentivize taking shortcuts and may fail badly out of distribution. In particular, they may be anti-correlated with selecting for robust generalization. This is more of an argument about their competitiveness than their efficacy at tackling deception.
- Rebuttal: This might not be an issue for very concrete priors in practice, as simplicity priors are inherent in SGD and measuring actual descriptive complexity is computationally intractable. Thus it is possible some speed penalties could be introduced without impacting competitiveness at generalization.
- Claim: Speed priors are not universal and come in families with corresponding tunable parameters. In practice, it may be hard to find a computable and implementable speed prior. For example, it is known that TM max run steps and average run steps do not yield integrable distributions that could constitute a prior without an additional simplicity term.
- Rebuttal: Use circuit complexity which is integrable! Specifically, the number of logic gates in the circuit is a description length that includes an implicit simplicity component and gives an integrable distribution even when mixed with circuit-type speed priors.
- Claim: The double descent phenomenon shows that models get better then worse on test sets until they hit zero training loss, at which point they get better again. But the top of this interim peak (the interpolation threshold) is a natural place to converge for a speed prior if we equate model size to speed, which is natural given that smaller models take less steps to run.
- Rebuttal: Model size might not be the right proxy for speed here, and instead we might need to look at depth. But scaling laws show that in general many models with equivalent performance exist with varying (average) widths and depths, so a speed-as-depth prior may “flatten” our models without impacting performance but still meaningfully tackle propensity towards deception.
- Claim: The simplicity prior is manifestly better at modeling physics given how compactly simple equations model all of known physics.
- Rebuttal: This claim is not that relevant as the vast majority of interesting tasks that a model would be trained on and need to perform well on for almost any proxy goal are high-level phenomena. In particular, it is highly unlikely that strong models would perform de novo simulation or direct reasoning from physics, and instead favor usage of the kind of cognition that would be favored by speed priors.
- Claim: Speed priors collapse into simplicity priors at sufficiently large model sizes and task variety. Specifically, at sufficient scale the fastest solution that is favored by speed priors may be a search over algorithms, which in turn terminates in solutions that are on average the simplest for the problem at hand.
- Rebuttal: The standard reference on this claim is Are Minimal Circuits Deceptive? In this post, Evan sketches out a proof showcasing that for tasks specifiable as MDPs, if there is any task in the task set on which the model may become deceptive under some solution, then one can construct a task such that the minimal circuit (fastest algorithm in the circuit model of computation) solving it is deceptive. We focus on this problem later in the post by discussing how to “forward the speed prior.”
Other Musings about the Speed Prior
Evan’s arguments in favor and against the speed prior constitute a large base of known considerations. In this section, we review some of the summaries and extensions to these arguments and the attempts to make them operationally precise enough to convert to implementation proposals. Jointly with the previous section, this list constitutes the existing literature on usage of the speed prior in averting deceptive alignment.
- As Marc-Everin Carauleanu notes in his summary “Should we rely on the speed prior for safety?”, speed priors may collapse to simplicity priors:
- “Another way this argument could not hold is if spontaneous meta-learning is constrained by the speed prior. The argument discussed relies on one key fact: the fastest meta-learning algorithms don’t necessarily learn the fastest models.”
- This is indeed the central problem in employing the speed prior, as mentioned in the last claim in the previous section.
- Adam Jermyn similarly notes this in “Training Trace Priors and Speed Priors”:
- “A problem with speed priors is that if they have to search over inner models to run, the fastest way to do that may be to search over *simple* programs, which can lead to inner alignment failure because simple programs are likely deceptive.”
- Paul Colognese writes extensively about the speed prior in “Deception?! I ain’t got time for that!”. His main conclusion is that for speed priors that employ the Turing Machine model of computation, using an average speed prior (in the sense of average number of steps employed per input) may be a lot more robust against deceptive formation than a max speed prior.
- Note that Paul takes the stance of using a slightly more general condition than “deception” by considering any kind of “misbehavior” by programs, including idleness or purposefully pursuing unproductive dead-ends. The motivation for this consideration is similar to that in “LCDT, A Myopic Decision Theory”, where Adam Shimi and Evan point out that deceptive behavior may be very hard to pinpoint and could be more tractable to eradicate through an “overapproximation.” This refers to a simpler to specify condition that strictly dominates the other with probability close to one. In this case, they note that myopia may capture almost all deceptive behavior while still leaving ample room for the kind of generalization capabilities we would want from AGIs.
- To the bolded point above, Paul specifically notes that max speed priors may allow plenty of room for deceptive formation on inputs which do not consume the max compute available. By contrast, using an average speed prior raises the waterline across the board for any “misbehaving” actions/computations.
- After demonstrating that average speed priors are preferable to max speed priors, Paul concludes by implementing several proposals for implementing an average speed prior:
- 0. Perform a direct search that tracks average computation used over the training distribution T, until we find a program close to the argmin{P∈P:P solves T with good performance}κ(P).
- 1. Treating model size as a direct proxy for speed, train a model while slowly increasing the size and attempting to find good performance. As Paul points out, this approach is extremely uncompetitive, but it is also not clear how to smoothly deal with model size increase, as many architectures with varying widths/depths exist for the same model size, so this very much underdetermines the final solution but possibly in a way that is acceptable.
- 2A. The same as Proposal 1 but giving it an output channel on multiple rollouts as a kind of scratchpad. In this setting, the rollouts are meant to act as a mechanism for finding the smallest amount of computations within a given model necessary to get good performance, implementing a kind of max speed prior.
- 2B. The same as Proposal 2A with an auxiliary model trained to predict the minimum number of rollouts needed to achieve good performance.
- 3. Add penalty terms in the loss function to favor early exiting, effectively finding something like the almost-minimal depth for the model dynamically and hence implementing a circuit-type form of average speed prior.
- This proposal in particular seems to be only capable of finding close-to-fastest submodels (in the sense of sub-networks that output in deep latent space rather than terminal output layers), which points more to over-weighting the architecture relative to the task set and does not preclude the possibility of deceptive formation occurring purely through lower layers.
- In general, these seem like performance un-competitive techniques. Proposal 2B is less affected by this consideration but still seems unlikely to solve the problem, as requiring an auxiliary model to predict average rollouts is not a guarantee that models never find deceptive behaviors or develop cross-episode memory mechanisms through the additional computation provided by the scratchpad. In more severe and large-scale instances where such behavior slips through some training steps, the auxiliary model may even be susceptible to gradient hacking by being driven to predict a need for arbitrarily large scratchpads, though this ought to be caught in real-time training process statistics (and cloud compute bills!).
- Paul C also makes the interesting note that memory weakens speed priors due to decreasing the opportunity for myopia and increasing the likelihood of developing cross-episode memory, which can be used for caching information precursors to deception and retrieving it in future episodes. If a model can spread out its computation across episodes, then the effect of applying a heavy speed penalty can change drastically.
- [anonymous] runs a proximal complexity breakdown contrasting three types of mesa-optimisers: one which acts as a deceptive mesa-optimizer that employs naive reasoning about out-of-training checks, another one which employs a more sophisticated fast check, and a third which is intently or corrigibly aligned.
- [Anonymous]’s primary endeavor in the post is to construct a training strategy that consistently favors the third type of mesa-optimizer by employing and steadily increasing rollouts until performance is acceptable, although this seems unlikely to work in practice for sufficiently varied task distributions and model sizes. In particular, this is probably far too compute intensive to be competitive and also hinges on the assumption that the rollout increment is fine-grained enough so as to not mix the first occurrence of multiple mesa-optimizer types. Moreover, performance being acceptable is a fuzzy condition that could similarly admit overlap between multiple mesa-optimizer types without a discrete point where one is identified before the other (for example, if they use subtly different meta-learning procedures that are differentiated by an initialization parameter and that can both be found in one or a few SGD steps starting from an earlier model).
- Note this is conceptually similar to Paul C’s Proposal 2A as it also focuses on slowly increasing the number of rollouts.
- Note that the first type of agent, which compactly represents an out-of-distribution checking predicate while only negligibly increasing its descriptive complexity, was initially described in “Relaxed adversarial training for inner alignment.” That seminal post goes even further, providing some examples of predicates which cannot be detected even by amplified models trained adversarially on procuring predicates that are applied to the potentially deceptive model’s inputs (termed “pseudo-inputs”).
- James Lucassen discusses Attempts at Forwarding Speed Priors.
- In this post, it is acknowledged directly that the central problem in muting deceptiveness within the speed prior is tackling forwarding: ensuring that the fastest algorithm that performs searches itself finds fastest algorithms.
- James tries a toy model of inner searches over algorithms using dovetailing, essentially a form of interpolation in an enumeration of Turing Machines:
- Given an enumeration of TMs, run the first machine for one step.
- Then start a second machine and run both machines for one step.
- Continue indefinitely, adding and starting new machines from the enumeration.
- Alternatively, run the machines forward by K steps and add new machines by batches of L. This parameterization by the quotient K/L gives us a degree of freedom to consider toy models of “searches over algorithms,” but fails to yield tractable insight into forwarding due to “very discontinuous behavior” and the hard problem of finding a fixed prior. By “fixed prior” here we roughly mean that the dovetailer (enumeration) that gets selected by the outer optimization favors fast searches (fastest outer algorithm) which corresponds to dovetailers such that their found solutions are also fastest algorithms (fastest inner algorithms).
- James considers tackling forwarding using minimal circuits by avoiding DAGs within circuits, which run into standard problems, and instead attempts using trees, which are deemed insufficiently expressive for generalization and solving interesting problems.
- He then proceeds with a toy model for “explicit” meta-learning by structuring out a nested approach that unpacks object-level algorithms from algorithms that “choose” via an additional output head whether to descend to an algorithm search. The idea requires explicit architectural work on the model training process to work, but also runs into problems by assuming tasks are implicitly factorable into sub-tasks, which in the formulation eventually descends to meta-learners that potentially squeeze out the hypothesis class of relevant fast object-level algorithms.
- Several older missives on the subject not summarized here including one by Adam Demski and by Paul Christiano.
This section concludes our review of the existing literature on the speed prior.
Deconfusing the speed prior
What does it mean for an algorithm to be “the fastest”? In order to apply an effective speed prior, it seems we need a good operationalizable answer to this question. In this section we focus on deconfusing the notion of a well-defined speed prior by examining various definitions of speed in the context of algorithms.
Throughout the above discussions, we have typically assumed that we can talk about the fastest algorithm for solving a particular problem. Typically this follows directly from standard treatises of run-time complexity of algorithms:
- When modeling algorithms as Turing Machines, we typically consider either the average run-time complexity or worst-case run-time complexity. In the former, we are averaging the steps to halt when running over all possible inputs of length n as the bit-length grows and on which the machine halts (we need a finite number to average over, so we can’t consider all inputs). In the latter, we take the most number of steps executed by the machine on inputs of a fixed length.
- When modeling algorithms as circuits, we typically consider the minimal circuit size or depth. The former is defined as the minimal number of gates in a circuit expressing the algorithm. Depth is defined as the maximum number of edges between an input and output gate, and minimal depth as the circuit that is minimal with respect to that property expressing the algorithm.
However, as we will see, we need to be careful when talking about “the fastest” algorithm, because it does not constitute a unique point in algorithm space unless we are very precise.
Here is a standard example that shows the run-time complexity of an algorithm can depend on what kind of Turing Machine is used.
Consider Turing Machines for recognizing the language A={0k1k|k≥0}, i.e., Turing machines that accept if and only if a string from A is provided as input.
Let M1 be given by: on input string w,
- Scan across the tape and reject if a 0 is found to the right of a 1.
- Repeat as long as some 0s and some 1s remain on the tape:
- Scan across the tape, checking whether the total number of 0s and 1s remaining is even or odd. If it is odd, reject.
- Scan again across the tape, crossing off every other 0 starting with the first 0, and then crossing off every other 1 starting with the first 1.
- If no 0s and no 1s remain on the tape, accept. Otherwise, reject.
Consider the example 013113. This will proceed in four passes of step 2, in each iteration finding that the number of 0s is odd (13), even (6), odd (3) and odd (1). In fact, the sequence of parities interpreted as binary numbers (odd, even, odd, odd => 1101) gives the inverse binary representation of the number of 0s. But step 2a is checking that these parities agree for both 0s and 1s, and hence that their binary representations and therefore that their counts are equal.
The run-time of M1 is O(NlogN): each of the the first and last step are O(N) and one iteration of the middle step is O(N) as all of these require a full pass over the tape. Whereas 2(b) implies step 2 can run at most the number of digits in the input, or 1+log2(N)iterations, for (1+log2(N))O(N)=O(NlogN) total time complexity.
Now consider the same algorithm on a two-tape Turing Machine.
Let M2 be given by: on input string w,
- Scan across the tape and reject if a 0 is found to the right of a 1.
- Scan across the 0s on tape 1 until the first 1. At the same time, copy the 0s onto tape 2.
- Scan across the 1s on tape 1 until the end of the input. For each 1 read on the first tape, cross off a 0 on the second tape. If all 0s are crossed off before all the 1s are read, reject.
- If all the 0s have now been crossed off, accept. If any 0s remain, reject.
Each of the four steps is linear, and hence the run-time of M2 is O(N).
Each of these examples can be shown to be minimal and cannot be improved upon in their respective model of computation. So on single-tape TMs, the computational complexity of recognizing A is O(NlogN), whereas on double-tape TMs it is O(N).
In general, computational complexity theory was not designed to speak meaningfully about the absolute fastest program for solving a problem. Rather than pointing to absolutes (i.e. “the fastest algorithm”) it is equipped for relatives (whether one algorithm is faster relative to both another algorithm and a fixed model of computation). Because run-time complexity must always be considered relative to a given model of computation, there is a standard assumption that all “reasonable” models of computation are polynomially-equivalent – meaning that any of them can be simulated by any other with only a polynomial increase in running time – and is known as the strong Church-Turing Thesis. It is generally unproven except in some specific examples. For example, all TMs can be simulated by Boolean circuits whose size and depth complexity is bounded by a polynomial:
Theorem. Let t:N→N be a function with t(N)≥N. If an algorithm A∈TIME(t(N)), then A has circuit (size or depth) complexity O(t2(N)).
This is only an upper bound, and indeed the proof proceeds by directly simulating any given TM using a family of Boolean circuit CN where each circuit is constructed from a grid of t(N)×t(N) cells where each cell can only define at most a constant number of gates (where the constant is about the number of symbols on the tape alphabet times the number of states). Much more compact circuits can exist for a given machine, but some can be much better than O(t2(N))and others can be equal to Θ(t2(N)) on the nose so that no improvement is possible.
In general, this means that the actual algorithm selected when minimizing directly for TM steps or circuit complexity could be very different depending on which one is penalized (and assuming we had a computable penalization), because the outer objective and training task set can only communicate a finite number of bits through the training process, and so the selected “fastest” algorithms may not be extensionally equal. In particular, both the constants and the polynomial factors matter when selecting a fastest algorithm.
One last note is that even if an algorithm is the fastest solution to a problem in a given setting, say by circuit size, there could by pure happenstance be a completely different circuit (in the sense that it uses different sub-circuits and there is no circuit isomorphism) that produces the same outputs for the same inputs. This could also be the case not just for one given circuit Ci but also for the whole family of circuits {Ci} parameterized by the input bit-length. In the TM setting, there could be a collision between two different TMs with the same average run-time for a given input bit-length.
In other words, employing a speed prior is sensitive to the model of computation, and even under a fixed model the “fastest” algorithm refers to an equivalence class of algorithms that satisfy the minimality with respect to a speed prior. For the rest of the post, we will set aside these concerns and assume the strong conjecture that, assuming a speed prior forwards correctly, it is agnostic to both model of computation and underdetermination by minimality insofar as it is successful in averting deception.
Speed as computational volume
Is there a way to think about a speed prior more mechanistically? In this section we record a useful mental model for thinking about how to map between speed-minimal programs in different models of computation.
Imagine that we have an infinite napkin where we have recorded the trace of execution on every possible input on which an algorithm terminates, and that we have done this for every possible algorithm. This is doable because there are only a countably infinite number of algorithms and because every algorithm is only capable of processing a finite input before it terminates. In other words, the napkin contains every possible computational trace from any possible finite computation. Now we attempt to impose a computational volume measure on the napkin.
What does the contents of our napkin look like? If we represent every computational trace as the steps taken by a Turing Machine using some standard notation, then every trace contained on the napkin will be a sequence of state changes, one for each step taken by the machine. Assign the part of the napkin that represents one step the unit volume. Then the measure of a computational trace is exactly the speed we want to characterize: for a given input X to an algorithm A, we point to the part of the napkin that has the computational trace for A(X) and compute its volume/speed by summing up the steps.
In particular, we can view average and max run-time complexity as integration of the napkin with respect to a partition of the traces for a given algorithm. Let’s say the volume measure is represented by μ. We can then define:
Definition. Consider all finite inputs I(A)={X∈{0,1}<ω} for which an algorithm A terminates. Let P=∐Pi be a partition of I(A) in which each Pi is finite, for example by bit-length of the input such that X∈Pi iff |X|=i. Then the average run-time of A on Pi is avgPi(A)=∫PiAdμ/∫Pi1dμ (if A terminates on every i-length input, then the denominator is simply 2i). The max run-time of A on Pi is similarly given by maxX∈Pi(μ(A(X)). More generally, given the set μPi,A={μ(A(X))}X∈Pi, we can consider any statistic S:P(im(μ))→R (where P represents the "power set" of possible finite subsets drawn from im(μ)) and say the run-time with respect to S on Pi is SPi(A)=S(μPi,A).
For historical reasons, run-time complexity analysis focused on S∈{avg,max} and