This post was written under Evan Hubinger’s direct guidance and mentorship, as a part of the Stanford Existential Risks Institute ML Alignment Theory Scholars (MATS) program.
TL;DR: There is some hope that by penalizing time complexity we could strongly select for non-deceptive models as they require time to figure out if they are out of training. Evan Hubinger argued in his post ‘Are minimal circuits deceptive?’ that we cannot fully rely on the speed prior to ensure non-deceptiveness by attempting to show that even minimal circuits, which we can intuitively understand as “fastest programs”, can be deceptive. This is because of the risk of spontaneous meta-learning which doesn’t inherit the safety guarantees that we would hope to get from the speed prior. This argument could be wrong if meta-learning is never the fastest way to solve a problem or if spontaneous meta-learning is constrained by the speed prior.
A possible approach to tackling inner alignment failures is to shape the model’s inductive biases to make them unlikely. The two most natural inductive biases are the simplicity prior and the time complexity/speed prior. Increasing the number of parameters of the model is conducive to the simplicity prior as you are increasing the possible number of models from which SGD can optimize for the simplest model whereas decreasing the number of parameters is promoting the speed prior. 
In ‘Risks from Learned Optimization’ the concept of deceptive alignment is presented. That is, a model is said to be deceptively aligned if during training it pretends to be aligned in order to defect later on once it’s deployed. Paul Christiano previously argued that in the limit of perfect capabilities the simplicity prior can result in deceptive models. That is because, between the models that are capable of deception, the ones with less complex (unaligned) objectives are simpler than the aligned ones which have a more complex objective. He then left the open question of whether penalizing time complexity can lead to a deceptive model or not.
Deceptive models have a particular property: they need to spend time figuring out if they are out of the training or not. The hope would be that by penalizing time complexity the model wouldn’t have time to figure out if it is out of training and therefore it would not be capable of deception.
Minimal circuits can be deceptive
Evan Hubinger attempted to resolve this open question in his post ‘Are minimal circuits deceptive?’ by thinking in terms of minimal circuits which we can intuitively understand as “fastest programs”. The conclusion of his argument is that even the fastest programs that solve a set of tasks can be deceptive, potentially crushing the hope that strongly penalizing time complexity can ensure non-deceptive models.
To understand why one has to first understand mesa-optimization. It is a neologism introduced in ‘Risks from Learned Optimization’ which means that a learned model is also an optimiser, in other words, if your optimizer (base-optimizer) learns another optimizer, then the learned optimizer is the mesa-optimizer.
Now consider the possibility that your learning algorithm learns a specific type of optimizer: another learning algorithm (meta-learning). There are some cases in which the fastest way to solve a problem (set of tasks) could be to do meta-learning, in fact, that’s what we do. It seems likely that for a problem like classifying objects needed for an average human world model, meta-learning is the fastest way to do that.
There is one issue with this though: the fastest meta-learning algorithms don’t necessarily learn the fastest models, which would eliminate the safety guarantees we hoped to get from the speed prior. If you try to find the fastest way to write a program, you cannot guarantee that the program you'll write will itself be fast. Similarly, the fastest way to do meta-learning doesn’t guarantee that the resulting model will be fast. Even if theoretically the fastest possible model that solves a specific task cannot be deceptive, the fastest possible model that solves a complex enough set of tasks might implement a meta-learning algorithm that learns a slower model that can be deceptive.
It is important to note that affirming the existence of a minimal circuit that is able to be deceptive is trivial if we try to use it to solve a problem that requires deception. This argument applies to all of the other cases in which solving the problem doesn’t necessarily require deception, making the result non-trivial.
Ways this could be wrong
Meta-learning is never the fastest way to solve a problem
One could argue that meta-learning is needed in practice just because we don’t know how to solve a task but if we did, implementing the solution directly would always be faster than using meta-learning to find it. This essentially asks why wouldn’t the fastest model be a GLUT (Giant Lookup Table) always looking for each solution and giving it as output given the correct input instead of implementing meta-learning to find the solutions.
One potential answer is that initially, it would look like a lookup table, but spontaneous meta-learning could still be favoured by the speed prior later on. As the number of tasks increases, so do the number of elements in the lookup table, that is, GLUT’s description length grows linearly with the number of tasks and its time complexity grows logarithmically. In the case of using a meta-learning algorithm to find the solutions, the description length will grow logarithmically and its time complexity will be constant at first, but similarly to the GLUT, it will have an approximately logarithmic time complexity in the limit as a finite neural network can learn to solve a finite number of tasks and the only way of allowing it to learn more is by making it bigger - increasing the number of parameters.
To understand why spontaneous meta-learning could be favoured one has to get a better understanding of what the speed prior is: a combination of time complexity and description length. Given that meta-learning probably has the same time complexity in the limit with the GLUT, but a shorter description length, the speed prior will favour spontaneous meta-learning. This means that even if initially the fastest model would look like a GLUT, when it has to solve general enough problems that involve a lot of tasks, spontaneous meta-learning could still be faster.
A possible way this solution wouldn’t hold is if we consider the case of a compressed lookup table. Meta-learning could have a shorter description length because it compresses knowledge as opposed to a classical lookup table which adds a new member for each new corresponding input. A compressed lookup table could potentially have a shorter description length than GLUT, even one that grows logarithmically, implying that the speed prior wouldn’t necessarily favour meta-learning, but this requires further investigation.
Spontaneous meta-learning is constrained by the speed prior
Another way this argument could not hold is if spontaneous meta-learning is constrained by the speed prior. The argument discussed relies on one key fact: the fastest meta-learning algorithms don’t necessarily learn the fastest models. While this is true in theory, there is a possibility that the speed prior during the training process would constrain both the base-optimizer (meta-learning) and the mesa-optimizer (learning algorithm), bringing back the safety guarantees we hoped to get from using the speed prior.
To see how this could happen, imagine SGD selects for a mesa-optimizer, and we implement the speed prior by penalizing time complexity in the loss function. The mesa-optimizer can be modified by SGD (driven by the loss function) which means that the mesa-optimizer could be constrained by the loss function that penalizes time complexity, resulting in a mesa-optimizer that is constrained by the speed prior. Different ways of implementing the speed prior could potentially avoid this counterexample.
This argument can have important implications for safety. On top of the fact that, as Paul Christiano argued, it is possible we cannot rely on the simplicity prior to avoid deceptive models, Evan Hubinger argued that we cannot rely on the speed prior either:
“this counterexample suggests that in situations involving spontaneous meta-learning we shouldn’t necessarily expect time complexity regularization to guarantee inner alignment, as any meta-learning that has to be done won’t inherit any safety guarantees”
I think this is important as the speed prior was considered to be, and still is by many, a very good candidate for a way of not producing deceptive models. It might still be a part of the solution for avoiding inner alignment failures, but Evan Hubinger's argument suggests that it is not the entire solution.
 (Neural networks are fundamentally Bayesian - Medium, 2020)
 (Inductive biases stick around - LessWrong, 2019)
 (Risks from Learned Optimization in Advanced Machine Learning Systems - arXiv, 2019)
 (On Characterizing the Capacity of Neural Networks using Algebraic Topology - arXiv, 2018)
 (The Speed Prior: A New Simplicity Measure Yielding Near-Optimal Computable Predictions - SpringerLink, 2002)