*This post was written under Evan Hubinger’s direct guidance and mentorship, as part of the __Stanford Existential Risks Institute ML Alignment Theory Scholars (MATS) program__.*

**TL;DR:** There is some hope that by penalizing time complexity we could strongly select for non-deceptive models, since deceptive models need extra time to figure out whether they are out of training. In his post __‘Are minimal circuits deceptive?’__, Evan Hubinger argued that we cannot fully rely on the speed prior to ensure non-deceptiveness: even minimal circuits, which we can intuitively understand as “fastest programs”, can be deceptive. The reason is the risk of spontaneous meta-learning, which doesn’t inherit the safety guarantees we would hope to get from the speed prior. This argument could fail if meta-learning is never the fastest way to solve a problem, or if spontaneous meta-learning is itself constrained by the speed prior.

## Introduction

A possible approach to tackling inner alignment failures is to shape the model’s inductive biases to make such failures unlikely. The two most natural inductive biases are the simplicity prior and the time complexity/speed prior. Increasing the number of parameters of the model is conducive to the simplicity prior, since it enlarges the space of models from which SGD can select the simplest one, whereas decreasing the number of parameters promotes the speed prior.[1][2]

In ‘Risks from Learned Optimization’[3] the concept of deceptive alignment is introduced: a model is said to be *deceptively aligned* if during training it pretends to be aligned in order to defect later on, once it’s deployed. Paul Christiano __previously argued__ that in the limit of perfect capabilities the simplicity prior can result in deceptive models. That is because, among the models capable of deception, the ones with less complex (unaligned) objectives are simpler than the aligned ones, which have a __more complex__ objective. He left open the question of whether penalizing time complexity can also lead to deceptive models.

Deceptive models have a particular property: they need to spend time figuring out whether or not they are out of training. The hope is that by penalizing time complexity, the model wouldn’t have time to make that check and would therefore be incapable of deception.

## Minimal circuits can be deceptive

Evan Hubinger attempted to resolve this open question in his post __‘Are minimal circuits deceptive?’__ by thinking in terms of __minimal circuits__, which we can intuitively understand as “fastest programs”. His conclusion is that even the fastest programs that solve a set of tasks can be deceptive, potentially crushing the hope that strongly penalizing time complexity ensures non-deceptive models.

To understand why, one first has to understand mesa-optimization, a term introduced in ‘Risks from Learned Optimization’ for the case in which a learned model is itself an optimizer: if your optimizer (the base optimizer) learns another optimizer, then the learned optimizer is the mesa-optimizer.

Now consider the possibility that your learning algorithm learns a specific type of optimizer: another learning algorithm (meta-learning). In some cases the fastest way to solve a problem (a set of tasks) could be to do meta-learning; in fact, that’s what humans do. For a problem like classifying all the objects needed for an average human world model, meta-learning is plausibly the fastest approach.

There is one issue with this, though: the fastest meta-learning algorithms don’t necessarily learn the fastest models, which would eliminate the safety guarantees we hoped to get from the speed prior. If you find the fastest way to write a program, there is no guarantee that the resulting program is itself fast. Similarly, the fastest way to do meta-learning doesn’t guarantee that the learned model will be fast. Even if, in theory, the fastest possible model that solves a specific task cannot be deceptive, the fastest possible model that solves a complex enough set of tasks might implement a meta-learning algorithm that learns a slower model that can be deceptive.

Note that affirming the existence of a deceptive minimal circuit is trivial if the problem we use it to solve itself requires deception. The argument above applies to all the other cases, in which solving the problem doesn’t require deception, and that is what makes the result non-trivial.

## Ways this could be wrong

### Meta-learning is never the fastest way to solve a problem

One could __argue__ that meta-learning is needed in practice only because we don’t know how to solve a task; if we did, implementing the solution directly would always be faster than using meta-learning to find it. This essentially asks: why wouldn’t the fastest model be a GLUT (Giant Lookup Table) that simply looks up the solution for each input, rather than implementing meta-learning to find the solutions?

One potential answer is that initially the fastest model would look like a lookup table, but spontaneous meta-learning could still be favoured by the speed prior later on. As the number of tasks increases, so does the number of entries in the lookup table: the GLUT’s description length grows linearly with the number of tasks, and its time complexity grows logarithmically. A meta-learning algorithm that finds the solutions instead has a description length that grows only logarithmically. Its time complexity is constant at first, but in the limit it becomes approximately logarithmic as well, since a finite neural network can only learn to solve a finite number of tasks, and the only way to let it learn more is to make it bigger by increasing the number of parameters.[4]
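The contrast in description-length growth can be illustrated with a toy experiment. This is only a sketch: the “tasks” are stand-ins (computing squares), and the serialized JSON size is a crude proxy for description length.

```python
import json

def glut(n_tasks):
    """A literal lookup table: one stored entry per task."""
    return {str(i): i * i for i in range(n_tasks)}

# The GLUT's description length (serialized size) grows roughly
# linearly with the number of tasks it covers...
glut_small = len(json.dumps(glut(100)))
glut_large = len(json.dumps(glut(10_000)))

# ...while a program that *computes* the answers has a fixed
# description length no matter how many tasks it covers.
program = "lambda i: i * i"
program_len = len(program)

print(glut_small, glut_large, program_len)
```

Of course, a real meta-learner is not a fifteen-character lambda, but the asymptotic point stands: stored entries scale with the task count, while computed solutions need only the code (plus, per the paragraph above, parameters growing roughly logarithmically).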

To understand why spontaneous meta-learning could be favoured, one has to get a better picture of what the speed prior is: a combination of time complexity and description length.[5] Given that meta-learning probably has the same time complexity as the GLUT in the limit, but a shorter description length, the speed prior will favour spontaneous meta-learning. This means that even if the fastest model initially looks like a GLUT, when it has to solve general enough problems involving many tasks, spontaneous meta-learning could still be faster.
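One standard way to formalize “a combination of time complexity and description length” is a Levin-style cost: a program’s length in bits plus the log of its running time. The sketch below plugs the growth rates from the paragraph above into that cost; all constants (bits per entry, meta-learner code size) are made up purely for illustration.

```python
import math

def glut_cost(n_tasks, bits_per_entry=64):
    """Levin-style cost of a GLUT: linear description length,
    logarithmic lookup time."""
    description = bits_per_entry * n_tasks
    time = math.log2(n_tasks)
    return description + math.log2(time)

def meta_cost(n_tasks, code_bits=10_000):
    """Cost of a meta-learner: fixed code plus parameters growing
    logarithmically, with (in the limit) logarithmic time too."""
    description = code_bits + 64 * math.log2(n_tasks)
    time = math.log2(n_tasks)
    return description + math.log2(time)

# With equal time complexity, description length decides the
# comparison: the GLUT wins for few tasks, the meta-learner at scale.
print(glut_cost(16) < meta_cost(16))
print(glut_cost(10**6) > meta_cost(10**6))
```

Since the time terms are identical here by assumption, the crossover is driven entirely by the linear-versus-logarithmic description lengths, which is the crux of the argument in the text.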

A possible way this answer could fail is the case of a compressed lookup table. Meta-learning could have a shorter description length because it compresses knowledge, as opposed to a classical lookup table, which adds a new entry for each new input. But a compressed lookup table could potentially have a description length shorter than the raw GLUT’s, perhaps even one that grows logarithmically, in which case the speed prior wouldn’t necessarily favour meta-learning. This requires further investigation.
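The basic intuition that a table with regular structure compresses far below its raw size is easy to check. This sketch uses zlib as a stand-in generic compressor and an arbitrary squares table as the redundant data; neither is meant as a model of the actual learned representation.

```python
import json
import zlib

# A lookup table with highly regular structure, serialized to bytes.
table = json.dumps({str(i): i * i for i in range(10_000)}).encode()

compressed = zlib.compress(table, level=9)

# A generic compressor already shrinks the regular table substantially;
# but unlike a program that computes the entries on demand, the
# compressed form still grows as entries are added.
print(len(table), len(compressed))
```

The open question from the text is whether such compression can ever match the logarithmic growth of a meta-learner's description length, not merely beat the raw table.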

### Spontaneous meta-learning is constrained by the speed prior

Another way the argument could fail is if spontaneous meta-learning is itself constrained by the speed prior. The argument discussed relies on one key fact: the fastest meta-learning algorithms don’t necessarily learn the fastest models. While this is true in theory, the speed prior applied during the training process might constrain both the base optimizer (the meta-learner) and the mesa-optimizer (the learned algorithm), bringing back the safety guarantees we hoped to get from using the speed prior.

To see how this could happen, imagine SGD selects for a mesa-optimizer, and we implement the speed prior by penalizing time complexity in the loss function. Since SGD, driven by that loss, can modify the mesa-optimizer, the time-complexity penalty constrains the mesa-optimizer too, so it ends up subject to the speed prior as well. Different ways of implementing the speed prior could potentially avoid this counterexample.
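One simple way to picture “penalizing time complexity in the loss function” is an explicit penalty on computation steps, in the spirit of adaptive-computation-time methods. This is a toy sketch, not the post’s actual proposal: the error curve `1/steps` and the candidate step range are invented for illustration.

```python
def total_loss(steps, lam, task_error=lambda s: 1.0 / s):
    """Task loss plus a speed-prior-style penalty on computation steps."""
    return task_error(steps) + lam * steps

def best_step_count(lam, max_steps=10):
    """The step count that a selection process driven by this loss
    would prefer under time penalty lam."""
    return min(range(1, max_steps + 1), key=lambda s: total_loss(s, lam))

# A weak time penalty tolerates slow, thorough computation;
# a strong one selects for fast models.
print(best_step_count(lam=0.01))  # prefers many steps
print(best_step_count(lam=0.5))   # prefers few steps
```

The point of the counterexample response is that the same penalty term shapes anything SGD can modify, including a mesa-optimizer, so the selected model is pushed toward the fast end regardless of where the optimization happens.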

## Conclusion

This argument can have important implications for safety. On top of the fact that, as Paul Christiano argued, it is possible we cannot rely on the simplicity prior to avoid deceptive models, Evan Hubinger argued that we cannot rely on the speed prior either:

> “this counterexample suggests that in situations involving spontaneous meta-learning we shouldn’t necessarily expect time complexity regularization to guarantee inner alignment, as any meta-learning that has to be done won’t inherit any safety guarantees”

I think this is important, as the speed prior was considered, and still is by many, a very good candidate for avoiding deceptive models. It might still be part of the solution for avoiding inner alignment failures, but Evan Hubinger’s argument suggests that it is not the entire solution.

## References

[1] Neural networks are fundamentally Bayesian (Medium, 2020). __https://towardsdatascience.com/neural-networks-are-fundamentally-bayesian-bee9a172fad8__

[2] Inductive biases stick around (LessWrong, 2019). __https://www.lesswrong.com/posts/nGqzNC6uNueum2w8T/inductive-biases-stick-around__

[3] Risks from Learned Optimization in Advanced Machine Learning Systems (arXiv, 2019). __https://arxiv.org/abs/1906.01820__

[4] On Characterizing the Capacity of Neural Networks using Algebraic Topology (arXiv, 2018). __https://arxiv.org/abs/1802.04443__

[5] The Speed Prior: A New Simplicity Measure Yielding Near-Optimal Computable Predictions (SpringerLink, 2002). __https://link.springer.com/chapter/10.1007/3-540-45435-7_15__

## Comments

Why isn't the learned 'compression' algorithm itself an inner algorithm that the meta-learner is learning, and thus incentivized? "Compression = intelligence."

A 'dumb' off-the-shelf compressor like xz isn't going to be able to do much with a few petabytes of chess endgame databases; but there are generic algorithms which can compute endgame databases, and those programs are more like kilobytes in size. A space-time-weighted algorithm would then probably look something like endgame databases for the most frequent endgames (megabytes), plus on-the-fly dynamic programming to compute everything missing, with speedups from the cached databases.

This is generic to many board games, and so a meta-learner tasked with many board games should learn the template and how to specialize it to specific games.

A GLUT can have constant time complexity using a hash table, which makes it a lot less clear that meta-learning can be faster.

Minimal circuits are not quite the same as fastest programs: they have no adaptive computation, so you can't, e.g., memorize a big table but only use part of it. In some sense it's just one more (relatively extreme) way of making a complexity-speed tradeoff. I basically agree that a GLUT is always faster than meta-learning if you have arbitrary adaptive computation.

That said, I don't think it's totally right to call a GLUT constant complexity: if you have an n-bit input and an m-bit output, then it takes at least n+m operations to compute the GLUT (in any reasonable low-level model of computation).

There are even more speed-focused methods than minimal circuits. I think the most extreme versions are more like query complexity or communication complexity, which in some sense just ask how fast you can make your GLUT: can you get away with reading only a small set of input bits? But being totally precise about "fastest" requires being a lot more principled about the model of computation.

Any n-bit hash function will produce collisions once the number of elements in the hash table gets large enough (after the number of possible hashes stored in n bits has been reached), so adding new elements will require rehashing to avoid collisions, giving the GLUT logarithmic time complexity in the limit. Meta-learning can also have constant time complexity for an arbitrarily large number of tasks, but not in the limit, assuming a finite neural network.