Epistemic status: intuitive speculation with scattered mathematical justification.

My goal here is to interrogate the dream of writing a beautiful algorithm for intelligence and thereby ensuring safety. For example: 

I don't know precisely what alternative he had in mind, but I only seem to remember reading clean functional programs from MIRI, so that is one possibility. Whether or not anyone else endorses it, that is the prototypical picture I have in mind as the holy grail of glass box AGI implementation. In my mind, it has careful comments explaining the precisely chosen, recursive prior and decision rules that make it go "foom." 

Is the "inscrutable" complexity of deep neural networks unavoidable? There is has been some prior discussion of the desire to avoid it as map-territory confusion, and I am not sure if that is true (though I have some fairly subtle suspicions). However, I want to approach this question from a different angle.

And most importantly, I want to ask what this has to do with AI safety. Would an elegant implementation of AGI be safe? What about an elegant and deeply understood implementation?[1]

In a previous essay, I considered the distinction between understanding or discovering core algorithms of intelligence and searching for optimality guarantees. It occurs to me that the former is really a positive, creative exercise while the latter is more negative. Or, perhaps it is more accurate to say that an optimal method must sit at the boundary of the positive and the negative; it is exactly as good as possible, and anything better would be impossible.[2]

Sometimes, there is no meaningfully optimal element in a class. For instance, there is no computable predictor which always learns to do as well as every other computable predictor on all sequences.[3] This is a failure of a relative notion of optimality. But perhaps it is more important to ask for good performance on some sequences than others. I think rationalists tend to (hyper?)focus on consistency-based conditions like "Bayes optimality," which I am not really sure can be viewed as optimality conditions at all (though they are certainly desirable to satisfy, all else held equal). Perhaps it is useful to think of the relative notions as external-facing and the consistency notions as internal-facing.  I don't want to be too specific here about what makes for a good optimality condition for intelligence, because I think that if I tried to be very specific, I would probably be wrong. But I think some kind of best-in-class property is probably what we mean by optimality, and I suspect that this ultimately (that is, terminally) wants to be externally oriented.
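
To make the impossibility claim concrete, here is the standard diagonal argument, in the spirit of Putnam's (footnote [3]): given any computable predictor $p$ that outputs a probability for the next bit, define the sequence

$$x_{n+1} = \begin{cases} 1 & \text{if } p(1 \mid x_{1:n}) < \tfrac{1}{2}, \\ 0 & \text{otherwise.} \end{cases}$$

This sequence is computable (just run $p$), yet $p$ assigns probability at most $\tfrac{1}{2}$ to each of its bits forever, while some other computable predictor - one that hard-codes this very construction - predicts it exactly.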

Optimality as a boundary

You can approach an optimal method from below or from above. From below, by inventing better and better methods. Sometimes, the class of methods under consideration has structure that makes it easy to construct an optimal method. Sometimes it does not. For instance, there are many algorithms for SAT, but it is not so easy to find an asymptotically fastest one. Each faster algorithm invented is a demonstration of what is possible. Optimality can also be approached from above, by trying to prove limits on the possible (exercise for the reader: prove that SAT cannot be solved in polynomial time).

Sometimes, the attitudes of the two different directions are sufficiently different that they form fields with different names - for instance, algorithms and complexity. In this case, the two often appear together with an "and" between them, for obvious reasons.  

 In general, there is no reason to believe that the set of possible things is "closed" - there may be an infinite sequence of slight improvements above any method. The possible and impossible do not need to touch. Existence proof: Blum's speedup theorem.
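
For reference, a rough statement of the theorem (I am paraphrasing; see a computability text for the exact quantifiers): for every total computable function $g$, there is a total computable predicate $f$ such that for every program $i$ computing $f$ there is another program $j$ computing $f$ with

$$g\big(t_j(x)\big) \;\le\; t_i(x) \qquad \text{for almost all inputs } x,$$

where $t_i(x)$ is the running time of program $i$ on input $x$. So for this $f$ there is no fastest program - not even up to the speedup factor $g$.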

Sometimes they do touch, at least in some sense; Solomonoff's universal distribution is an example, optimal among the lower semi-computable distributions. But this is arguably[4] a contrived class of methods. 
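
Concretely, the sense in which it touches the boundary is multiplicative dominance: for every lower semi-computable semimeasure $\mu$,

$$M(x) \;\ge\; c_\mu \, \mu(x) \qquad \text{for all finite strings } x,$$

where the constant $c_\mu$ can be taken to be roughly $2^{-K(\mu)}$. So $M$ loses at most a constant factor (depending on $\mu$ but not on $x$) against every competitor in the class.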

Approaching the boundary

I think that as intelligent algorithms approach the boundary, they tend to get really complicated. But I don't think they have to start out really complicated - a simple "kernel" should do. I have several independent reasons for believing this. Some of the reasons connect more or less directly / rigorously, and some of them apply more or less widely. Since the relationship between these two virtues seems to be inverse here, I don't think the case is closed.

Deep learning and more generally the bitter lesson. This is more of an intuitive argument. It seems that (large) neural nets become very, very complicated as they learn. In fact, if I recall correctly, Neel Nanda recently described them as "cursed" during a talk at LISA[5]. This seems to be reflective of a general trend towards increasing pragmatism (some might say pessimism) about truly understanding neural nets mechanistically.[6] It's notable that the algorithms for transformer training and inference are not actually very complicated (though they do not qualify as elegant either).
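
As a gesture at how short the outer loop really is, here is a toy decoder-only training sketch of my own (not anyone's production code; sizes, data, and hyperparameters are placeholders, and positional encodings are omitted for brevity):

```python
import torch
import torch.nn as nn

VOCAB, D_MODEL, N_LAYERS = 256, 64, 2  # placeholder sizes

class TinyLM(nn.Module):
    """Toy decoder-only LM: embed, causal self-attention blocks, unembed."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        block = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=N_LAYERS)
        self.unembed = nn.Linear(D_MODEL, VOCAB)

    def forward(self, tokens):  # tokens: (batch, seq) of token ids
        seq_len = tokens.shape[1]
        # Causal mask: -inf above the diagonal so each position only attends backwards.
        causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        return self.unembed(self.blocks(self.embed(tokens), mask=causal))

model = TinyLM()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
data = torch.randint(0, VOCAB, (8, 33))      # stand-in for tokenized text
inputs, targets = data[:, :-1], data[:, 1:]  # predict the next token

for _ in range(3):  # the entire "algorithm" is this loop, repeated many times
    logits = model(inputs)  # (batch, seq, vocab)
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, VOCAB), targets.reshape(-1))  # log-loss on next tokens
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The point is that whatever is "cursed" lives in the learned weights, not in this loop.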

No elegant theory of universal prediction. A predictor that can learn sequences up to some complexity must be about that complex itself, as proven by Shane Legg: https://arxiv.org/abs/cs/0606070. It's interesting to interpret this result in light of transformers. The transformer learning algorithm is fairly simple, but we don't actually care about their performance during training. When trained on a massive amount of text, transformers seem to be able to (learn to) learn a lot of things in context (particularly, during deployment) - in fact, it is their in-context learning which is often compared to Solomonoff induction. But by deployment, it is no longer a raw transformer that is learning; it is an LLM. And an LLM is a highly complicated object.  
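
In notation, and informally paraphrasing the paper (see it for the precise statement and error terms): if a predictor $p$ eventually predicts correctly every computable sequence of Kolmogorov complexity at most $n$, then $K(p) \gtrsim n$. Simple predictors can only learn simple sequences.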

A maximally compressed object looks like noise, in the sense that it (by definition) cannot be compressed further. This is a standard "result" in algorithmic information theory. Insofar as understanding is compression, I visualize acquiring deep knowledge as packing your brain with white noise - or what looks like it, from the outside. To the right machine, it is executable noise. See this related talk about the singularity by Marcus Hutter.
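
The standard argument, for reference: if $x^*$ is a shortest program for $x$, so that $|x^*| = K(x)$, then

$$K(x^*) \;\ge\; |x^*| - O(1),$$

because a description of $x^*$ that were shorter by more than a constant could be combined with the fixed instruction "run this program" to describe $x$ in fewer than $K(x)$ bits, a contradiction. Roughly speaking, such incompressible strings pass computable statistical tests for randomness, which is the sense in which they look like noise from the outside.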

...and Blum's speedup theorem, as mentioned above.

Implications for safety

I think it is possible to write a fairly short algorithm for intelligence. I think that algorithm learns and self-modifies, and eventually it looks like noise.

It's not clear that starting from a short algorithm actually buys you anything in terms of safety, once it has self-modified into noise. 

Safety must be a design property of the algorithm. Presumably, a safe algorithm has to be optimizing a thing that is safe to optimize. However, it may well be that a complicated algorithm, optimizing the same thing, is about as safe. It's not even clear that the simplicity of an algorithm makes it easier to determine what it is optimizing; there is a sense in which transformers are just optimizing their loss function, which is (during pretraining) exactly log-loss on next-token prediction. But that is not quite right, for reasons that can be roughly summed up as "inner optimization." However, I suspect that the simplest algorithms which eventually unfold into intelligence also have inner optimization problems.    
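
For concreteness, that pretraining objective is just

$$\mathcal{L}(\theta) \;=\; -\sum_{t} \log p_\theta(x_{t+1} \mid x_{\le t}),$$

averaged over the training corpus. The inner-optimization worry is that knowing this tells you what the training process selects for, not what the resulting model is itself optimizing at deployment time.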

I think we shouldn't expect to be able to implement an optimal algorithm for intelligence at all, and an elegant core engine for intelligence is not necessarily safe. Still, a deeper understanding of intelligence probably can lead to somewhat more elegant and hopefully much safer algorithms. The former, because some epicycles can be removed, leading to cleaner analysis. The latter, because the optimization target can be chosen wisely.

Still, it is important to be clear that AGI code golf is not inherently productive for AI safety. 

  1. ^

    Does the answer seem obvious to you? Of the people who think the answer is obvious, I weakly expect disagreement about both the answer and the reason(s). If you have a strong opinion, I'd appreciate comments both before and after reading the rest of the post. 

  2. ^

    I think I may have read this phrasing somewhere, but I can't recall where. 

  3. ^

    See Carnap's quest for a universal "confirmation rule" (in the modern context, usually viewed as a sequence prediction method) and Putnam's diagonal argument for impossibility.  

  4. ^

    See Professor Tom Sterkenburg's excellent thesis: Universal Prediction

  5. ^

    I am currently (as of writing) working out of LISA for ARENA 5.0.

  6. ^

    I feel slightly vindicated by this, which I have to admit is comforting since I have changed my mind about a lot of other things lately!
