EDIT 15.12.2025: This post is now superseded by this one. It was just a sketch I wrote up to clarify my own thinking, and contains many mistakes. The newer post is the finished product.

I think we may be able to prove that Bayesian learning on transformers^[1] or recurrent neural networks with a uniform^[2] prior over parameters is equivalent to a form of Solomonoff induction over a set of computationally-bounded programs. This bounded Solomonoff induction would still be 'approximately optimal' in a sense, being able to predict the data about as well any other bounded prediction procedure included in the set of programs it runs over. This proof would link Singular Learning Theory (SLT) back to basic Algorithmic Information Theory (AIT).

This post is my current early-stage sketch of the proof idea. Don't take it too seriously yet. I’m writing this out mostly to organise my own thoughts. I'd originally planned for it to be a shortform, but I think it ended up a bit too long for that.

Background:

I recently held a small talk presenting an idea for how and why deep learning generalises. Slides for the talk here, slide discussion here.

In the talk, I tried to reduce concepts from SLT^[3] back to AIT^[4]. I sketched a story about deep learning, or perhaps even learning more generally, that goes like this:

Bayesian learning on (recurrent) neural networks is equivalent to a form of Solomonoff induction running over a set of programs bounded in length, runtime and memory usage.
Using SGD/genetic algorithms/your-fancy-update-method-of-choice to train a neural network is then a cheap bargain bin^[5] approximation of Bayesian learning on the neural network. Training steps are biased to make simple updates rather than complex updates because exponentially more parameter configurations in the architecture correspond to simpler programs.

Now, I want to actually prove this story. Specifically, I want to prove the first part: That Bayesian learning on transformers or RNNs is equivalent to a computationally bounded form of Solomonoff Induction (SI), in a sense I want to make precise. I also want to show that this bounded SI is a sensible approximation of actual unbounded SI, given an assumption that the data is 'efficiently predictable'. That is, we assume that it is possible at all to predict the data to good-enough accuracy with a limited c...

Alexander Gietelink Oldenziel's Shortform

2