If you've come here via 3Blue1Brown, hi! If you want to learn more about interpreting neural networks in general, here are some resources you might find useful:
This is a write-up of the Google DeepMind mechanistic interpretability team’s investigation of how language models represent facts. It is a sequence of 5 posts. We recommend prioritising post 1 and thinking of it as the “main body” of our paper, with posts 2 to 5 as a series of appendices to be skimmed or dipped into in any order.
Reverse-engineering circuits with superposition is a major unsolved problem in mechanistic interpretability: models use...
(Last revised: January 2026. See changelog at the bottom.)
Part of the “Intro to brain-like-AGI safety” post series.
Thus far in the series, Post #1 set out some definitions and motivations (what is “brain-like AGI safety” and why should we care?), and Posts #2 & #3 split the brain into a Learning Subsystem (cortex, striatum, cerebellum, amygdala, etc.) that “learns from scratch” using learning algorithms, and a Steering Subsystem (hypothalamus, brainstem, etc.) that is mostly genetically-hardwired and executes innate species-specific instincts and reactions.
Then in Post #4, I talked about the “short-term predictor”, a circuit which learns, via supervised learning, to predict a signal in advance of its arrival, but only by perhaps a fraction of a second. Post #5 then argued that if we form a closed...
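The “short-term predictor” idea can be sketched numerically. The following is a minimal toy sketch of my own (a linear unit trained with the delta rule, not the post’s actual circuit): the unit sees a context vector now and is trained, via supervised learning, to output the value a target signal will take a few timesteps later.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minimal "short-term predictor": a linear unit trained with
# the delta rule to predict a target signal `lag` steps before it arrives.
T, lag, lr = 2000, 3, 0.05
context = rng.normal(size=(T, 4))        # context features available now
w_true = np.array([1.0, -2.0, 0.5, 0.0])

signal = np.empty(T)
signal[lag:] = context[:-lag] @ w_true   # the signal echoes the context, 3 steps later
signal[:lag] = 0.0

w = np.zeros(4)
for t in range(T - lag):
    pred = context[t] @ w                # prediction made at time t...
    error = signal[t + lag] - pred       # ...compared to the signal at time t + lag
    w += lr * error * context[t]         # delta-rule (supervised) update

print(np.round(w, 2))                    # w converges toward w_true
```

The point of the sketch is only that a supervised error signal, arriving a fraction of a second after the prediction, suffices to train the predictor; nothing here depends on the specific linear form.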
I think “predict sensory input” is the main training signal for the Thought Generator, loosely analogous to how “predict next token” is the training signal for LLM pretraining. (Cf. §4.7.) So “predict sensory inputs” wouldn’t be a separate box from the Thought Generator, but rather a core function of the Thought Generator. Does that help? Sorry if I’m missing your point.
Around 10 years ago, a paper came out that arguably killed classical deep learning theory: Zhang et al.'s aptly titled Understanding deep learning requires rethinking generalization.
Of course, this is a bit of an exaggeration. No single paper ever kills a field of research on its own, and deep learning theory was not exactly the most productive and healthy field at the time this was published. And the paper didn't come close to addressing all theoretical approaches to understanding aspects of deep learning. But if I had to point to a single paper that shattered the feeling of optimism at the time, it would be Zhang et al. 2016.[1]
Believe it or not, this unassuming table rocked the field of deep learning theory back in 2016, despite probably involving
These models trained on random labels, are they uncertain or confidently wrong on the test data?
My model of what is going on here is that stochastic gradient descent is acting roughly like an MCMC sampling method: it's producing a random sample from the space of low-loss parameters. And the simpler hypotheses correspond to larger parameter-space volumes.
When the network needs to memorize, it needs to use nearly all of its parameters, meaning a small parameter-space volume. When the network is learning a pattern, it's only using a small fraction of its parame...
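The volume intuition can be illustrated with a toy numerical analogy (my own hypothetical setup, not a verification of the comment's claim, and with linear threshold units standing in for a network, and a parity labelling standing in for an unstructured one): among randomly drawn parameters, what fraction fit a simple pattern exactly, versus an unstructured labelling of the same inputs?

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy analogy for the volume argument: the region of parameter space that
# fits a simple pattern is non-negligible; the region that fits an
# unstructured (parity) labelling of the same 8 inputs is empty, since
# 3-bit parity is not linearly separable.
X = np.array([[i >> 2 & 1, i >> 1 & 1, i & 1] for i in range(8)], float)
y_pattern = X[:, 0]             # simple pattern: copy the first bit
y_parity = X.sum(axis=1) % 2    # unstructured: parity of the bits

def fit_fraction(y, n=200_000):
    # Monte Carlo estimate of the fraction of random linear threshold
    # units (weights and bias ~ N(0,1)) that fit all 8 labels exactly.
    w = rng.normal(size=(n, 3))
    b = rng.normal(size=n)
    pred = (X @ w.T + b > 0).astype(float)          # (8, n) predictions
    return np.mean(np.all(pred == y[:, None], axis=0))

f_pattern, f_parity = fit_fraction(y_pattern), fit_fraction(y_parity)
print(f_pattern, f_parity)      # pattern: nonzero volume; parity: zero
```

A sampler over low-loss parameters would therefore land on the pattern-fitting region with overwhelmingly higher probability, which is the comment's point in miniature.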
In this post, we describe a generalization of Markov decision processes (MDPs) and partially observable Markov decision processes (POMDPs) called crisp supra-MDPs and supra-POMDPs. The new feature of these decision processes is that the stochastic transition dynamics are multivalued, i.e. specified by credal sets. We describe how supra-MDPs give rise to crisp causal laws, the hypotheses of infra-Bayesian reinforcement learning. Furthermore, we discuss how supra-MDPs can approximate MDPs by a coarsening of the state space. This coarsening allows an agent to be agnostic about the detailed dynamics while still having performance guarantees for the full MDP.
Analogously to the classical theory, we describe an algorithm to compute a Markov optimal policy for supra-MDPs with finite time horizons. We also prove the existence of a stationary optimal policy for...
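The finite-horizon algorithm can be sketched as robust backward induction. The following is a minimal sketch under my own simplifying assumptions (not the post's exact formalism): finite states and actions, each credal set represented by finitely many candidate transition distributions, and a maximin ("worst case over the credal set") objective.

```python
import numpy as np

# Minimal sketch of backward induction for a finite-horizon crisp supra-MDP:
# 2 states, 2 actions; credal[s][a] lists candidate next-state distributions
# over {0, 1}; the agent maximises the worst-case (infimum) expected value.
credal = {
    0: {0: [np.array([0.9, 0.1]), np.array([0.6, 0.4])],
        1: [np.array([0.2, 0.8])]},
    1: {0: [np.array([0.5, 0.5]), np.array([0.3, 0.7])],
        1: [np.array([1.0, 0.0]), np.array([0.8, 0.2])]},
}
reward = np.array([[0.0, 1.0],   # reward[s, a]
                   [2.0, 0.0]])
H = 5                            # finite time horizon

V = np.zeros(2)                  # value at the final time step
policy = []
for t in reversed(range(H)):
    Q = np.empty((2, 2))
    for s in range(2):
        for a in range(2):
            # worst case over the credal set of transition distributions
            worst = min(p @ V for p in credal[s][a])
            Q[s, a] = reward[s, a] + worst
    policy.append(Q.argmax(axis=1))   # Markov (time-dependent) policy at t
    V = Q.max(axis=1)
policy.reverse()
print(V, policy[0])
```

Since the credal sets here are finite, the infimum is a `min` over a list; with general (closed convex) credal sets it becomes a minimisation over the set's extreme points.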
Shifting the losses by one time step doesn't really matter, since we're mostly interested in the shape of the regret bound which (up to mild changes in the constants) is not affected by this.
Yesterday, I wrote about the state of deep learning theory circa 2016,[1] as well as the bombshell 2016 paper by Zhang et al. that arguably signaled its demise. Today, I cover the aftermath, and the 2019 paper that devastated deep learning theory again.
As a brief summary, I argued that the rise of deep learning posed an existential challenge to the dominant theoretical paradigm of statistical learning theory, because the hypothesis class of neural networks is enormously complex by the standard capacity measures. The field's response was to attempt to quantify other ways in which the hypothesis class of neural networks in practice was simple, using alternative metrics of complexity. Zhang et al. 2016 showed that standard neural network architectures trained with standard training methods could memorize large quantities of randomly labelled data,...
Yep
Credal sets, a special case of infradistributions[1] in infra-Bayesianism and classical objects in imprecise probability theory, provide a means of describing uncertainty without assigning exact probabilities to events as in Bayesianism. This is significant because, as argued in the introduction to this sequence, Bayesianism is inadequate as a framework for AI alignment research. We will focus on credal sets rather than general infradistributions for simplicity of exposition.
Recall that the total-variation metric is one example of a metric on the set of probability distributions over a finite set $X$. A set is closed with respect to a metric if it contains all of its limit points with respect to that metric. For example, let $X = \{0, 1\}$. The set of probability distributions over $X$ is given by $\Delta(X) = \{(p_0, p_1) \in [0,1]^2 : p_0 + p_1 = 1\}$.
There is a bijection between $\Delta(X)$ and the closed interval $[0,1]$, which is...
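The total-variation metric and the bijection with the closed interval can be made concrete (in my own notation): over a two-element set, a distribution is a pair $(p, 1-p)$, the bijection sends it to $p$, and total variation is half the $\ell_1$ distance. A minimal sketch:

```python
import numpy as np

# Total-variation metric on distributions over a finite set (sketch):
# d_TV(p, q) = (1/2) * sum_x |p(x) - q(x)|
def tv(p, q):
    return 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()

# Over X = {0, 1}, a distribution is (p, 1 - p); the bijection with [0, 1]
# maps it to p, and d_TV reduces to |p - q| under that identification.
p, q = 0.25, 0.75
d = tv([p, 1 - p], [q, 1 - q])
print(d)
```

So on a two-element set the total-variation metric on $\Delta(X)$ is carried by the bijection to the ordinary metric on $[0,1]$.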
a subset? Why is it not just that product space? I'm assuming it's because this is a set of partial functions, but I don't see how taking a subset lets you account for that.
You're absolutely right, it should be a quotient space, not a subspace. In principle, it can be represented as a closed subspace of the product of copies of
In this case, as written, you don't need to say "An open set is then an arbitrary union of basis elements"
Actually, we do? For example, consider the space
However, thi...
A general guide for pursuing independent research, from conceptual questions like "how to figure out how to prioritize, learn, and think", to practical questions like "what sort of snacks should you buy to maximize productivity?"
I just read all of post 4
https://www.lesswrong.com/s/hpWHhjvjn67LJ4xXX/p/JRcNNGJQ3xNfsxPj4
There is nothing there about a toy model mapping pairs of integers.
What am I missing?