Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.
Audio version here (may not be up yet).
Please note that while I work at DeepMind, this newsletter represents my personal views and not those of my employer.
Why does deep and cheap learning work so well? (Henry W. Lin et al) (summarized by Rohin): We know that the success of neural networks must be at least in part due to some inductive bias (presumably towards “simplicity”), based on the following empirical observations:
1. Neural networks with mere millions of parameters work well with high-dimensional inputs such as images, despite the fact that, speaking loosely, there are exponentially more functions from images to classifications than there are functions expressible by million-parameter neural networks.
2. Neural networks learn solutions that generalize well even in the overparameterized regime, where statistical learning theory would predict that they overfit.
3. Relatedly, neural networks learn solutions that generalize well, despite the fact that they can memorize a randomly labeled training dataset of the same size.
Can we say more about this inductive bias towards simplicity? This paper tackles this question from the perspective of the first empirical observation: what is it about neural networks and/or reality such that relatively small neural networks can still learn the “correct” function? We can’t appeal to the fact that neural networks are universal function approximators, because that theorem doesn’t put a bound on the size of the neural network. The core idea of this paper is that any function that we care to model with neural networks in practice tends to be quite simple: in particular, it can often be expressed as a polynomial plus a few extra things.
Typically, we’re interested in modeling the relationship between some latent class y and some detailed observations or data x. For example, y might be a concept like “cat” or image labels more broadly, while x might be specific natural images. In this case the causal structure in reality looks like y → x. In our example, there is first an actual cat (y), and then via the physics of light and cameras we get the image of the cat (x).
Given this setup, why are functions of interest typically “just polynomials”? Well, thanks to Taylor expansions, all (smooth) functions can be expressed as infinite polynomials, so let’s rephrase the question: why are they polynomials with only a few terms?
The negative log probability -ln p(x | y) is called the Hamiltonian in statistical physics. There are lots of reasons you might expect that the Hamiltonian is a simple low order polynomial:
1. The Hamiltonians of several fundamental physical laws are polynomials of order 2-4. A polynomial of order d can have at most O(n^d) terms (where n is the number of input variables in the polynomial).
2. The Gaussian distribution (often created in reality thanks to the Central Limit Theorem) has a quadratic Hamiltonian (i.e. order 2).
3. Most functions of interest have a locality property: things only directly affect what is in their immediate vicinity. This causes almost all of the coefficients in the Taylor series to vanish.
4. Many functions have symmetry properties that can further reduce the number of parameters needed to specify them.
One might respond that while this could be true for simple functions like predicting the sum of independent events, this wouldn’t apply for the complex functions like “cat” → cat image. Here the authors appeal to hierarchy: in practice, the world is very hierarchical, and complex functions can usually be broken down into sequences of simpler ones. If we agree that the simple ones can be implemented with simple polynomials, then a deep neural network could simply learn the same sequence of operations (here the depth of the network is used to chain the operations one after the other).
So far we’ve argued that generative models p(x | y) tend to be simple polynomials. What about discriminative models p(y | x)? Well, if we can implement the Hamiltonian -ln p(x | y), then there is a simple way to get p(y | x): we simply calculate the Hamiltonian for all possible y, and then add in the prior probabilities -ln p(y) (which can be done through the bias term of the logit layer), and apply a softmax layer to the result. Indeed, the softmax layer at the end is best practice in ML for creating such models. In addition, in the case of a hierarchical sequence of steps, we can invert that sequence of steps and throw away unnecessary information at each step.
Okay, so far we’ve argued that the functions we care about learning can be expressed with polynomials with relatively few terms (in particular, not an exponential number of terms). What does this have to do with neural networks? It turns out that neural networks can express polynomials quite easily. In particular, the authors show:
1. Multiplication of two real numbers can be approximated arbitrarily well by a neural network with a hidden layer containing 4 neurons.
2. As a result, any given multivariate polynomial can be approximated arbitrarily well by a (potentially deep) neural network of size a little larger than 4 times the number of multiplications needed to evaluate the polynomial.
The authors also show that depth is required for the second result: for a single-layer neural network to multiply n inputs arbitrarily well, it must have at least 2^n neurons (under the assumption that the nonlinear activation function is smooth).
Rohin's opinion: While I really enjoyed this paper, I would caution against interpreting it too broadly. If we are to interpret this as a satisfactory answer to our first empirical puzzle, we’d have to say something like “reality tends to be expressible via polynomials, and neural networks tend to learn those polynomials because that is something they can do”. As the paper itself notes, just because reality is determined with low-order Hamiltonians doesn’t mean that given a subset of the information we can get by with a polynomial approximation. In addition, my guess is that if we peered into the internals of the neural networks, it would not be the case that they were calculating the sorts of polynomials that this paper talks about; rather, they would be learning some heuristics that provide some amount of evidence, and combining all these heuristics together leads to a function that is correct the majority of the time. So it’s not clear that this paper really answers our first empirical puzzle.
What I especially liked about this paper was that it analyzed the set of functions we care about (aka functions about reality) and asked what properties of reality made it such that neural networks tended to work well at approximating these functions. Note that this is similar to the common hypothesis in machine learning that the functions we are trying to learn lie on a low-dimensional manifold in a high-dimensional space. This seems like an important direction of research in understanding what neural networks do, and this paper seems like a good example of what such research could look like. I’d be excited to see similar research in the future.
The OTHER AI Alignment Problem: Mesa-Optimizers and Inner Alignment (Robert Miles) (summarized by Rohin): This video is a great explanation of the mesa optimization paper (AN #58).
Rohin's opinion: In general, I recommend Rob’s channel for video explanations of AI alignment concepts -- it doesn’t get as much attention in this newsletter as it should, just because I personally dislike audio as a medium of communication (I much prefer to read). (Rob is also the producer of the podcast for this newsletter, so you might be listening to him right now!)
AXRP #4 - Risks from Learned Optimization (Daniel Filan and Evan Hubinger) (summarized by Rohin): This podcast delves into a bunch of questions and thoughts around mesa optimization (AN #58). Here are some of the points that stood out to me (to be clear, many of these have been covered in this newsletter before, but it seemed worth it to state them again):
- A model is a mesa optimizer if it is a mechanistic optimizer, that is, it is executing an algorithm that performs search for some objective.
- We need to focus on mechanistic optimizers instead of things that behave as though they are optimizing for some goal, because those two categories can have very different generalization behavior, and we are primarily interested in how they will generalize.
- Humans do seem like mesa optimizers relative to evolution (though perhaps not a central example). In particular, it seems accurate to say that humans look at different possible strategies and select the ones which have good properties, and thus we are implementing a mechanistic search algorithm.
- To reason about whether machine learning will result in these mechanistic optimizers, we need to reason about the inductive biases of machine learning. We mostly don’t yet know how likely they are.
- Evan expects that in powerful neural networks there will exist a combination of neurons that encode the objective, which we might be able to find with interpretability techniques.
- Even if training on a myopic base objective, we might expect the mesa objective to be non-myopic, as the non-myopic objective "pursue X" is simpler than the myopic objective "pursue X until time T".
- We can’t rely on generalization bounds to guarantee performance, since in practice there is always some distribution shift (which invalidates those bounds).
- Although it is usually phrased in the train/test paradigm, mesa optimization is still a concern in an online learning setup, since at every time we are interested in whether the model will generalize well to the next data point it sees.
- We will probably select for simple ML models (in the sense of short description length) but not for low inference time, such that mechanistic optimizers are more likely than models that use more space (the extreme version being lookup tables).
- If you want to avoid mesa optimizers entirely (rather than aligning them), you probably need to have a pretty major change from the current practice of AI, as with STEM AI and Microscope AI (explained here (AN #102)).
- Even in a CAIS scenario (AN #40) where we have (say) a thousand models doing different tasks, each of those tasks will still likely be complex enough to lead to the models being mesa optimizers.
- There are lots of mesa objectives which would lead to deceptive alignment relative to corrigible or internalized alignment, and so we should expect deceptive alignment a priori.
Formal Solution to the Inner Alignment Problem (Michael K. Cohen et al) (summarized by Rohin): Since we probably can’t specify a reward function by hand, one way to get an agent that does what we want is to have it imitate a human. As long as it does this faithfully, it is as safe as the human it is imitating. However, in a train-test paradigm, the resulting agent may faithfully imitate the human on the training distribution but fail catastrophically on the test distribution. (For example, a deceptive model might imitate faithfully until it has sufficient power to take over.) One solution is to never stop training, that is, use an online learning setup where the agent is constantly learning from the demonstrator.
There are a few details to iron out. The agent needs to reduce the frequency with which it queries the demonstrator (otherwise we might as well just have the demonstrator do the work). Crucially, we need to ensure that the agent will never do something that the demonstrator wouldn’t have done, because such an action could be arbitrarily bad.
This paper proposes a solution in the paradigm where we use Bayesian updating rather than gradient descent to select our model, that is, we have a prior over possible models and then when we see a demonstrator action we update our distribution appropriately. In this case, at every timestep we take the N most probable models, and only take an action a with probability p if every one of the N models takes that action with at least probability p. (There’s a specific rule that ensures that N decreases over time.) The total probability of all the actions will typically be less than 1 -- the remaining probability is assigned to querying the demonstrator.
The key property here is that as long as the true demonstrator is in the top N models, then the agent never autonomously takes an action with more probability than the demonstrator would. Therefore, as long as we believe the demonstrator is safe, the agent should be as well. Since the agent learns more about the demonstrator every time it queries them, over time it needs to query the demonstrator less often. Note that the higher N is, the more likely it is that the true model is one of those N models (and thus we have more safety), but also the more likely it is that we will have to query the demonstrator. This tradeoff is controlled by a hyperparameter α that implicitly determines N.
Read more: Paper: Fully General Online Imitation Learning
Rohin's opinion: One of the most important approaches to improve inner alignment is to monitor the performance of your system online, and train to correct any problems. This paper shows the benefit of explicitly quantified, well-specified uncertainty: it allows you to detect problems before they happen and then correct for them.
This setting has also been studied in delegative RL (AN #57), though there the agent also has access to a reward signal in addition to a demonstrator.
Is SGD a Bayesian Sampler? Well, almost. (Chris Mingard et al) (summarized by Zach): Neural networks have been shown empirically to generalize well in the overparameterized setting, which suggests that there is an inductive bias for the final learned function to be simple. The obvious next question: does this inductive bias come from the architecture and initialization of the neural network, or does it come from stochastic gradient descent (SGD)? This paper argues that it is primarily the former.
Specifically, if the inductive bias came from SGD, we would expect that bias to go away if we replaced SGD with random sampling. In random sampling, we sample an initialization of the neural network, and if it has zero training error, then we’re done, otherwise we repeat.
The authors explore this hypothesis experimentally on the MNIST, Fashion-MNIST, and IMDb movie review databases. They test on variants of SGD, including Adam, Adagrad, and RMSprop. Since actually running rejection sampling for a dataset would take way too much time, the authors approximate it using a Gaussian Process. This is known to be a good approximation in the large width regime.
Results show that the two probabilities are correlated over a wide order of magnitudes for different architectures, datasets, and optimization methods. While correlation isn't perfect over all scales, it tends to improve as the frequency of the function increases. In particular, the top few most likely functions tend to have highly correlated probabilities under both generation mechanisms.
Read more: Alignment Forum discussion
Zach's opinion: Fundamentally, the point here is that generalization performance is explained much more by the neural network architecture rather than the structure of stochastic gradient descent, since we can see that stochastic gradient descent tends to behave similarly to (an approximation of) random sampling. The paper talks a bunch about things like SGD being (almost) Bayesian and the neural network prior having low Kolmogorov complexity; I found these to be distractions from the main point. Beyond that, approximating the random sampling probability with a Gaussian process is a fairly delicate affair and I have concerns about the applicability to real neural networks.
One way that SGD could differ from random sampling is that SGD will typically only reach the boundary of a region with zero training error, whereas random sampling will sample uniformly within the region. However, in high dimensional settings, most of the volume is near the boundary, so this is not a big deal. I'm not aware of any work that claims SGD uniformly samples from this boundary, but it's worth considering that possibility if the experimental results hold up.
Rohin’s opinion: I agree with Zach above about the main point of the paper. One other thing I’d note is that SGD can’t have literally the same outcomes as random sampling, since random sampling wouldn’t display phenomena like double descent (AN #77). I don’t think this is in conflict with the claim of the paper, which is that most of the inductive bias comes from the architecture and initialization.
Other work by the same group provides some theoretical and empirical arguments that the neural network prior does have an inductive bias towards simplicity. I find those results suggestive but not conclusive, and am far more persuaded by the paper summarized here, so I don’t expect to summarize them.
Meta-learning of Sequential Strategies (Pedro A. Ortega et al) (summarized by Rohin): This paper explains theoretically how to structure meta-learning such that it is incentivized to learn optimal solutions to sequence-prediction and decision-making tasks. The core idea is to define a distribution over tasks, and then sample a new task at the beginning of each episode that the agent must then handle. Importantly, the agent is not told what the task is, and so must infer it from observations. As long as you structure the loss function appropriately, the optimal policy for the agent is to maintain a prior over the task that is updated via Bayes Rule after each observation.
Of course, since the agent is actually a neural net with memory, it does not explicitly perform Bayes Rule, but rather learns a set of weights that instantiate an update rule that effectively approximates Bayes Rule for the given task distribution. Since this update rule only needs to work on the specific task distribution being meta-trained on, it can be made significantly more efficient than a full-blown Bayes Rule, and thus can be learned by a relatively small neural net. We can think of this as the network implementing a full-blown reasoning process.
In the case of sequence prediction, we optimize the log probability assigned to the true outcomes. As a simple example, the agent might observe a sequence of coin flips from a single coin, where the bias of that coin is chosen at the beginning of each episode (and is not given to the agent). If the bias is drawn from a Normal distribution centered at 0.5, the agent will start out predicting 50-50 on Heads/Tails; if it then sees a Heads, it might update slightly to something like 55-45, and vice versa for Tails. In contrast, if the bias is drawn from a distribution where most of the mass is near 0 or 1, and very little mass is at 0.5, the agent will still start out predicting 50-50, but after seeing a Heads it will then update strongly to e.g. 90-10.
In the case of sequential decision-making, we are given a reward function; we simply optimize the expected reward using some traditional deep RL algorithm (the paper considers Q-learning).
Understanding meta-trained algorithms through a Bayesian lens (Vladimir Mikulik, Grégoire Delétang, Tom McGrath, Tim Genewein et al) (summarized by Rohin): The previous paper suggested that meta-learning can implement optimal reasoning processes in theory. Does it work in practice? This paper sets out to answer this question by studying some simple prediction and decision-making tasks.
For prediction, we consider agents that are trained on a family of distributions (e.g. Bernoulli distributions whose parameter is chosen from a Beta distribution) to predict the probability distribution after seeing a sample generated from it. For decision-making, we consider two-armed bandit problems (where again there is a distribution over the parameters of the problem). These problems were chosen because their optimal solutions can be calculated analytically.
The authors train neural nets with memory to perform well on these tasks (as discussed in the previous paper) and find that they do indeed behave optimally, achieving effectively the best possible performance. They then try to investigate whether they are implementing the same reasoning algorithm as the analytic Bayes-optimal solution. To do this, they see whether they can train a second neural net to map the hidden states (memory) of the agent to the states in the Bayes-optimal solution, and vice versa. (One way to think of this: can you simulate the Bayes-optimal algorithm using the observation encodings from the RNN, and vice versa?)
They find that they can learn a good mapping from agent states to Bayes-optimal states, but cannot learn a good mapping from Bayes-optimal states to agent states. It seems likely that the agent has states that encode more information than is necessary, and so the minimal information stored by the Bayes-optimal algorithm is insufficient to reconstruct the agent states.
Read more: Paper: Meta-trained agents implement Bayes-optimal agents
Rohin's opinion: I suspect that in these simple tasks the posterior distribution over the parameters θ maintained by the Bayes-optimal algorithm is a minimal sufficient statistic, that is, any optimal policy must have states that are sufficient to reconstruct the information stored by the Bayes-optimal algorithm. So it makes sense that, for an agent with optimal behavior, the agent’s states could be used to simulate the Bayes-optimal states. I don’t think this tells us that much about the algorithm the network is implementing.
Note that I am quite happy to see work investigating the sorts of reasoning processes that neural networks have learned. While I don’t think the specific results in this paper have told us that much, I’m excited to see this line of work scaled up to more complex tasks, where agents may not reach optimal behavior, or might do so by learning heuristics that don’t encode all of the information that the Bayes-optimal algorithm would use.
I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.
An audio podcast version of the Alignment Newsletter is available. This podcast is an audio version of the newsletter, recorded by Robert Miles.
I agree with Zach above about the main point of the paper. One other thing I’d note is that SGD can’t have literally the same outcomes as random sampling, since random sampling wouldn’t display phenomena like double descent (AN #77).
Would you mind explaining why this is? It seems to me like random sampling would display double descent. For example, as you increase model size, at first you get more and more parameters that let you approximate the data better... but then you get too many parameters and just start memorizing the data... but then when you get even more parameters, you have enough functions available that simpler ones win out... Doesn't this story work just as well for random sampling as it does for SGD?
Hmm, I think you're right. I'm not sure what I was thinking when I wrote that. (Though I give it like 50% that if past-me could explain his reasons, I'd agree with him.)
Possibly I was thinking of epochal double descent, but that shouldn't matter because we're comparing the final outcome of SGD to random sampling, so epochal double descent doesn't come into the picture.
I normally don't think of most functions as polynomials at all - in fact, I think of most real-world functions as going to zero for large values. E.g. the function "dogness" vs. "nose size" cannot be any polynomial, because polynomials (or their inverses) blow up unrealistically for large (or small) nose sizes.
I guess the hope is that you always learn even polynomials, oriented in such a way that the extremes appear unappealing?
I believe the paper says that log densities are (approximately) polynomial - e.g. a Gaussian would satisfy this, since the log density of a Gaussian is quadratic.
What John said. To elaborate, it's specifically talking about the case where there is some concept from which some probabilistic generative model creates observations tied to the concept, and claiming that the log probabilities follow a polynomial.
Suppose the most dog-like nose size is K. One function you could use is y = exp(-(x - K)^d) for some positive integer d. That's a function whose maximum value is 0 (where higher values = more "dogness") and doesn't blow up unreasonably anywhere.
(Really you should be talking about probabilities, in which case you use the same sort of function but then normalize, which transforms the exp into a softmax, as the paper suggests)