Rohin Shah

PhD student at the Center for Human-Compatible AI. Creator of the Alignment Newsletter.


Value Learning
Alignment Newsletter


[AN #139]: How the simplicity of reality explains the success of neural nets

What John said. To elaborate, it's specifically talking about the case where there is some concept from which some probabilistic generative model creates observations tied to the concept, and claiming that the log probabilities follow a polynomial.

Suppose the most dog-like nose size is K. One function you could use is y = exp(-(x - K)^d) for some positive even integer d. That's a function whose maximum value is 1, attained at x = K (where higher values = more "dogness"), and it doesn't blow up unreasonably anywhere.

(Really you should be talking about probabilities, in which case you use the same sort of function but then normalize, which transforms the exp into a softmax, as the paper suggests)
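To make the normalization step concrete, here is a minimal sketch. It assumes a small discrete grid of candidate nose sizes and an even exponent d, so that the unnormalized score peaks at x = K; the grid values and function names are illustrative and not taken from the paper. It checks that normalizing exp(-(x - K)^d) gives the same distribution as a softmax over the polynomial log-scores -(x - K)^d.

```python
import math

def dogness_scores(xs, k, d=2):
    """Unnormalized 'dogness' exp(-(x - k)**d) for each candidate nose size x."""
    if d <= 0 or d % 2 != 0:
        # Assumption: d is even, so the score peaks at x = k and decays on both sides.
        raise ValueError("d must be a positive even integer")
    return [math.exp(-((x - k) ** d)) for x in xs]

def normalize(scores):
    """Divide by the total so the scores sum to 1."""
    total = sum(scores)
    return [s / total for s in scores]

def softmax(logits):
    """Numerically stable softmax over a list of log-scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical grid of nose sizes, with the most dog-like size K = 3.
nose_sizes = [1.0, 2.0, 3.0, 4.0, 5.0]
K = 3.0

# Route 1: exponentiate the polynomial, then normalize.
probs = normalize(dogness_scores(nose_sizes, K, d=2))

# Route 2: treat the polynomial as a log-score and apply softmax.
log_scores = [-((x - K) ** 2) for x in nose_sizes]
probs_softmax = softmax(log_scores)
```

Both routes give the same probabilities, which is the sense in which normalizing turns the exp into a softmax.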

Utility Maximization = Description Length Minimization

The core conceptual argument is: the higher your utility function can go, the bigger the world must be, and so the more bits it must take to describe it in its unoptimized state under M2, and so the more room there is to reduce the number of bits.

If you could only ever build 10 paperclips, then maybe it takes 100 bits to specify the unoptimized world, and 1 bit to specify the optimized world.

If you could build 10^100 paperclips, then the world must be humongous and it takes 10^101 bits to specify the unoptimized world, but still just 1 bit to specify the perfectly optimized world.

If you could build ∞ paperclips, then the world must be infinite, and it takes ∞ bits to specify the unoptimized world. Infinities are technically challenging, and John's comment goes into more detail about how you deal with this sort of case.

For more intuition, notice that exp(x) is a bijective function from (-∞, ∞) to (0, ∞), so it goes from something unbounded on both sides to something unbounded on one side. That's exactly what's happening here, where utility is unbounded on both sides and gets mapped to something that is unbounded only on one side.

Some thoughts on risks from narrow, non-agentic AI

I think there's a perspective where the post-singularity failure is still the important thing to talk about, and that's an error I made in writing the post. I skipped it because there is no real action after the singularity---the damage is irreversibly done, all of the high-stakes decisions are behind us---but it still matters for people trying to wrap their heads around what's going on. And moreover, the only reason it looks that way to me is because I'm bringing in a ton of background empirical assumptions (e.g. I believe that massive acceleration in growth is quite likely), and the story will justifiably sound very different to someone who isn't coming in with those assumptions.

Fwiw I think I didn't realize you weren't making claims about what post-singularity looked like, and that was part of my confusion about this post. Interpreting it as "what's happening until the singularity" makes more sense. (And I think I'm mostly fine with the claim that it isn't that important to think about what happens after the singularity.)

Some thoughts on risks from narrow, non-agentic AI

I've suggested two ways in which automated reasoning may be more easily applied to certain long-term goals (namely the goals that are natural generalizations of training objectives, or goals that are most easily discovered in neural networks).

This makes sense to me, and seems to map somewhat onto Parts 1 and 2 of WFLL.

However, you also call those parts "going out with a whimper" and "going out with a bang", which seem to be claims about the impacts of bad generalizations. In that post, are you intending to make claims about possible kinds of bad generalizations that ML models could make, or possible ways that poorly-generalizing ML models could lead to catastrophe (or both)?

Personally, I'm pretty on board with the two types of bad generalizations as plausible things that could happen, but less on board with "going out with a whimper" as leading to catastrophe. It seems like you at least need to explain why in that situation we can't continue to work on the alignment problem and replace the agents with better-aligned AI systems in the future. (Possibly the answer is "the AI systems don't allow us to do this because it would interfere with their continued operation".)

Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian

More explanation on why I'm not summarizing the paper about Kolmogorov complexity:

I don’t find the Kolmogorov complexity stuff very convincing. In general, algorithmic information theory tends to have arguments of the form “since this measure simulates every computable process, it (eventually) at least matches any particular computable process”. This feels pretty different from normal notions of “simplicity” or “intelligence”, and so I often try to specifically taboo phrases like “Solomonoff induction” or “Kolmogorov complexity” and replace them with something like “by simulating every possible computational process”, and see if the argument still seems convincing. That mostly doesn’t seem to be the case here.

If I try to do this with the arguments here, I get something like:

Since it is possible to compress high-probability events using an optimal code for the probability distribution, you might expect that functions with high probability in the neural network prior can be compressed more than functions with low probability. Since high probability functions are more likely, this means that the more likely functions correspond to shorter programs. Since shorter programs are necessarily more likely in the prior that simulates all possible programs, they should be expected to be better programs, and so generalize well.

This argument just doesn’t sound very compelling to me. It also can be applied to literally any machine learning algorithm; I don’t see why this is specific to neural nets. If this is just meant to explain why it is okay to overparameterize neural nets, then that makes more sense to me, though then I’d say something like “with overparameterized neural nets, many different parameterizations instantiate the same function, and so the ‘effective parameterization’ is lower than you might have thought”, rather than saying anything about Kolmogorov complexity.

(This doesn't capture everything that simplicity is meant to capture -- for example, it doesn't capture the argument that neural networks can express overfit-to-the-training-data models, but those are high complexity and so low likelihood in the prior and so don't happen in general; but as mentioned above I find the Kolmogorov complexity argument for this pretty tenuous.)

Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian

Zach's summary for the Alignment Newsletter (just for the SGD as Bayesian sampler paper):

Neural networks have been shown empirically to generalize well in the overparameterized setting, which suggests that there is an inductive bias for the final learned function to be simple. The obvious next question: does this inductive bias come from the _architecture_ and _initialization_ of the neural network, or does it come from stochastic gradient descent (SGD)? This paper argues that it is primarily the former.

Specifically, if the inductive bias came from SGD, we would expect that bias to go away if we replaced SGD with random sampling. In random sampling, we sample an initialization of the neural network, and if it has zero training error, then we’re done, otherwise we repeat.
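The random sampling procedure described above can be sketched on a toy problem. This is only an illustration under strong assumptions: a tiny ReLU network on 3-bit Boolean inputs, standard Gaussian initialization, and a made-up four-example training set. None of the names or sizes come from the paper, which approximates this procedure with a Gaussian process because direct rejection sampling is far too slow on real datasets.

```python
import random
from collections import Counter
from itertools import product

def sample_params(rng, n_inputs=3, n_hidden=4):
    """Draw one random initialization: Gaussian weights, with the bias stored last."""
    hidden = [[rng.gauss(0, 1) for _ in range(n_inputs + 1)] for _ in range(n_hidden)]
    output = [rng.gauss(0, 1) for _ in range(n_hidden + 1)]
    return hidden, output

def predict(params, x):
    """One hidden ReLU layer followed by a thresholded linear output."""
    hidden, output = params
    # zip stops at the shorter sequence, so the trailing bias entry is added separately.
    h = [max(0.0, w[-1] + sum(wi * xi for wi, xi in zip(w, x))) for w in hidden]
    z = output[-1] + sum(wi * hi for wi, hi in zip(output, h))
    return int(z > 0)

def rejection_sample(train, rng, max_tries=100_000):
    """Resample initializations until one has zero training error."""
    for _ in range(max_tries):
        params = sample_params(rng)
        if all(predict(params, x) == y for x, y in train):
            return params
    raise RuntimeError("no zero-error initialization found")

ALL_INPUTS = list(product([0, 1], repeat=3))

# Hypothetical training set: the label equals the first input bit.
TRAIN = [((0, 0, 0), 0), ((1, 0, 0), 1), ((0, 1, 1), 0), ((1, 1, 0), 1)]

def function_table(params):
    """The full input-output function the network computes on all 8 inputs."""
    return tuple(predict(params, x) for x in ALL_INPUTS)

N_SAMPLES = 200
rng = random.Random(0)
# Tally which functions the accepted initializations implement.
counts = Counter(function_table(rejection_sample(TRAIN, rng)) for _ in range(N_SAMPLES))
```

The resulting `counts` is an empirical estimate of the probability of each zero-training-error function under random sampling. The paper compares the analogous quantity against how often SGD-trained networks end up at each function.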

The authors explore this hypothesis experimentally on the MNIST, Fashion-MNIST, and IMDb movie review databases. They test on variants of SGD, including Adam, Adagrad, and RMSprop. Since actually running rejection sampling for a dataset would take _way_ too much time, the authors approximate it using a Gaussian Process. This is known to be a good approximation in the large width regime.

Results show that the probabilities of obtaining a given function under SGD and under (approximate) random sampling are correlated over many orders of magnitude for different architectures, datasets, and optimization methods. While the correlation isn't perfect over all scales, it tends to improve as the frequency of the function increases. In particular, the top few most likely functions tend to have highly correlated probabilities under both generation mechanisms.

Zach's opinion:

Fundamentally the point here is that generalization performance is explained much more by the neural network architecture than by the structure of stochastic gradient descent, since we can see that stochastic gradient descent tends to behave similarly to (an approximation of) random sampling. The paper talks a bunch about things like SGD being (almost) Bayesian and the neural network prior having low Kolmogorov complexity; I found these to be distractions from the main point. Beyond that, approximating the random sampling probability with a Gaussian process is a fairly delicate affair and I have concerns about the applicability to real neural networks.

One way that SGD could differ from random sampling is that SGD will typically only reach the boundary of a region with zero training error, whereas random sampling will sample uniformly within the region. However, in high dimensional settings, most of the volume is near the boundary, so this is not a big deal. I'm not aware of any work that claims SGD uniformly samples from this boundary, but it's worth considering that possibility if the experimental results hold up.
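The claim that most of the volume is near the boundary can be checked with a short calculation. As an idealization (an assumption made here, not something the paper states), treat the zero-training-error region as a ball: its volume scales as radius^dim, so the fraction lying within a relative distance eps of the surface is 1 - (1 - eps)^dim.

```python
def shell_fraction(dim, eps):
    """Fraction of a dim-dimensional ball's volume within relative distance eps of its surface."""
    if not 0.0 <= eps <= 1.0:
        raise ValueError("eps must be between 0 and 1")
    # The inner ball of radius (1 - eps) holds a (1 - eps)**dim share of the volume.
    return 1.0 - (1.0 - eps) ** dim

# The outermost 1% of the radius holds more and more of the volume as dim grows.
for dim in (2, 10, 100, 1000):
    print(dim, shell_fraction(dim, 0.01))
```

In 2 dimensions the outer 1% shell holds about 2% of the volume; by 1000 dimensions it holds nearly all of it, and real networks have far more parameters than that.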

Rohin’s opinion:

I agree with Zach above about the main point of the paper. One other thing I’d note is that SGD can’t have literally the same outcomes as random sampling, since random sampling wouldn’t display phenomena like <@double descent@>(@Deep Double Descent@). I don’t think this is in conflict with the claim of the paper, which is that _most_ of the inductive bias comes from the architecture and initialization.

Other work by the same group provides some theoretical and empirical arguments that the neural network prior does have an inductive bias towards simplicity. I find those results suggestive but not conclusive, and am far more persuaded by the paper summarized here, so I don’t expect to summarize them.

Recursive Quantilizers II

I think what's really interesting to me is making sure the system is reasoning at all those levels, because I have an intuition that that's necessary (to get concepts we care about right).

I'm super on board with this desideratum, and agree that it would not be a good move to change it to some fixed number of levels. I also agree that from a conceptual standpoint many ideas are "about all the levels".

My questions / comments are about the implementation proposed in this post. I thought that you were identifying "levels of reasoning" with "depth in the idealized recursive QAS tree"; if that's the case I don't see how feedback at one level generalizes to all the other levels (feedback at that level is used to make the QAS at that level, and not other levels, right?)

I'm pretty sure I'm just failing to understand some fact about the particular implementation, or what you mean by "levels of reasoning", or its relation to the idealized recursive QAS tree.

This is where my proposal differs from proposals more reliant on human imitation. Any particular thing we can say about what better reasoning would look like, the system attempts to incorporate.

I would argue this is also true of learning from human preferences (comparisons), amplification, and debate; not sure if you would disagree. I agree straight human imitation wouldn't do this.

In principle, you could re-start the whole training process after each interaction, so that each new piece of training data gets equal treatment (it's all part of what's available "at the start").

Huh? I thought the point was that your initial feedback can help you interpret later feedback. So maybe you start with Boltzmann rationality, and then you get some feedback from humans, and now you realize that you should interpret all future feedback pragmatically.

It seems like you have to choose one of two options:

  1. Order of feedback does matter, in which case bad early feedback can lock you in to a bad outcome
  2. Order of feedback doesn't matter, in which case you can't improve your interpretation of feedback over time (at least, not in a consistent way)

(This seems true more generally for any system that aims to learn at all the levels, not just for the implementation proposed in this post.)

It seems to me like you're trying to illustrate something like "Abram's proposal doesn't get at the bottlenecks".

I think it's more like "I'm not clear on the benefit of this proposal over (say) learning from comparisons". I'm not asking about bottlenecks; I'm asking about what the improvement is.

I'm curious how you see whitelisting working.

The same way I see any other X working: we explicitly train the neural net to satisfy X through human feedback (perhaps using amplification, debate, learning the prior, etc). For a whitelist, we might be able to do something slightly different: we train a classifier to say whether the situation is or isn't in our whitelist, and then only query the agent when it is in our whitelist (otherwise reverting to a safe baseline). The classifier and agent share most of their weights.

Then we also do a bunch of stuff to verify that the neural net actually satisfies X (perhaps adversarial training, testing, interpretability, etc). In the whitelisting case, we'd be doing this on the classifier, if that's the route we went down.

It feels like your beliefs about what kind of methods might work for "merely way-better-than-human" systems are a big difference between you and I, which might be worth discussing more, although I don't know if it's very central to everything else we're discussing.

(Addressed this in the other comment)

Recursive Quantilizers II

Responding first to the general approach to good-enough alignment:

I think I would agree with this if you said "optimization that's at or below human level" rather than "not ridiculously far above".

Humans can be terrifying. The prospect of a system slightly smarter than any human who has ever lived, with values that are just somewhat wrong, seems not great.

Less important response: If by "not great" you mean "existentially risky", then I think you need to explain why the smartest / most powerful historical people with now-horrifying values did not constitute an existential risk.

My real objection: Your claim is about what happens after you've already failed, in some sense -- you're starting from the assumption that you've deployed a misaligned agent. From my perspective, you need to start from a story in which we're designing an AI system, that will eventually have let's say "5x the intelligence of a human", whatever that means, but we get to train that system however we want. We can inspect its thought patterns, spend lots of time evaluating its decisions, test what it would do in hypothetical situations, use earlier iterations of the tool to help understand later iterations, etc. My claim is that whatever bad optimization "sneaks through" this design process is probably not going to have much impact on the agent's performance, or we would have already caught it.

Possibly related: I don't like thinking of this in terms of how "wrong" the values are, because that doesn't allow you to make distinctions about whether behaviors have already been seen at training or not.

But really, mainly, I was making the normative claim. A culture of safety is not one in which "it's probably fine" is allowed as part of any real argument. Any time someone is tempted to say "it's probably fine", it should be replaced with an actual estimate of the probability, or a hopeful statement that combined with other research it could provide high enough confidence (with some specific sketch of what that other research would be), or something along those lines. You cannot build reliable knowledge out of many many "it's probably fine" arguments; so at best you should carefully count how many you allow yourself.

A relevant empirical claim sitting behind this normative intuition is something like: "without such a culture of safety, humans have a tendency to slide into whatever they can get away with, rather than upholding safety standards".

If your claim is just that "we're probably fine" is not enough evidence for an argument, I certainly agree with that. That was an offhand remark in an opinion in a newsletter where words are at a premium; I obviously hope to do better than that in reality.

This all seems pretty closely related to Eliezer's writing on security mindset.

Some thoughts here:

  1. I am unconvinced that we need a solution that satisfies a security-mindset perspective, rather than one that satisfies an ordinary-paranoia perspective. (A crucial point here is that the goal is to avoid building adversarial optimizers in the first place, rather than to defend against adversarial optimization.) As far as I can tell the argument for this claim is... a few fictional parables? (Readers: Before I get flooded with examples of failures where security mindset could have helped, let me note that I will probably not be convinced by this unless you can also account for the selection bias in those examples.)
  2. I don't really see why the ML-based approaches don't satisfy the requirement of being based on security mindset. (I agree "we're probably fine" does not satisfy that requirement.) Note that there isn't a solution that is maximally security-mindset-y, the way I understand the phrase (while still building superintelligent systems). A simple argument: we always have to specify something (code if nothing else); that something could be misspecified. So here I'm just claiming that ML-based approaches seem like they can be "sufficiently" security-mindset-y.
  3. I might be completely misunderstanding the point Eliezer is trying to make, because it's stated as a metaphor / parable instead of just stating the thing directly (and a clear and obvious disanalogy is that we are dealing with the construction of optimizers, rather than the construction of artifacts that must function in the presence of optimization).

Formal Solution to the Inner Alignment Problem

N won't necessarily decrease over time, but all of the models will eventually agree with each other.

Ah, right. I rewrote that paragraph, getting rid of that sentence and instead talking about the tradeoff directly.

I would have described Vanessa's and my approaches as more about monitoring uncertainty, and avoiding problems before the fact rather than correcting them afterward. But I think what you said stands too.

Added a sentence to the opinion noting the benefits of explicitly quantified uncertainty.

Formal Solution to the Inner Alignment Problem

Planned summary:

Since we probably can’t specify a reward function by hand, one way to get an agent that does what we want is to have it imitate a human. As long as it does this faithfully, it is as safe as the human it is imitating. However, in a train-test paradigm, the resulting agent may faithfully imitate the human on the training distribution but fail catastrophically on the test distribution. (For example, a deceptive model might imitate faithfully until it has sufficient power to take over.) One solution is to never stop training, that is, use an online learning setup where the agent is constantly learning from the demonstrator.

There are a few details to iron out. The agent needs to reduce the frequency with which it queries the demonstrator (otherwise we might as well just have the demonstrator do the work). Crucially, we need to ensure that the agent will never do something that the demonstrator wouldn’t have done, because such an action could be arbitrarily bad.

This paper proposes a solution in the paradigm where we use Bayesian updating rather than gradient descent to select our model, that is, we have a prior over possible models and then when we see a demonstrator action we update our distribution appropriately. In this case, at every timestep we take the N most probable models, and only take an action a with probability p if **every** one of the N models takes that action with at least probability p. The total probability of all the actions will typically be less than 1 -- the remaining probability is assigned to querying the demonstrator.

The key property here is that as long as the true demonstrator is in the top N models, then the agent never autonomously takes an action with more probability than the demonstrator would. Therefore as long as we believe the demonstrator is safe then the agent should be as well. Since the agent learns more about the demonstrator every time it queries them, over time it needs to query the demonstrator less often. Note that the higher N is, the more likely it is that the true model is one of those N models (and thus we have more safety), but also the more likely it is that we will have to query the demonstrator. This tradeoff is controlled by a hyperparameter α that implicitly determines N.
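The action rule in the summary can be sketched as follows. This is a simplified illustration, not the paper's algorithm: it assumes a finite list of candidate models, each given as a dictionary of action probabilities for the current timestep, and it takes N directly as an argument rather than deriving it from α. All names are hypothetical.

```python
def conservative_policy(models, posterior, actions, n):
    """Return (action_probs, query_prob) for one timestep.

    models: list of dicts mapping action -> probability under that model.
    posterior: posterior weight of each model, in the same order.
    n: number of most probable models to consult (chosen by hand here).
    """
    # Indices of the n models with the highest posterior weight.
    top = sorted(range(len(models)), key=lambda i: posterior[i], reverse=True)[:n]
    # Take each action only with the probability that every top model agrees on.
    action_probs = {a: min(models[i].get(a, 0.0) for i in top) for a in actions}
    # The remaining probability mass goes to querying the demonstrator.
    query_prob = max(0.0, 1.0 - sum(action_probs.values()))
    return action_probs, query_prob

# Hypothetical example with three candidate models of the demonstrator.
models = [
    {"left": 0.7, "right": 0.3},
    {"left": 0.5, "right": 0.5},
    {"left": 0.1, "right": 0.9},
]
posterior = [0.5, 0.4, 0.1]

action_probs, query_prob = conservative_policy(models, posterior, ["left", "right"], n=2)
```

With n=2 the agent goes left with probability 0.5, goes right with probability 0.3, and queries the demonstrator with the remaining 0.2. Raising n to 3 pulls in the third model and pushes the query probability up to 0.6, which is the safety versus query-frequency tradeoff the summary describes.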


Planned opinion:

One of the most important approaches to improve inner alignment is to monitor the performance of your system online, and train to correct any problems. This paper shows the benefit of explicitly quantified, well-specified uncertainty: it allows you to detect problems _before they happen_ and then correct for them.

This setting has also been studied in <@delegative RL@>(@Delegative Reinforcement Learning@), though there the agent also has access to a reward signal in addition to a demonstrator.
