Better priors as a safety problem

paulfchristiano

Fitting a neural net implicitly uses a “wrong” prior. This makes neural nets more data hungry and makes them generalize in ways we don’t endorse, but it’s not clear whether it’s an alignment problem.

After all, if neural nets are what works, then both the aligned and unaligned AIs will be using them. It’s not clear if that systematically disadvantages aligned AI.

Unfortunately I think it’s an alignment problem:

I think the neural net prior may work better for agents with certain kinds of simple goals, as described in Inaccessible Information. The problem is that the prior mismatch may bite harder for some kinds of questions, and some agents simply never need to answer those hard questions.
I think that Solomonoff induction generalizes catastrophically because it becomes dominated by consequentialists who use better priors.

In this post I want to try to build some intuition for this problem, and then explain why I’m currently feeling excited about learning the right prior.

Indirect specifications in universal priors

We usually work with very broad “universal” priors, both in theory (e.g. Solomonoff induction) and in practice (deep neural nets are a very broad hypothesis class). For simplicity I’ll talk about the theoretical setting in this section, but I think the points apply equally well in practice.

The classic universal prior is a random output from a random stochastic program. We often think of the question “which universal prior should we use?” as equivalent to the question “which programming language should we use?” but I think that’s a loaded way of thinking about it — not all universal priors are defined by picking a random program.

A universal prior can never be too wrong — a prior P is universal if, for any other computable prior Q, there is some constant c such that, for all x, we have P(x) > c Q(x). That means that given enough data, any two universal priors will always converge to the same conclusions, and no computable prior will do much better than them.

Unfortunately, universality is much less helpful in the finite data regime. The first warning sign is that our “real” beliefs about the situation can appear in the prior in two different ways:

Directly: if our beliefs about the world are described by a simple computable predictor, they are guaranteed to appear in a universal prior with significant weight.
Indirectly: the universal prior also “contains” other programs that are themselves acting as priors. For example, suppose I use a universal prior with a terribly inefficient programming language, in which each character needed to be repeated 10 times in order for the program to do anything non-trivial. This prior is still universal, but it’s reasonably likely that the “best” explanation for some data will be to first sample a really simple interpret for a better programming language, and then draw a uniformly randomly program in that better programming language.

(There isn’t a bright line between these two kinds of posterior, but I think it’s extremely helpful for thinking intuitively about what’s going on.)

Our “real” belief is more like the direct model — we believe that the universe is a lawful and simple place, not that the universe is a hypothesis of some agent trying to solve a prediction problem.

Unfortunately, for realistic sequences and conventional universal priors, I think that indirect models are going to dominate. The problem is that “draw a random program” isn’t actually a very good prior, even if the programming language is OK— if I were an intelligent agent, even if I knew nothing about the particular world I lived in, I could do a lot of a priori reasoning to arrive at a much better prior.

The conceptually simplest example is “I think therefore I am.” Our hypotheses about the world aren’t just arbitrary programs that produce our sense experiences— we restrict attention to hypotheses that explain why we exist and for which it matters what we do. This rules out the overwhelming majority of programs, allowing us to assign significantly higher prior probability to the real world.

I can get other advantages from a priori reasoning, though they are a little bit more slippery to talk about. For example, I can think about what kinds of specifications make sense and really are most likely a priori, rather than using an arbitrary programming language.

The upshot is that an agent who is trying to do something, and has enough time to think, actually seems to implement a much better prior than a uniformly random program. If the complexity of specifying such an agent is small relative to the prior improbability of the sequence we are trying to predict, then I think the universal prior is likely to pick out the sequence indirectly by going through the agent (or else in some even weirder way).

I make this argument in the case of Solomonoff induction in What does the universal prior actually look like? I find that argument pretty convincing, although Solomonoff induction is weird enough that I expect most people to bounce off that post.

I make this argument in a much more realistic setting in Inaccessible Information. There I argue that if we e.g. use a universal prior to try to produce answers to informal questions in natural language, we are very likely to get an indirect specification via an agent who reasons about how we use language.

Why is this a problem?

I’ve argued that the universal prior learns about the world indirectly, by first learning a new better prior. Is that a problem?

To understand how the universal prior generalizes, we now need to think about how the learned prior generalizes.

The learned prior is itself a program that reasons about the world. In both of the cases above (Solomonoff induction and neural nets) I’ve argued that the simplest good priors will be goal-directed, i.e. will be trying to produce good predictions.

I have two different concerns with this situation, both of which I consider serious:

Bad generalizations may disadvantage aligned agents. The simplest version of “good predictions” may not generalize to some of the questions we care about, and may put us at a disadvantage relative to agents who only care about simpler questions. (See Inaccessible Information.)
Treacherous behavior. Some goals might be easier to specify than others, and a wide range of goals may converge instrumentally to “make good predictions.” In this case, the simplest programs that predict well might be trying to do something totally unrelated, when they no longer have instrumental reasons to predict well (e.g. when their predictions can no longer be checked) they may do something we regard as catastrophic.

I think it’s unclear how serious these problems are in practice. But I think they are huge obstructions from a theoretical perspective, and I think there is a reasonable chance that this will bite us in practice. Even if they aren’t critical in practice, I think that it’s methodologically worthwhile to try to find a good scalable solution to alignment, rather than having a solution that’s contingent on unknown empirical features of future AI.

Learning a competitive prior

Fundamentally, I think our mistake was building a system that uses the wrong universal prior, one that fails to really capture our beliefs. Within that prior, there are other agents who use a better prior, and those agents are able to outcompete and essentially take over the whole system.

I’ve considered lots of approaches that try to work around this difficulty, taking for granted that we won’t have the right prior and trying to somehow work around the risky consequences. But now I’m most excited about the direct approach: give our original system the right prior so that sub-agents won’t be able to outcompete it.

This roughly tracks what’s going on in our real beliefs, and why it seems absurd to us to infer that the world is a dream of a rational agent—why think that the agent will assign higher probability to the real world than the “right” prior? (The simulation argument is actually quite subtle, but I think that after all the dust clears this intuition is basically right.)

What’s really important here is that our system uses a prior which is competitive, as evaluated by our real, endorsed (inaccessible) prior. A neural net will never be using the “real” prior, since it’s built on a towering stack of imperfect approximations and is computationally bounded. But it still makes sense to ask for it to be “as good as possible” given the limitations of its learning process — we want to avoid the situation where the neural net is able to learn a new prior which predictably to outperforms the outer prior. In that situation we can’t just blame the neural net, since it’s demonstrated that it’s able to learn something better.

In general, I think that competitiveness is a desirable way to achieve stability — using a suboptimal system is inherently unstable, since it’s easy to slip off of the desired equilibrium to a more efficient alternative. Using the wrong prior is just one example of that. You can try to avoid slipping off to a worse equilibrium, but you’ll always be fighting an uphill struggle.

Given that I think that finding the right universal prior should be “plan A.” The real question is whether that’s tractable. My current view is that it looks plausible enough (see Learning the prior for my current best guess about how to approach it) that it’s reasonable to focus on for now.

Better priors as a safety problem was originally published in AI Alignment on Medium, where people are continuing the conversation by highlighting and responding to this story.

Summary for the Alignment Newsletter (also includes a summary for Learning the prior):

Any machine learning algorithm (including neural nets) has some inductive bias, which can be thought of as its “prior” over what the data it will receive will look like. In the case of neural nets (and any other general ML algorithm to date), this prior is significantly worse than human priors, since it does not encode e.g. causal reasoning or logic. Even if we avoid priors that depended on us previously seeing data, we would still want to update on facts like “I think therefore I am”. With a better prior, our ML models would be able to learn more sample efficiently. While this is so far a capabilities problem, there are two main ways in which it affects alignment.

First, as argued in <@Inaccessible information@>, the regular neural net prior will learn models which can predict accessible information. However, our goals depend on inaccessible information, and so we would have to do some “extra work” in order to extract the inaccessible information from the learned models in order to build agents that do what we want. This leads to a competitiveness hit, relative to agents whose goals depend only on accessible information, and so during training we might expect to consistently get agents whose goals depend on accessible information instead of the goals we actually want.

Second, since the regular neural net prior is so weak, there is an incentive to learn a better prior, and then have that better prior perform the task. This is effectively an incentive for the neural net to learn a <@mesa optimizer@>(@Risks from Learned Optimization in Advanced Machine Learning Systems@), which need not be aligned with us, and so would generalize differently than we would, potentially catastrophically.

Let’s formalize this a bit more. We have some evidence about the world, given by a dataset D = {(x1, y1), (x2, y2), ...} (we assume that it’s a prediction task -- note that most self-supervised tasks can be written in this form). We will later need to make predictions on the dataset D* = {x1*, x2*, …}, which may be from a “different distribution” than D (e.g. D might be about the past, while D* is about the future). We would like to use D to learn some object Z that serves as a “prior”, such that we can then use Z to make good predictions on D*.

The standard approach which we might call the “neural net prior” is to train a model to predict y from x using the dataset D, and then apply that model directly to D*, hoping that it transfers correctly. We can inject some human knowledge by finetuning the model using human predictions on D*, that is by training the model on {(x1*, H(x1*)), (x2*, H(x2*)), …}. However, this does not allow H to update their prior based on the dataset D. (We assume that H cannot simply read through all of D, since D is massive.)

What we’d really like is some way to get the predictions H would make if they could update on dataset D. For H, we’ll imagine that a prior Z is given by some text describing e.g. rules of logic, how to extrapolate trends, some background facts about the world, empirical estimates of key quantities, etc. I’m now going to talk about priors over the prior Z, so to avoid confusion I’ll now call an individual Z a “background model”.

The key idea here is to structure the reasoning in a particular way: H has a prior over background models Z, and then given Z, H’s predictions for any given x_i are independent of any all the other (x, y) pairs. In other words, once you’ve fixed your background model of the world, your prediction of y_i doesn’t depend on the value of y_j for some other x_j. Or to explain it a third way, this is like having a set of hypotheses {Z}, and then updating on each element of D one by one using Bayes Rule. In that case, the log posterior of a particular background model Z is given by log Prior(Z) + sum_i log P(y_i | x_i, Z) (neglecting a normalization constant).

The nice thing about this is the individual terms Prior(Z) and P(y_i | x_i, Z) are all things that humans can do, since they don’t require the human to look at the entire dataset D. In particular, we can learn Prior(Z) by presenting humans with a background model, and having them evaluate how likely it is that the background model is accurate. Similarly, P(y_i | x_i, Z) simply requires us to have humans predict y_i under the assumption that the background facts in Z are accurate. So, we can learn models for both of these using neural nets. We can then find the best background model Z* by optimizing the equation above, representing what H would think was the most likely background model after updating on all of D. We can then learn a model for P(y*_i | x*_i, Z*) by training on human predictions of y*_i given access to Z*.

This of course only gets us to human performance, which requires relatively small Z. If we want to have large background models allowing for superhuman performance, we can use iterated amplification and debate to learn Prior(Z) and P(y | x, Z). There is some subtlety about how to represent Z that I won’t go into here.

Planned opinion:

It seems to me like solving this problem has two main benefits. First, the model our AI system learns from data (i.e. the Z*) is interpretable, and in particular we should be able to extract the previously inaccessible information that is relevant to our goals (which helps us build AI systems that actually pursue those goals). Second, AI systems built in this way are incentivized to generalize in the same way that humans do: in the scheme above, we learn from one distribution D, and then predict on a new distribution D*, but every model learned with a neural net is only used on the same distribution it was trained on.

Of course, while the AI system is _incentivized_ to generalize the way humans do, that does not mean it _will_ generalize as humans do -- it is still possible that the AI system internally “wants” to gain power, and only instrumentally answers questions the way humans would answer them. So inner alignment is still a potential issue. It seems possible to me that whatever techniques we use for dealing with inner alignment will also deal with the problems of unsafe priors as a side effect, in which case we may not end up needing to implement human-like priors. (As the post notes, it may be much more difficult to use this approach than to do the standard “neural net prior” approach described above, so it would be nice to avoid it.)

This will probably go out in the newsletter 9 days from now instead of the next one, partially because I have two things to highlight and I'd rather send them out separately, and partially because I'm not confident my summary / opinion are correct and I want to have more time for people to point out flaws.