Learning the prior and generalization

by Evan Hubinger4 min read29th Jul 202014 comments



This post is a response to Paul Christiano's post “Learning the prior.”

The generalization problem

Generally, when we train models, we often end up deploying them in situations that are distinctly different from those they were trained under. Take, for example, GPT-3. GPT-3 was trained to predict web text, not serve as a dungeon master—and the sort of queries that people present to AI dungeon are quite different than random web text—but nevertheless GPT-3 can perform quite well here because it has learned a policy which is general enough that it continues to function quite effectively in this new domain.

Relying on this sort of generalization, however, is potentially quite troublesome. If you're in a situation where your training and deployment data are in fact independently and identically distributed (i.i.d.), you can produce all sorts of nice guarantees about the performance of your model. For example, in an i.i.d. setting, you know that in the limit of training you'll get the desired behavior. Furthermore, even before the limit of training, you know that validation and deployment performance will precisely track each other such that you can bound the probability of catastrophic behavior by the incidence of catastrophic behavior on the validation data.

In a generalization setting, on the other hand, you have no such guarantees—even in the limit of training, precisely what your model does off-distribution is determined by your training process's inductive biases. In theory, any off-distribution behavior is compatible with zero training error—the only reason machine learning produces good off-distribution behavior is because it finds something like the simplest model that fits the data. As a result, however, a model's off-distribution behavior will be highly dependent on exactly what the training process's interpretation of “simpler” is—that is, its inductive biases. And relying on such inductive biases for your generalization behavior can potentially have catastrophic consequences.

Nuances with generalization

That being said, the picture I've painted above of off-distribution generalization being the problem isn't quite right. For example, consider an autoregressive model (like GPT-3) that's just trained to learn a particular distribution. Then, if I have some set of training data and a new data point , there's no test you can do to determine whether was really sampled from the same distribution as . In fact, for any and , I can always give you a distribution that could have been sampled from that assigns whatever probability I want to . Thus, to the extent that we're able to train models that can do a good job for i.i.d. —that is, that assign high probability to —it's because there's an implicit prior there that's assigning a fairly high probability to the actual distribution you used to sample the data from rather than any other of the infinitely many possible distributions (this is the no free lunch theorem). Even in the i.i.d. case, therefore, there's still a real and meaningful sense in which your performance is coming from the machine learning prior.

It's still the case, however, that actually using i.i.d. data does give you some real and meaningful guarantees—such as the ability to infer performance properties from validation data, as I mentioned previously. However, at least in the context of mesa-optimization, you can never really get i.i.d. data thanks to fundamental distributional shifts such as the the very fact that one set of data points is used in training and one set of data points is used in deployment. Paul Christiano's RSA-2048 example is a classic example of how that sort of fundamental distributional shift could potentially manifest. Both Paul and I have also written about possible solutions to this problem, but it's still a problem that you need to deal with even if you've otherwise fully dealt with the generalization problem.

Paul's approach and verifiability

The question I want to ask now, however, is the extent to which we can nevertheless at least somewhat stop relying on machine learning generalization and what benefits we might be able to get from doing so. As I mentioned, there's a sense in which we'll never fully be able to stop relying on generalization, but there might still be major benefits to be had from at least partially stopping doing so. At first, this might sound crazy—if you want to be competitive, surely you need to be able to do generalization? And I think that's true—but the question was whether we needed our machine learning models to do generalization, not whether we needed generalization at all.

Paul's recent post “Learning the prior” presents a possible way to get generalization in the way that a human would generalize while relying on significantly less machine learning generalization. Specifically, Paul's idea is to use ML to learn a set of forecasting assumptions that maximize the human's posterior estimate of the likelihood of over some training data, then generalize by learning a model that predicts human forecasts given . Paul argues that this approach is nicely i.i.d., but for the reasons mentioned above I don't fully buy that—for example, there are still fundamental distributional shifts that I'm skeptical can ever be avoided such as the fact that a deceptive model might care about some data points (e.g. the deployment ones) more than others (e.g. the training ones). That being said, I nevertheless think that there is still a real and meaningful sense in which Paul's proposal reduces the ML generalization burden in a helpful way—but I don't think that i.i.d-ness is the right way to talk about that.

Rather, I think that what's special about Paul's proposal is that it guarantees verifiability. That is, under Paul's setup, we can always check whether any answer matches the ground truth by querying the human with access to .[1] In practice, for extremely large which are represented only implicitly as in Paul's post, we might not always check whether the model matches the ground truth by actually generating the ground truth and instead just ask the human with access to to verify the answer, but regardless the point is that we have the ability to check the model's answers. This is different even than directly doing something like imitative amplification, where the only ground truth we can get in generalization scenarios is either computationally infeasible (HCH) or directly references the model itself (). One nice thing about this sort of verifiability is that, if we determine when to do the checks randomly, we can get a representative sample of the model's average-case generalization behavior—something we really can't do otherwise. Of course, we still need worst-case guarantees—but having strong average-case guarantees is still a big win.

To achieve verifiability while still being competitive across a large set of questions, however, requires being able to fully verify answers to all of those questions. That's a pretty tall order because it means there needs to exist some procedure which can justify arbitrary knowledge starting only from human knowledge and reasoning. This is the same sort of thing that amplification and debate need to be competitive, however, so at the very least it's not a new thing that we need for such approaches.

In any event, I think that striving for verifiability is a pretty good goal that I expect to have real benefits if it can be achieved—and I think it's a much more well-specified goal than i.i.d.-ness.

EDIT: I clarify a lot of stuff in the above post in this comment chain between me and Rohin.

  1. Note that when I say “the human with access to ” I mean through whatever means you are using to allow the human to interface with a large, implicitly represented (which could be amplification, debate, etc.)—for more detail see “Approval-maximizing representations.” ↩︎