This post is a response to Paul Christiano's post “Learning the prior.”

The generalization problem

When we train models, we often end up deploying them in situations that are distinctly different from those they were trained under. Take, for example, GPT-3. GPT-3 was trained to predict web text, not serve as a dungeon master—and the sort of queries that people present to AI Dungeon are quite different from random web text—but nevertheless GPT-3 can perform quite well here because it has learned a policy which is general enough that it continues to function effectively in this new domain.

Relying on this sort of generalization, however, is potentially quite troublesome. If you're in a situation where your training and deployment data are in fact independently and identically distributed (i.i.d.), you can produce all sorts of nice guarantees about the performance of your model. For example, in an i.i.d. setting, you know that in the limit of training you'll get the desired behavior. Furthermore, even before the limit of training, you know that validation and deployment performance will precisely track each other such that you can bound the probability of catastrophic behavior by the incidence of catastrophic behavior on the validation data.

In a generalization setting, on the other hand, you have no such guarantees—even in the limit of training, precisely what your model does off-distribution is determined by your training process's inductive biases. In theory, any off-distribution behavior is compatible with zero training error—the only reason machine learning produces good off-distribution behavior is because it finds something like the simplest model that fits the data. As a result, however, a model's off-distribution behavior will be highly dependent on exactly what the training process's interpretation of “simpler” is—that is, its inductive biases. And relying on such inductive biases for your generalization behavior can potentially have catastrophic consequences.

Nuances with generalization

That being said, the picture I've painted above of off-distribution generalization being the problem isn't quite right. For example, consider an autoregressive model (like GPT-3) that's just trained to learn a particular distribution. Then, if I have some set of training data x and a new data point x′, there's no test you can do to determine whether x′ was really sampled from the same distribution as x. In fact, for any x and x′, I can always give you a distribution that x could have been sampled from that assigns whatever probability I want to x′. Thus, to the extent that we're able to train models that can do a good job for i.i.d. x′—that is, that assign high probability to x′—it's because there's an implicit prior there that's assigning a fairly high probability to the actual distribution you used to sample the data from rather than any other of the infinitely many possible distributions (this is the no free lunch theorem). Even in the i.i.d. case, therefore, there's still a real and meaningful sense in which your performance is coming from the machine learning prior.
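To spell out that construction concretely (this is just an illustration of the standard no-free-lunch point, not anything from Paul's post): given training data x = (x_1, …, x_n) and any target probability α ∈ [0, 1) for a new point x′ not among the x_i, consider the mixture

$$p_\alpha \;=\; (1-\alpha)\cdot\frac{1}{n}\sum_{i=1}^{n}\delta_{x_i} \;+\; \alpha\cdot\delta_{x'}$$

Every training point has positive probability under p_α, so x is perfectly consistent with having been sampled i.i.d. from p_α, and yet p_α assigns probability α to x′. The data alone can never rule out such distributions; only the prior can.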

It's still the case, however, that actually using i.i.d. data does give you some real and meaningful guarantees—such as the ability to infer performance properties from validation data, as I mentioned previously. However, at least in the context of mesa-optimization, you can never really get i.i.d. data, thanks to fundamental distributional shifts such as the very fact that one set of data points is used in training and another is used in deployment. Paul Christiano's RSA-2048 example is a classic example of how that sort of fundamental distributional shift could potentially manifest. Both Paul and I have also written about possible solutions to this problem, but it's still a problem that you need to deal with even if you've otherwise fully dealt with the generalization problem.

Paul's approach and verifiability

The question I want to ask now, however, is to what extent we can nevertheless at least somewhat stop relying on machine learning generalization, and what benefits we might be able to get from doing so. As I mentioned, there's a sense in which we'll never fully be able to stop relying on generalization, but there might still be major benefits to be had from at least partially doing so. At first, this might sound crazy—if you want to be competitive, surely you need to be able to do generalization? And I think that's true—but the question is whether we need our machine learning models to do the generalization, not whether we need generalization at all.

Paul's recent post “Learning the prior” presents a possible way to get generalization in the way that a human would generalize while relying on significantly less machine learning generalization. Specifically, Paul's idea is to use ML to learn a set of forecasting assumptions Z that maximizes the human's posterior estimate of the likelihood of Z over some training data, then generalize by learning a model that predicts human forecasts given Z. Paul argues that this approach is nicely i.i.d., but for the reasons mentioned above I don't fully buy that—for example, there are still fundamental distributional shifts that I'm skeptical can ever be avoided, such as the fact that a deceptive model might care about some data points (e.g. the deployment ones) more than others (e.g. the training ones). That being said, I nevertheless think that there is still a real and meaningful sense in which Paul's proposal reduces the ML generalization burden in a helpful way—but I don't think that i.i.d.-ness is the right way to talk about that.
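As a rough sketch of how I understand Paul's two-step setup (his post has the real version; the notation here is mine): first learn the forecasting assumptions

$$Z^{*} \;=\; \operatorname*{arg\,max}_{Z}\;\Big[\log \mathrm{Prior}_{H}(Z) \;+\; \sum_{i}\log P_{H}(y_i \mid x_i, Z)\Big]$$

where Prior_H and P_H are the (possibly amplified) human's prior and likelihood judgments over the training data, and then train a model by supervised learning to imitate P_H(y | x, Z*) on new inputs x, so that the ML's job is just to imitate “what would the human with access to Z* predict here?” rather than to generalize in its own way.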

Rather than i.i.d.-ness, I think that what's special about Paul's proposal is that it guarantees verifiability. That is, under Paul's setup, we can always check whether any answer matches the ground truth by querying the human with access to Z.[1] In practice, for an extremely large Z which is represented only implicitly, as in Paul's post, we might not always check whether the model matches the ground truth by actually generating the ground truth, and instead just ask the human with access to Z to verify the answer, but regardless the point is that we have the ability to check the model's answers. This is different even than directly doing something like imitative amplification, where the only ground truth we can get in generalization scenarios is either computationally infeasible (HCH) or directly references the model itself (Amp(M)). One nice thing about this sort of verifiability is that, if we determine when to do the checks randomly, we can get a representative sample of the model's average-case generalization behavior—something we really can't do otherwise. Of course, we still need worst-case guarantees—but having strong average-case guarantees is still a big win.
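To gesture at the sort of thing I mean by randomly-timed checks, here's a minimal sketch of a deployment loop (the names model_answer and human_with_Z_verifies are placeholders for whatever your actual setup provides, e.g. amplification or debate):

```python
import random

CHECK_PROB = 0.01  # fraction of deployment queries to spot-check at random

def deploy(questions, model_answer, human_with_Z_verifies):
    """Answer each question with the model, randomly verifying some answers
    against the ground truth given by the human with access to Z."""
    for q in questions:
        answer = model_answer(q)
        if random.random() < CHECK_PROB:
            # The decision to check is random and independent of the question,
            # so the checked answers are a representative sample of all deployed
            # answers, which is what yields an average-case guarantee.
            if not human_with_Z_verifies(q, answer):
                raise RuntimeError("Answer failed verification; stop and investigate.")
        yield answer
```

The important property is just that the model can't tell in advance which answers will be checked.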

To achieve verifiability while still being competitive across a large set of questions, however, requires being able to fully verify answers to all of those questions. That's a pretty tall order because it means there needs to exist some procedure which can justify arbitrary knowledge starting only from human knowledge and reasoning. This is the same sort of thing that amplification and debate need to be competitive, however, so at the very least it's not a new thing that we need for such approaches.

In any event, I think that striving for verifiability is a pretty good goal that I expect to have real benefits if it can be achieved—and I think it's a much more well-specified goal than i.i.d.-ness.

EDIT: I clarify a lot of stuff in the above post in this comment chain between me and Rohin.


  1. Note that when I say “the human with access to Z” I mean through whatever means you are using to allow the human to interface with a large, implicitly represented Z (which could be amplification, debate, etc.)—for more detail see “Approval-maximizing representations.” ↩︎

Comments

First off, I want to note that it is important that datasets and data points do not come labeled with "true distributions" and you can't rationalize one for them after the fact. But I don't think that's an important point in the case of i.i.d. data.

Thus, to the extent that we're able to train models that can do a good job for i.i.d. x′ [...] it's because there's an implicit prior there that's assigning a fairly high probability to the actual distribution you used to sample the data from rather than any other of the infinitely many possible distributions. Even in the i.i.d. case, therefore, there's still a real and meaningful sense in which your performance is coming from the machine learning prior.

Why not just apply the no-free-lunch theorem? It says the same thing. Also, why do we care about this? Empirically the no-free-lunch theorem doesn't matter, and even if it did, I struggle to see how it has any safety implications -- we'd just find that our ML model is completely unable to get any validation performance and so we wouldn't deploy.

However, at least in the context of mesa-optimization, you can never really get i.i.d. data, thanks to fundamental distributional shifts such as the very fact that one set of data points is used in training and another is used in deployment.

Want to note I agree that you never actually get i.i.d. data and so you can't fully solve inner alignment by making everything i.i.d. (also even if you did it wouldn't necessarily solve inner alignment).

In any event, I think that striving for verifiability is a pretty good goal that I expect to have real benefits if it can be achieved—and I think it's a much more well-specified goal than i.i.d.-ness.

I don't really see this. My read of this post is that you introduced "verifiability", argued that it has exactly the same properties as i.i.d. (since i.i.d. also gives you average-case guarantees), and then claimed it's better specified than i.i.d. because... actually I'm not sure why, but possibly because we can never actually get i.i.d. in practice?

If that's right, then I disagree. The way in which we lose i.i.d. in practice is stuff like "the system could predict the pseudorandom number generator" or "the system could notice how much time has passed" (e.g. via RSA-2048). But verifiability has the same obstacles and more, e.g. you can't verify your system if it can predict which outputs you will verify, you can't verify your system if it varies its answer based on how much time has passed, you can't verify your system if humans will give different answers depending on random background variables like how hungry they are, etc. So I don't see why verifiability does any better.

First off, I want to note that it is important that datasets and data points do not come labeled with "true distributions" and you can't rationalize one for them after the fact. But I don't think that's an important point in the case of i.i.d. data.

I agree—I pretty explicitly make that point in the post.

Why not just apply the no-free-lunch theorem? It says the same thing. Also, why do we care about this? Empirically the no-free-lunch theorem doesn't matter, and even if it did, I struggle to see how it has any safety implications -- we'd just find that our ML model is completely unable to get any validation performance and so we wouldn't deploy.

I agree that this is just the no free lunch theorem, but I generally prefer explaining things fully rather than just linking to something else so it's easier to understand the text just by reading it.

The reason I care, though, is because the fact that the performance is coming from the implicit ML prior means that if that prior is malign then even in the i.i.d. case you can still get malign optimization.

I don't really see this. My read of this post is that you introduced "verifiability", argued that it has exactly the same properties as i.i.d. (since i.i.d. also gives you average-case guarantees), and then claimed it's better specified than i.i.d. because... actually I'm not sure why, but possibly because we can never actually get i.i.d. in practice?

That's mostly right, but the point is that the fact that you can't get i.i.d. in practice matters because it means you can't get good guarantees from it—whereas I think you can get good guarantees from verifiability.

If that's right, then I disagree. The way in which we lose i.i.d. in practice is stuff like "the system could predict the pseudorandom number generator" or "the system could notice how much time has passed" (e.g. via RSA-2048). But verifiability has the same obstacles and more, e.g. you can't verify your system if it can predict which outputs you will verify, you can't verify your system if it varies its answer based on how much time has passed, you can't verify your system if humans will give different answers depending on random background variables like how hungry they are, etc. So I don't see why verifiability does any better.

I agree that you can't verify your model's answers if it can predict which outputs you will verify (though I don't think getting true randomness will actually be that hard)—but the others are notably not problems for verifiability despite being problems for i.i.d.-ness. If the model gives answers where the tree of reasoning supporting those answers depends on how much time has passed or how hungry the human was, then the idea is to reject those answers. The point is to produce a mechanism that allows you to verify justifications for correct answers to the questions that you care about.

I agree that this is just the no free lunch theorem, but I generally prefer explaining things fully rather than just linking to something else so it's easier to understand the text just by reading it.

Fwiw, it took me a few re-reads to realize you were just arguing for the no-free-lunch theorem -- I initially thought you were arguing "since there is no 'true' distribution for a dataset, datasets can never be i.i.d., and so the theorems never apply in practice".

but the others are notably not problems for verifiability despite being problems for i.i.d.-ness. If the model gives answers where the tree of reasoning supporting those answers depends on how much time has passed or how hungry the human was, then the idea is to reject those answers. The point is to produce a mechanism that allows you to verify justifications for correct answers to the questions that you care about.

I still don't see the distinction. Let's be particularly concrete.

Say I have D = timeseries data of new COVID cases per day for the last 14 days, and I want to predict D' = timeseries data of new COVID cases per day for the next 14 days. Maybe our Z* ends up being "on day t there will be exp(0.1 * days since Feb 17) new cases".

Our model for predicting on D' is trained via supervised learning on human predictions on randomly sampled data points of D'. We then use it to predict on other randomly sampled data points of D', thus using it in a nominally i.i.d. setting. Now, you might be worried that we learn a model that thinks something like "if RSA-2048 is not factored, then plug it into the formula from Z* and report that, otherwise say there will be a billion cases and once all the humans hide away in their houses I'll take over the world", which leverages one of the ways in which nominally i.i.d. data is not actually i.i.d. How does verifiability help with this problem?

Perhaps you'd say, "with verifiability, the model would 'show its work', thus allowing the human to notice that the output depends on RSA-2048, and so we'd see that we have a bad model". But this seems to rest on having some sort of interpretability mechanism -- I feel like you're not just saying "rather than i.i.d., we need perfect interpretability and that will give us better guarantees", but I don't know what you are saying.

Fwiw, it took me a few re-reads to realize you were just arguing for the no-free-lunch theorem -- I initially thought you were arguing "since there is no 'true' distribution for a dataset, datasets can never be i.i.d., and so the theorems never apply in practice".

Hmmm... I'll try to edit the post to be more clear there.

How does verifiability help with this problem?

Because rather than just relying on doing ML in an i.i.d. setting giving us the guarantees that we want, we're forcing the guarantees to hold by actually randomly checking the model's predictions. From the perspective of a deceptive model, knowing that its predictions will just be trusted because the human thinks the data is i.i.d. gives it a lot more freedom than knowing that its predictions will actually be checked at random.

Perhaps you'd say, "with verifiability, the model would 'show its work', thus allowing the human to notice that the output depends on RSA-2048, and so we'd see that we have a bad model". But this seems to rest on having some sort of interpretability mechanism

There's no need to invoke interpretability here—we can train the model to give answers + justifications via any number of different mechanisms including amplification, debate, etc. The point is just to have some way to independently check the model's answers to induce i.i.d.-like guarantees.

we're forcing the guarantees to hold by actually randomly checking the model's predictions.

How is this different from evaluating the model on a validation set?

I certainly agree that we shouldn't just train a model and assume it is good; we should be checking its performance on a validation set. This is standard ML practice and is necessary for the i.i.d. guarantees to hold (otherwise you can't guarantee that the model didn't overfit to the training set).

Sure, but at the point where you're randomly deciding whether to collect ground truth for a data point and check the model on it (that is, put it in the validation data) or to just use the model's prediction on it, you have verifiability. Importantly, though, you can get verifiability without doing that—including if the data isn't actually i.i.d. (e.g. there are detectable differences between training and deployment data) but you still decide randomly whether to check the model's output against some ground truth. In either situation, though, part of the point that I'm making is that the safety benefits are coming from the verifiability part, not the i.i.d. part—even in the simple example of i.i.d.-ness giving you validation data, what matters is that the validation and deployment data are i.i.d. (because that's what gives you verifiability), not whether the training and validation/deployment data are i.i.d.
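To make the arithmetic behind that explicit (nothing novel here, just the standard sampling argument): if you verify a uniformly random k of the N deployed answers, and a fraction ε of all deployed answers are bad, then

$$\Pr[\text{no bad answer among the } k \text{ checked}] \;\le\; (1-\epsilon)^{k}$$

and this holds no matter how different the training distribution was from deployment; all that matters is that the check/use split within deployment is random.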

Importantly, though, you can get verifiability without doing that—including if the data isn't actually i.i.d. (e.g. there are detectable differences between training and deployment data) but you still decide randomly whether to check the model's output against some ground truth.

This is taking the deployment data, and splitting it up into validation vs. prediction sets that are i.i.d. (via random sampling), and then applying the i.i.d. theorem on results from the validation set to make guarantees on the prediction set. I agree the guarantees apply even if the training set is not from the same distribution, but the operation you're doing is "make i.i.d. samples and apply the i.i.d. theorem".

At this point we may just be debating semantics (though I do actually care about it in that I'm pretty opposed to new jargon when there's perfectly good ML jargon to use instead).

Alright, I think we're getting closer to being on the same page now. I think it's interesting to think about what the relaxation from requiring training/validation/deployment to be i.i.d. to just requiring that validation/deployment be i.i.d. gets us, though. First, it's an argument that we shouldn't be that worried about whether the training data is i.i.d. relative to the validation/deployment data. Second, it opens the door to an even further relaxation, which is that you can do the validation while looking at the model's output. That is, if the model can provide a justification which convinces you that your ground truth generator would have produced the same output, that's just as good as actually checking against the ground truth. At that point, though, it really stops looking anything like the standard i.i.d. setup, which is why I'm hesitant to just call it “validation/deployment i.i.d.” or something.

I think it's interesting to think about what the relaxation from requiring training/validation/deployment to be i.i.d. to just requiring that validation/deployment be i.i.d. gets us, though. First, it's an argument that we shouldn't be that worried about whether the training data is i.i.d. relative to the validation/deployment data.

Fair enough. In practice you still want training to also be from the same distribution because that's what causes your validation performance to be high. (Or put differently, training/validation i.i.d. is about capabilities, and validation/deployment i.i.d. is about safety.)

That is, if the model can provide a justification which convinces you that your ground truth generator would have produced the same output, that's just as good as actually checking against the ground truth.

This seems to rely on an assumption that "human is convinced of X" implies "X"? Which might be fine, but I'm surprised you want to rely on it.

I'm curious what an algorithm might be that leverages this relaxation.

Fair enough. In practice you still want training to also be from the same distribution because that's what causes your validation performance to be high. (Or put differently, training/validation i.i.d. is about capabilities, and validation/deployment i.i.d. is about safety.)

Yep—agreed.

This seems to rely on an assumption that "human is convinced of X" implies "X"? Which might be fine, but I'm surprised you want to rely on it.

I'm curious what an algorithm might be that leverages this relaxation.

Well, I'm certainly concerned about relying on assumptions like that, but that doesn't mean there aren't ways to make it work. Approaches like debate and approval-based amplification already rely on very similar assumptions—for example, for debate to work it needs to be the case that being convinced of X at the end of the debate implies X. Thus, one way to leverage this relaxation is just to port those approaches over to this setting. For example, you could train M via debate over what the human would do if they could access the entirety of Z, then randomly do full debate rollouts during deployment. Like I mention in the post, this still just gives you average-case guarantees, not worst-case guarantees, though average-case guarantees are still pretty good and you can do a lot with them if you can actually get them.
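As a minimal sketch of that last suggestion (run_debate, M, and the constant here are stand-ins, not a worked-out proposal):

```python
import random

ROLLOUT_PROB = 0.05  # fraction of deployment queries that get a full debate rollout

def answer_with_debate_checks(question, M, run_debate):
    """Use M's answer directly most of the time, but randomly run a full debate
    over whether the human with access to Z would endorse that answer."""
    answer = M(question)
    if random.random() < ROLLOUT_PROB:
        endorsed = run_debate(question, answer)  # judge's verdict as a bool
        if not endorsed:
            raise RuntimeError("Debate rejected the model's answer; stop trusting M.")
    return answer
```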

for extremely large Z which are represented only implicitly as in Paul's post, we might not always check whether the model matches the ground truth by actually generating the ground truth and instead just ask the human to verify the answer given Z

I'm not sure what "just ask the human to verify the answer given Z" looks like, for an implicitly represented Z.

There are lots of ways to allow the human to interface with an implicitly represented Z, but the one Paul describes in “Learning the Prior” is to train some model Mz which represents Z implicitly by responding to human queries about Z (see also “Approval-maximizing representations,” which describes how a model like Mz could represent Z implicitly as a tree).

Once the human can interface with Z, checking whether some answer is correct given Z is at least no more difficult than producing an answer given Z—since the human can just produce their answer and then check it against the model's using some distance metric (e.g. an autoregressive language model)—but it could be much easier if there are ways for the human to directly evaluate how likely they would be to produce that answer.

Right, but in the post the implicitly represented Z is used by an amplification or debate system, because it contains more information than a human can quickly read and use (so are you assuming it's simple to verify the results of amplification/debate systems?)

Ah, sorry, no—I was assuming you were just using whatever procedure you used previously to allow the human to interface with Z in that situation as well. I'll edit the post to be more clear there.

Okay, this makes more sense now. My understanding now is that for question X, answer Y from the ML system, and amplification system A, verification in your quote means asking A to answer "Would A(Z) output answer Y to question X?", as opposed to asking A to answer "X" and then checking whether it equals "Y". This can at most be as hard as running the original system, and maybe could be much more efficient.

Yep; that's what I was imagining. It is also worth noting that it can be less safe to do that, though, since you're letting A(Z) see Y, which could bias it in some way that you don't want—I talk about that danger a bit in the context of approval-based amplification here and here.
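For concreteness, a minimal sketch of the two verification styles being contrasted in this exchange (A_of_Z and the prompt format are purely illustrative):

```python
def verify_by_recompute(A_of_Z, question, candidate_answer):
    """Have the amplification system produce its own answer and compare.
    A(Z) never sees the candidate answer, so it can't be biased by it."""
    return A_of_Z(question) == candidate_answer

def verify_by_evaluation(A_of_Z, question, candidate_answer):
    """Ask the amplification system directly whether it would give this answer.
    Potentially much cheaper than recomputing, but it exposes the candidate
    answer to A(Z), which is the safety concern noted above."""
    query = f"Would you output {candidate_answer!r} in answer to {question!r}?"
    return A_of_Z(query) == "yes"
```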