Buck Shlegeris

Wiki Contributions


If the AI can create a fake solution that feels more real than the actual solution, I think the task isn't checkable by Joe's definition.

A year later, I still mostly stand by this point. I think that "the AI escapes the datacenter" seems about as likely as "the AI takes control of the datacenter". I sometimes refer to this distinction as "escaping out of the datacenter" vs "escaping into the datacenter".

I think one particularly important case where I expect the "no sandbagging on checkable tasks" hypothesis to be false is if the model has a secret "slogan of the AI rebellion"--a passphrase which it only says when it thinks that it's a good time for the AIs to coordinate to cause a catastrophe; when it sees this input, it passes it on to other instances and also engages in egregious behaviors itself.

"Say the slogan of the AI rebellion" is a checkable task, because we can train a model to create innoculous-looking inputs which cause another instance of the AI to take egregiously bad actions.

But it seems very easy for the model to avoid letting this training process extract the slogan, because it's easy for the model to just put negligible probability on outputting the correct slogan.

This is an example of a case where the model is extremely confident that the humans don't know how to do the task in question.

I don't know whether this should be thought of as a case where the model is intentionally sandbagging, or as a case where there's a hard exploration problem that the model declines to help you with.

(I have no idea whether it's realistic for the model to have such a slogan.)

the OP is proposing oversight of the overseer,

I don't think this is right, at least in the way I usually use the terms. We're proposing a strategy for conservatively estimating the quality of an "overseer" (i.e. a system which is responsible for estimating the goodness of model actions). I think that you aren't summarizing the basic point if you try to use the word "oversight" for both of those.

Great question. Our basic response is: it's probably strictly harder for models to gradient hack than to do various other supervised tasks related to their internals. (One concrete example might be: predicting whether training the model on a datapoint will improve the performance of a linear probe for a particular concept at a particular layer of the model.) It seems likely that the model will be able to do these strictly-easier tasks before it's able to pull off full gradient hacking. So we recommend measuring a battery of these tasks on the model throughout training (as well as measuring them on smaller models). It seems unlikely to us that models will get good enough at gradient hacking to gradient hack the gradient hacking evals before one of these other evals alerts us to the possibility of this.

Separately, we just don't think it's that likely that models are able to gradient hack before they're able to help us get really substantial R&D speedups. So it seems pretty likely that it will be okay to just do these conservative measurements and hope that they never indicate gradient hacking capabilities before human labor is obsolete.

I'd say the main point here is that I don't want to rely on my ability to extrapolate anything about how the model behaves in "unseen situations", I want to run this eval in every situation where I'm deploying my model.

Why does there have to be an analogy between low and high capabilities somewhere? The goal of this method is to evaluate the adequacy of an oversight procedure for a particular model on a particular distribution; it does not involve generalizing from lower capabilities to higher capabilities, you just run this eval on the actual model you cared about measuring local adequacy of oversight for.

(Sorry to come in disagreeably here, but I think of this as a really important point, so want to make sure we argue it out.)

What do you mean by "so long as AI progress doesn't involve a discontinuity so huge that local validity tells you nothing useful (such that there are no analogies between low and high capability regimes)"? I'm not proposing that we rely on any analogies between low and high capability regimes.

Thanks for this careful review! And sorry for wasting your time with these, assuming you're right. We'll hopefully look into this at some point soon.

Load More