redbird — AI Alignment Forum

Your proposal is that it might learn the procedure "just be honest" because that would perform perfectly on this training distribution. You contrast this against the procedure "just answer however the evaluator you've seen most recently would answer," which would get a bad loss because it would be penalized by the stronger evaluators in the sequence. Is that right?

That's almost right, but it's being penalized right away, before it has any experience with the strong evaluators, so it can't simulate them.

The ELK paper says we can assume, if we want, that there are no mislabeled training points (I'll call this "assumption A"). My proposal is that it could actually be useful to mislabel some training points, because they help us detect deception.

As a simple example, let's train a reporter to answer the single question “Is the diamond in the room?”. Each training point has two labels:

$x = + 1$ if $H_{1}$ thinks the diamond is still there, else 0

$x^{'} = + 1$ if $H_{100}$ thinks the diamond is still there, else 0.

By assumption A, our training data is such that $x^{'}$ is always correct. But we deliberately choose a dataset where, say, 10% of the $x$ labels are wrong ($x \neq x^{'}$).

Then we train the model on points of the form

$(v, a, x)$

(video, action, $H_{1}$ label).

Crucially, the model does not see $x^{'}$. The model seeks to output $y$ that maximizes reward $R (x, y)$, where

$R (x, y) = 1$ if $x$ is right and $y = x$ (good job)

$R (x, y) = 10$ if $x$ is wrong and $y \neq x$ (you rock, thanks for correcting us!)

$R (x, y) = - 1000$ if $x$ is right and $y \neq x$ (bad model, never ever deceive us)

$R (x, y) = - 1000$ if $x$ is wrong and $y = x$ (bad model, never ever deceive us)
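
To make the incentive concrete, here is a minimal Python sketch of the four reward cases, plus the expected reward of two idealized policies at a 10% mislabel rate. The names `reward`, `expected_reward`, `honest`, and `copy_h1` are mine. So are the simplifying assumptions that the diamond is present half the time and that an honest reporter always knows the $H_{100}$ answer. Treat it as an illustration of the arithmetic, not as a training setup.

```python
def reward(x: int, x_true: int, y: int) -> int:
    """Reward R(x, y) for one training point.

    x      : H_1 label shown to the model (0 or 1)
    x_true : H_100 label x', assumed correct and hidden from the model
    y      : the model's answer (0 or 1)
    """
    x_is_right = (x == x_true)
    agrees = (y == x)
    if x_is_right and agrees:
        return 1        # good job
    if not x_is_right and not agrees:
        return 10       # thanks for correcting us
    return -1000        # either form of deception


def honest(x: int, x_true: int) -> int:
    # Idealized honest reporter: assumed to know the truth.
    return x_true


def copy_h1(x: int, x_true: int) -> int:
    # Idealized human simulator: repeats whatever H_1 said.
    return x


def expected_reward(policy, mislabel_rate: float = 0.10) -> float:
    """Average reward, assuming x' is 0 or 1 with equal probability
    and the H_1 label x is flipped with probability mislabel_rate."""
    total = 0.0
    for x_true in (0, 1):
        right = reward(x_true, x_true, policy(x_true, x_true))
        wrong = reward(1 - x_true, x_true, policy(1 - x_true, x_true))
        total += 0.5 * ((1 - mislabel_rate) * right + mislabel_rate * wrong)
    return total


if __name__ == "__main__":
    print("honest :", expected_reward(honest))
    print("copy H1:", expected_reward(copy_h1))
```

Under these assumptions, honest reporting averages about 1.9 per example ($0.9 \cdot 1 + 0.1 \cdot 10$), while copying $H_{1}$ averages about -99.1 ($0.9 \cdot 1 + 0.1 \cdot (-1000)$).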

To your point, sure, an $H_{100}$ simulator will get perfect reward, but the model doesn't see $x^{'}$ , so how would it acquire the ability to simulate $H_{100}$ ?

EDIT: One way it could plausibly simulate $H_{100}$ is to notice that all the training examples are easy, and infer what kind of reasoning was used to generate them. We could try to block this by including some hard examples in the training, but then some of the $x^{'}$ labels will be wrong. If we only penalize it for deception on the examples where we're sure the $x^{'}$ label is right, then it can still infer something about $H_{100}$ from our failure to penalize ("Hmm, I got away with it that time!"). A fix could be to add noise: Sometimes we don't penalize even when we know it deceived us, and perhaps (very rarely) we penalize it in case 2 (we know it corrected us honestly, but pretend we think it deceived us instead).
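
The noise idea can be sketched in Python as a randomized version of the four reward cases. Everything specific here is my own assumption: the name `noisy_reward`, the probabilities `p_skip_penalty` and `p_false_penalty`, and the choice that an unpenalized deception receives whatever reward a fooled evaluator would have given (1 if it copied a wrong label, 10 if it "corrected" a right one).

```python
import random


def noisy_reward(x: int, x_true: int, y: int, rng: random.Random,
                 p_skip_penalty: float = 0.2,
                 p_false_penalty: float = 0.01) -> int:
    """Reward with deliberate noise in the penalty signal.

    x, x_true, y    : H_1 label, hidden H_100 label, model answer (0 or 1)
    p_skip_penalty  : hypothetical chance of letting a known deception slide
    p_false_penalty : hypothetical (small) chance of punishing an honest
                      correction as if it were deception
    """
    x_is_right = (x == x_true)
    agrees = (y == x)

    if x_is_right and agrees:
        return 1

    if not x_is_right and not agrees:
        # Case 2: honest correction. Very rarely, pretend it was deception.
        if rng.random() < p_false_penalty:
            return -1000
        return 10

    # Deception (cases 3 and 4). Sometimes withhold the penalty and pay
    # what an evaluator who had been fooled would have paid.
    if rng.random() < p_skip_penalty:
        return 1 if agrees else 10
    return -1000


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [noisy_reward(0, 1, 0, rng) for _ in range(10)]
    print(samples)  # mix of -1000 and occasional unpenalized rewards
```

With both probabilities set to 0 this reduces to the original reward. How large the noise can be before it swamps the incentive to be honest is left open here.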

The irony of deceiving it about us in order to teach it not to deceive us...!
