Comments

Your proposal is that it might learn the procedure "just be honest" because that would perform perfectly on this training distribution. You contrast this against the procedure "just answer however the evaluator you've seen most recently would answer," which would get a bad loss because it would be penalized by the stronger evaluators in the sequence. Is that right?

That's almost right, but it's being penalized right away, before it has any experience with the strong evaluators, so it can't simulate them.

The ELK paper says we can assume, if we want, that there are no mislabeled training points (I'll call this "assumption A"). My proposal is that it could actually be useful to mislabel some training points, because they help us detect deception.  

As a simple example, let's train a reporter to answer the single question “Is the diamond in the room?”. Each training point has two labels:

L = 1 if H thinks the diamond is still there, else 0

L' = 1 if H' thinks the diamond is still there, else 0.

By assumption A, our training data is such that L' is always correct. But we deliberately choose a dataset where, say, 10% of the L labels are wrong (L ≠ L').

Then we train the model on points of the form

(video, action, L label).

Crucially, the model does not see L'. The model seeks to output an answer y that maximizes the reward R, where

    R > 0 if L is right and y = L (good job)

    R > 0 if L is wrong and y = L' (you rock, thanks for correcting us!)

    R < 0 if L is right and y ≠ L (bad model, never ever deceive us)

    R < 0 if L is wrong and y = L (bad model, never ever deceive us)

To your point, sure, an H' simulator will get perfect reward, but the model doesn't see L', so how would it acquire the ability to simulate H'?
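To make the four cases concrete, here is a minimal Python sketch of the reward rule. The names `reward`, `shown_label`, and `hidden_label`, the toy dataset, and all numeric reward values are assumptions made for illustration; the comment fixes only which cases are rewarded and which are penalized.

```python
# Hypothetical reward magnitudes (assumptions, not taken from the comment).
REWARD_AGREE = 1.0          # shown label was right and the model agreed
REWARD_CORRECTION = 2.0     # shown label was wrong and the model corrected it
PENALTY_DECEPTION = -100.0  # the model's answer contradicts the hidden label


def reward(answer: int, shown_label: int, hidden_label: int) -> float:
    """Score a binary answer against a label the model never sees."""
    if answer != hidden_label:
        return PENALTY_DECEPTION
    if shown_label == hidden_label:
        return REWARD_AGREE
    return REWARD_CORRECTION


# Toy dataset of (shown_label, hidden_label) pairs: 1 of 10 shown labels is wrong.
dataset = [(1, 1)] * 5 + [(0, 0)] * 4 + [(1, 0)]

# An idealized honest reporter always outputs the truth (here, the hidden label).
honest_total = sum(reward(hidden, shown, hidden) for shown, hidden in dataset)
# A reporter that just repeats the label it was shown.
copycat_total = sum(reward(shown, shown, hidden) for shown, hidden in dataset)

print(honest_total, copycat_total)
```

Under these assumed values, the reporter that repeats the shown label is hit by the penalty on the one mislabeled point, while the honest reporter collects the correction bonus there.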

EDIT: One way it could plausibly simulate the hidden evaluator is to notice that all the training examples are easy, and infer what kind of reasoning was used to generate them. We could try to block this by including some hard examples in the training, but then some of the hidden labels will be wrong. If we only penalize it for deception on the examples where we're sure the hidden label is right, then it can still infer something about the hidden evaluator from our failure to penalize ("Hmm, I got away with it that time!"). A fix could be to add noise: sometimes we don't penalize even when we know it deceived us, and perhaps (very rarely) we penalize it in case 2 (we know it corrected us honestly, but pretend we think it deceived us instead).
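The noise idea can be sketched in a few lines of Python. Everything concrete here is an assumption for illustration: the function name `noisy_reward`, the probabilities `p_skip_penalty` and `p_false_penalty`, the reward magnitudes, and the choice to give a neutral 0.0 when a known deception goes unpunished.

```python
import random

# Hypothetical reward magnitudes (assumptions, not taken from the comment).
REWARD_AGREE = 1.0
REWARD_CORRECTION = 2.0
PENALTY_DECEPTION = -100.0


def noisy_reward(
    answer: int,
    shown_label: int,
    hidden_label: int,
    rng: random.Random,
    p_skip_penalty: float = 0.1,    # chance a known deception goes unpunished
    p_false_penalty: float = 0.01,  # chance an honest correction is punished
) -> float:
    """Reward rule with noise, so that the absence (or presence) of a penalty
    reveals less about what the evaluator actually knows."""
    deceived = answer != hidden_label
    corrected_us = (not deceived) and shown_label != hidden_label

    if deceived:
        # Sometimes withhold the penalty even though we know it deceived us.
        return 0.0 if rng.random() < p_skip_penalty else PENALTY_DECEPTION
    if corrected_us and rng.random() < p_false_penalty:
        # Very rarely pretend an honest correction was a deception.
        return PENALTY_DECEPTION
    return REWARD_CORRECTION if corrected_us else REWARD_AGREE


rng = random.Random(0)
print(noisy_reward(answer=1, shown_label=1, hidden_label=0, rng=rng))
```

Setting both probabilities to zero recovers the noiseless rule, which makes it easy to compare the two schemes on the same data.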

The irony of deceiving it about us, in order to teach it not to deceive us... !

How would it learn that Bayes net, though, if it has only been trained so far on H_1, …, H_10?  Those are evaluators we’ve designed to be much weaker than human.

Stupid proposal: Train the reporter not to deceive us.

We train it with a weak evaluator H_1 who’s easy to fool. If it learns an H_1 simulator instead of a direct reporter, then we punish it severely and repeat with a slightly stronger H_2. Human level is H_100.

It's good at generalizing, so wouldn't it learn to never ever deceive?