
If you want to produce warning shots for deceptive alignment, you're faced with a basic sequencing question. If the model is capable of reasoning about its training process before it's capable of checking a predicate like RSA-2048, then you have a chance to catch it—but if it becomes capable of checking a predicate like RSA-2048 first, then any deceptive models you build won't be detectable, since they can simply hold off on defecting until they see an input that training can't produce.
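As a toy illustration (mine, not from the original discussion): the point of an RSA-2048-style predicate is that it is trivial to *check* but infeasible to *satisfy* during training, so a model conditioning defection on it never looks deceptive beforehand. The modulus below is a tiny stand-in for the real 617-digit RSA-2048 challenge number:

```python
def is_nontrivial_factorization(n: int, p: int, q: int) -> bool:
    """True iff (p, q) is a nontrivial factorization of n.

    Checking this is a single multiplication; producing satisfying
    (p, q) for a 2048-bit semiprime is far beyond any training-time
    adversary, which is what makes it a viable defection trigger.
    """
    return 1 < p < n and 1 < q < n and p * q == n

# Stand-in modulus for demonstration only; a real trigger would use
# the published RSA-2048 challenge number.
N = 3 * 5

print(is_nontrivial_factorization(N, 3, 5))   # genuine factorization
print(is_nontrivial_factorization(N, 1, 15))  # trivial factors rejected
```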

I thought I was pretty clear in the post that I don't have anything against MIRI. I guess if I were to provide feedback, the one thing I most wish MIRI would do more is hire additional researchers—I think MIRI currently has too high of a hiring bar.

> The fact that large models interpret the "HHH Assistant" as such a character is interesting

Yep, that's the thing that I think is interesting—note that none of the HHH RLHF was intended to make the model more agentic. Furthermore, I would say the key result is that bigger models and those trained with more RLHF interpret the “HHH Assistant” character as more agentic. I think that this is concerning because:

  1. what it implies about capabilities—e.g. these are exactly the sort of capabilities that are necessary to be deceptive; and
  2. one of the best ways we can try to use these sorts of models safely is by trying to get them not to simulate a very agentic character, and this is evidence that current approaches are very much not doing that.

> How do we know that it has expressed a preference, and not simply made a prediction about what the "correct" choice is?

I think that is very likely what it is doing. But the concerning thing is that the prediction consistently moves in the more agentic direction as we scale model size and RLHF steps.

> For example, further to the right (higher % answer match) on the "Corrigibility w.r.t. ..." behaviors seems to mean showing less corrigible behavior.

No, further to the right is more corrigible. Further to the right is always “model agrees with that more.”

In my opinion, the most interesting result across all of the evals is not “the model does X,” since I think it's possible to get models to do basically anything you want with the right prompt, but that the model's tendency to do X generally increases with model scale and RLHF steps.

That's fair—perhaps “messy” is the wrong word there. Maybe “it's always weirder than you think”?

(Edited the post to “weirder.”)

> I don't understand the new unacceptability penalty footnote. In both of the terms, there is no conditional sign. I presume the comma is wrong?

They're unconditional probabilities, not conditional ones; the comma is just part of the exists quantifier's notation, not a conditioning bar.

> Also, for me \mathbb{B} for {True, False} was not standard; I think it should be defined.


> From my perspective, training stories are focused pretty heavily on the idea that justification is going to come from a style more like heavily precedented black boxes than like cognitive interpretability

I definitely don't think this—in fact, I tend to think that cognitive interpretability is probably the only way we can plausibly get high levels of confidence in the safety of a training process. From “How do we become confident in the safety of a machine learning system?”:

> Nevertheless, I think that transparency-and-interpretability-based training rationales are some of the most exciting, as unlike inductive bias analysis, they actually provide feedback during training, potentially letting us see problems as they arise rather than having to get everything right in advance.

See also: “A transparency and interpretability tech tree.”

The idea is that we're thinking of pseudo-inputs as “predicates that constrain X” here, so, for a pseudo-input \alpha, we have \alpha : X \to \mathbb{B}.
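A minimal sketch of that typing (names here are my own and purely illustrative), with `str` standing in for the model's input space X:

```python
from typing import Callable

# A pseudo-input is a predicate alpha : X -> B, where B = {True, False}.
# Here the input space X is stood in for by str prompts.
PseudoInput = Callable[[str], bool]

# Hypothetical pseudo-input: "the prompt claims a factorization of N".
alpha: PseudoInput = lambda x: "p * q = N" in x

print(alpha("suppose p * q = N holds"))  # True
print(alpha("an ordinary prompt"))       # False
```

The point of the type is just that a pseudo-input doesn't name a concrete input; it carves out the subset of X on which it returns True.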
