ML Alignment Theory Scholars Program Winter 2021
Risks from Learned Optimization

Wiki Contributions


Yeah, I agree with that. I think path dependence will likely vary across training processes and that we should in fact view that as an important intervention point.

What notion of "deceptive alignment" is narrower than that?

Any definition that makes mention of the specific structure/internals of the model.

To be clear, I agree that unknown unknowns are in some sense the biggest problem in AI safety—as I talk about in the very first paragraph here.

However, I nevertheless think that focusing on deceptive alignment specifically makes a lot of sense. If we define deceptive alignment relatively broadly as any situation where “the reason the model looks aligned is because it is actively trying to game the training signal for the purpose of achieving some ulterior goal” (where training signal doesn't necessarily mean the literal loss, just anything we're trying to get it to do), then I think most (though not all) AI existential risk scenarios that aren't solved by iterative design/standard safety engineering/etc. include that as a component. Certainly, I expect that all of my guesses for exactly how deceptive alignment might be developed, what it might look like internally, etc. are likely to be wrong—and this is one of the places where I think unknowns really become a problem—but I still expect that if we're capable of looking back and judging “was deceptive alignment part of the problem here” in situations where things go badly we'll end up concluding yes (I'd probably put ~60% on that).

Furthermore, I think there's a lot of value in taking the most concerning concrete problems that we can yet come up with and tackling them directly. Having as concrete as possible a failure mode to work with is, in my opinion, a really important part of being able to do good research—and for obvious reasons I think it's most valuable to start with the most concerning concrete failure modes we're aware of. It's extremely hard to do good work on unknown unknowns directly—and additionally I think our modal guess for what such unknown unknowns might look like is some variation of the sorts of problems that already seem the most damning. Even for transparency and interpretability, perhaps the most obvious “work on the unknown unknowns directly” sort of research, I think it's pretty important to have some idea of what we might want to use those sorts of tools for when developing them, and working on concrete failure modes is extremely important to that.

(Moderation note: added to the Alignment Forum from LessWrong.)

Yeah, that's a good point—I agree that the thing I said was a bit too strong. I do think there's a sense in which the models you're describing seem pretty deceptive, though, and quite dangerous, given that they have a goal that they only pursue when they think nobody is looking. In fact, the sort of model that you're describing is exactly the sort of model I would expect a traditional deceptive model to want to gradient hack itself into, since it has the same behavior but is harder to detect.

To be clear, I think I basically agree with everything in the comment chain above. Nevertheless, I would argue that these sorts of experiments are worth running anyway, for the sorts of reasons that I outline here.

The most common, these days, is some variant of “train an AI to help with aligning AI”. Sometimes it’s “train an AI to interpret the internals of another AI”, sometimes it’s “train an AI to point out problems in another AI’s plan”, sometimes it’s “train an AI to help you design aligned AI”, etc. I would guess about 75% of newcomers from ML suggest some such variant as their first idea.

I don't think these are crazy or bad ideas at all—I'd be happy to steelman them with you at some point if you want. Certainly, we don't know how to make any of them work right now, but I think they are all reasonable directions to go down if one wants to work on the various open problems related to them. The problem—and this is what I would say to somebody if they came to me with these ideas—is that they're not so much “ideas for how to solve alignment” so much as “entire research topics unto themselves.”

Even agreeing that no additional complexity is required to rederive that it should try to be deceptive (assuming it has situational awareness of the training process and long-term goals which aren't aligned with ours), to be deceptive successfully, it then needs to rederive what our goals are, so that it can pursue them instrumentally. I'm arguing that the ability to do this in the AI would require additional complexity compared to a AI that doesn't need to rederive the content of this goal (that is, our goal) at every decision.

My argument is that the only complexity necessary to figure out what our goals are is the knowledge of what our goals are. And that knowledge is an additional complexity that should be present in all models, not just deceptive models, since all models should need to understand those sorts of facts about the world.

I guess I'm arguing that the additional complexity in the deceptive model that allows it to rederive our goals at every forward pass compensates for the additional complexity in the corrigible model that has the our goals hard-coded.

No additional complexity is required to rederive that it should try to be deceptive—it's a convergent instrumental goal that falls out of any generic optimization process that essentially all models should have to have anyway to be able to get good performance on tasks that require optimization.

It needs no such pointer in the weights, only in the activations. The deceptive model has to build such a pointer at runtime, but it doesn't need to have it hardcoded, whereas the corrigible model needs it to be hardcoded.

Load More