Evan Hubinger


Risks from Learned Optimization

Wiki Contributions


Thoughts on safety in predictive learning

(A possible objection would be: "real-world foresighted planning" isn't a separate thing that adds to model complexity, instead it naturally falls out of other capabilities that are necessary for postdiction like "building predictive models" and "searching over strategies" and whatnot. I think I would disagree with that objection, but I don't have great certainty here.)

Yup, that's basically my objection.

Thoughts on safety in predictive learning

Sure, that's fair. But in the post, you argue that this sort of non-in-universe-processing won't happen because there's no incentive for it:

It seems like there’s no incentive whatsoever for a postdictive learner to have any concept that the data processing steps in the algorithm have any downstream impacts, besides, y’know, processing data within the algorithm. It seems to me like there’s a kind of leap to start taking downstream impacts to be a relevant consideration, and there’s nothing in gradient descent pushing the algorithm to make that leap, and there doesn’t seem to be anything about the structure of the domain or the reasoning it’s likely to be doing that would lead to making that leap, and it doesn’t seem like the kind of thing that would happen by random noise, I think.

However, if there's another “why” for why the model is doing non-in-universe-processing that is incentivized—e.g. simplicity—then I think that makes this argument no longer hold.

Thoughts on safety in predictive learning

I mean, I guess it depends on your definition of “unrelated to any anticipated downstream real-world consequences.” Does the reason “it's the simplest way to solve the problem in the training environment” count as “unrelated” to real-world consequences? My point is that it seems like it should, since it's just about description length, not real-world consequences—but that it could nevertheless yield arbitrarily bad real-world consequences.

Thoughts on safety in predictive learning

Oh, I guess you're saying something different: that even a deceptive mesa-optimizer which is entirely doing within-universe processing is nevertheless scary. So that would by definition be an algorithm with the property "no operation in the algorithm is likelier to happen vs not happen specifically because of anticipated downstream chains of causation that pass through things in the real world".

Yep, that's right.

So I can say categorically: such an algorithm won't hurt anyone (except by freak accident), won't steal processing resources, won't intervene when I go for the off-switch, etc., right?

No, not at all—just because an algorithm wasn't selected based on causing something to happen in the real world doesn't mean it won't in fact try to make things happen in the real world. In particular, the reason that I expect deception in practice is not primarily because it'll actually be selected for, but primarily just because it's simpler, and so it'll be found despite the fact that there wasn't any explicit selection pressure in favor of it. See: “Does SGD Produce Deceptive Alignment?

Answering questions honestly instead of predicting human answers: lots of problems and some solutions

I mostly agree with what Paul said re using various techniques to improve the evaluation of to ensure you can test it on more open-ended questions. That being said, I'm more optimistic that, if you can get the initial training procedure right, you can rely on generalization to fill in the rest. Specifically, I'm imagining a situation where the training dataset is of the narrower form you talk about such that and always agree (as in Step 3 here)—but where the deployment setting wouldn't necessarily have to be of this form, since once you're confident that you've actually learned and not e.g. , you can use it for all sorts of things that wouldn't ever be in that training dataset (the hard part, of course, is ever actually being confident that you did in fact learn the intended model).

(Also, thanks for catching the typo—it should be fixed now.)

Thoughts on safety in predictive learning

By and large, we expect learning algorithms to do (1) things that they’re being optimized to do, and (2) things that are instrumentally useful for what they’re being optimized to do, or tend to be side-effect of what they’re being optimized to do, or otherwise “come along for the ride”. Let’s call those things “incentivized”. Of course, it’s dicey in practice to declare that a learning algorithm categorically won’t ever do something, just because it’s not incentivized. Like, we may be wrong about what’s instrumentally useful, or we may miss part of the space of possible strategies, or maybe it’s a sufficiently simple thing that it can happen by random chance, etc.

In the presence of deceptive alignment, approximately any goal is possible in this setting, not just the nearby instrumental proxies that you might be okay with. Furthermore, deception need not be 4th-wall-breaking, since the effect of deception on helping you do better in the training process entirely factors through the intended output channel. Thus, I would say that within-universe mesa-optimization can be arbitrarily scary if you have no way of ruling out deception.

Discussion: Objective Robustness and Inner Alignment Terminology

(Moderation note: added to the Alignment Forum from LessWrong.)

Empirical Observations of Objective Robustness Failures

(Moderation note: added to the Alignment Forum from LessWrong.)

"Decision Transformer" (Tool AIs are secret Agent AIs)

(Moderation note: added to the Alignment Forum from LessWrong.)

Survey on AI existential risk scenarios

(Moderation note: added to the Alignment Forum from LessWrong.)

Load More