Evan Hubinger

Evan Hubinger (he/him/his) (evanjhub@gmail.com)

I am a research scientist at Anthropic leading work on model organisms of misalignment. My posts and comments are my own and do not represent Anthropic's positions, policies, strategies, or opinions.

Previously: MIRI, OpenAI

See: “Why I'm joining Anthropic”

Selected work:


Conditioning Predictive Models
ML Alignment Theory Scholars Program Winter 2021
Risks from Learned Optimization


I am now very confused about what you believe. Obviously training selects for low-loss algorithms... that's the whole point of training. I thought you were saying that training doesn't select for algorithms that internally optimize for loss, which is true, but it definitely does select for algorithms that in fact get low loss.

I mean "training signal" quite broadly there to include anything that might affect the model's ability to preserve its goals during training—probably I should have just used a different phrase, though I'm not exactly sure what the best phrase would be. To be clear, I think a deceptive model would likely be attempting to fool both the direct training signals like loss and the indirect training signals like developer perceptions.

As an aside, I think this is more about data instead of "how easy is it to implement."

This seems confused to me—I'm not sure that there's a meaningful sense in which you can say one of data vs. inductive biases matters "more." They are both absolutely essential, and you can't talk about what algorithm will be learned by a machine learning system unless you are engaging both with the nature of the data and the nature of the inductive biases, since if you only fix one and not the other you can learn essentially any algorithm.
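To make the underdetermination point concrete, here is a toy illustration of my own (not from the thread): the same training data is consistent with many different algorithms, so which one is learned depends entirely on the inductive bias used to break the tie.

```python
# Two hypotheses, both of which fit the training set perfectly.
train = [(0, 0), (1, 1)]  # underdetermined: many functions pass through these points

hypotheses = {
    "identity":  lambda x: x,      # a simple hypothesis
    "quadratic": lambda x: x * x,  # also achieves zero training loss
}

# Both hypotheses get zero loss on the data...
for name, h in hypotheses.items():
    loss = sum((h(x) - y) ** 2 for x, y in train)
    assert loss == 0, name

# ...but they disagree off-distribution, so the data alone cannot tell you
# which algorithm training will produce; the inductive biases must break the tie.
print(hypotheses["identity"](3), hypotheses["quadratic"](3))  # 3 vs 9
```

Fixing only the data (as here) leaves the learned algorithm unconstrained off-distribution; fixing only the inductive biases, with arbitrary data, is symmetrically uninformative.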

Furthermore, a vision system modeled after primate vision also generalized based on texture, which is further evidence against ANN-specific architectural biases (like conv layers) explaining the discrepancy.

To be clear, I'm not saying that the inductive biases that matter here are necessarily unique to ANNs. In fact, they can't be: by Occam's razor, simplicity bias is what gets you good generalization, and since both human neural networks and artificial neural networks can often achieve good generalization, they have to both be using a bunch of shared simplicity bias.

The problem is that pure simplicity bias doesn't actually get you alignment. So even if humans and AIs share 99% of inductive biases, what they're sharing is just the obvious simplicity bias stuff that any system capable of generalizing from real-world data has to share.

You do seem to be incorporating a "(strong) pressure to do well in training" in your reasoning about what gets trained.

I mean, certainly there is a strong pressure to do well in training—that's the whole point of training. What there isn't strong pressure for is for the model to internally be trying to figure out how to do well in training. The model need not be thinking about training at all to do well on the training objective, e.g. as in the aligned model.

To be clear, here are some things that I think:

  • The model needs to figure out how to somehow output a distribution that does well in training. Exactly how well relative to the inductive biases is unclear, but generally I think the easiest way to think about this is to take performance at the level you expect of powerful future models as a constraint.
  • There are many algorithms which result in outputting a distribution that does well in training. Some of those algorithms directly reason about the training process, whereas some do not.
  • Taking training performance as a constraint, the question is what is the easiest way (from an inductive bias perspective) to produce such a distribution.
  • Doing that is quite hard for the distributions that we care about and requires a ton of cognition and reasoning in any situation where you don't just get complete memorization (which is highly unlikely under the inductive biases).
  • Both the deceptive and sycophantic models involve directly reasoning about the training process internally to figure out how to do well on it. The aligned model likely also requires some reasoning about the training process, but only indirectly due to understanding the world being important and the training process being a part of the world.
  • Comparing the deceptive to sycophantic models, the primary question is which one is an easier way (from an inductive bias perspective) to compute how to do well on the training process: directly memorizing pointers to that information in the world model, or deducing that information using the world model based on some goal.

I have never heard anyone talk about this frame

I think probably that's just because you haven't talked to me much about this. The point about whether to use a loss minimization + inductive bias constraint vs. loss constraint + inductive bias minimization was a big one that I commented a bunch about on Joe's report. In fact, I suspect he'd probably have some more thoughts here on this—I think he's not fully sold on my framing above.

So this feels like a motte-and-bailey

I agree that there are some people that might defend different claims than I would, but I don't think I should be responsible for those claims. Part of why I'm excited about Joe's report is that it takes a bunch of different isolated thinking from different people and puts it into a single coherent position, so it's easier to evaluate that position in totality. If you have disagreements with my position, with Joe's position, or with anyone else's position, that's obviously totally fine—but you shouldn't equate them into one group and say it's a motte-and-bailey. Different people just think different things.

It seems to me like you're positing some "need to do well in training", which is... a kinda weird frame. In a weak correlational sense, it's true that loss tends to decrease over training-time and research-time.

No, I don't think I'm positing that—in fact, I said that the aligned model doesn't do this.

I feel like this unsupported assumption entered the groundwater somehow and now looms behind lots of alignment reasoning. I don't know where it comes from. On the off-chance it's actually well-founded, I'd deeply appreciate an explanation or link.

I do think this is a fine way to reason about things. Here's how I would justify it: we know that SGD is selecting for models based on some combination of loss and inductive biases, but we don't know the exact tradeoff. We could try to theorize directly about the multivariate optimization problem, but that's quite difficult. Instead, we can take either variable as a constraint, and theorize about the univariate optimization problem subject to that constraint. We now have two dual optimization problems, "minimize loss subject to some level of inductive biases" and "maximize fit to the inductive biases subject to some level of loss," which we can independently investigate to produce evidence about the original joint optimization problem.
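A minimal sketch of the two dual framings, using my own toy candidates and made-up numbers (the names and scores are illustrative assumptions, not claims about real models):

```python
# Each candidate model gets a (training_loss, complexity) pair, where
# description length stands in as a crude proxy for the inductive biases.
candidates = {
    # name: (training_loss, description_length)
    "memorizer": (0.00, 120),
    "aligned":   (0.02, 40),
    "deceptive": (0.02, 35),
}

def min_loss_given_complexity(budget):
    """Dual 1: minimize loss subject to an inductive-bias (complexity) constraint."""
    feasible = {k: v for k, v in candidates.items() if v[1] <= budget}
    return min(feasible, key=lambda k: feasible[k][0])

def min_complexity_given_loss(ceiling):
    """Dual 2: find the model most favored by the inductive biases
    subject to a loss constraint."""
    feasible = {k: v for k, v in candidates.items() if v[0] <= ceiling}
    return min(feasible, key=lambda k: feasible[k][1])

# Taking training performance as the constraint, the question reduces to
# which feasible model the inductive biases favor:
print(min_complexity_given_loss(0.02))  # "deceptive" under these toy numbers
```

Which model wins is entirely an artifact of the numbers chosen here; the point is only that the two constrained problems are well-posed individually even though the joint tradeoff is unknown.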

If anything, I've taken my part of the discussion from Twitter to LW.

Good point. I think I'm misdirecting my annoyance here; I really dislike that there's so much alignment discussion moving from LW to Twitter, but I shouldn't have implied that you were responsible for that—and in fact I appreciate that you took the time to move this discussion back here. Sorry about that—I edited my comment.

And my response is that I think the model pays a complexity penalty for runtime computations (since they translate into constraints on parameter values which are needed to implement those computations). Even if those computations are motivated by something we call a "goal", they still need to be implemented in the circuitry of the model, and thus also constrain its parameters.

Yes, I think we agree there. But it doesn't follow that, just because deceptive alignment is a way of calculating what the training process wants you to do, you can then memorize the result of that computation in the weights and thereby simplify the model—for the same reason SGD doesn't memorize the entire training distribution in the weights either.

I really don't like all this discussion happening on Twitter, and I appreciate that you took the time to move this back to LW/AF instead. I think Twitter is really a much worse forum for talking about complex issues like this than LW/AF.

Regardless, some quick thoughts:

[have some internal goal x] [backchain from wanting x to the stuff needed to get x (doing well at training)] [figure out how to do well at training] [actually do well at training]

and in comparison, the "honest" / direct solution looks like:

[figure out how to do well at training] [actually do well at training]

I think this is a mischaracterization of the argument. The argument for deceptive alignment is that deceptive alignment might be the easiest way for the model to figure out how to do well in training. So a more accurate comparison would be:

Deceptive model: [figure out how to do well at training] [actually do well at training]

Sycophantic model: [figure out how to do well at training] [actually do well at training]

Aligned model: [figure out how to be aligned] [actually be aligned]

Notably, the deceptive and sycophantic models are the same! But the difference is that they look different when we break apart the "figure out how to do well at training" part. We could do the same breakdown for the sycophantic model, which might look something like:

Sycophantic model: [load in some hard-coded specification of what it means to do well in training] [figure out how to execute on that specification in this environment] [actually do well at training]

The problem is that figuring out how to do well at training is actually quite hard, and deceptive alignment might make that problem easier by reducing it to the (potentially) simpler/easier problem of figuring out how to accomplish <insert any long-term goal here>. Whereas the sycophantic model just has to memorize a bunch of stuff about training that the deceptive model doesn't have to.

The point is that you can't just say "well, deceptive alignment results in the model trying to do well in training, so why not just learn a model that starts by trying to do well in training" for the same reason that you can't just say "well, deceptive alignment results in the model outputting this specific distribution, so why not just learn a model that memorizes that exact distribution". The entire question is about what the easiest way is to produce that distribution in terms of the inductive biases.

Also, another point that I'd note here: the sycophantic model isn't actually desirable either! So long as the deceptive model beats the aligned model in terms of the inductive biases, it's still a concern, regardless of whether it beats the sycophantic model or not. I'm pretty unsure which is more likely between the deceptive and sycophantic models, but I think both pretty likely beat the aligned model in most cases that we care about. But I'm more optimistic that we can find ways to address sycophantic models than deceptive models, such that I think the deceptive models are more of a concern.

It's entirely possible to state both "If x happened, it'd solve the problem", and "The policy we think is most likely to be effective in practice is Y". They can be put in the same statement quite simply.

That's a lot of nuance that you're trying to convey to the general public, which is a notoriously hard thing to do.

This really needs to be shouted from the rooftops.

I disagree. I think it's important that we shout from the rooftops that the existential risk from AI is real, but I disagree that we should shout from the rooftops that a sufficiently good pause would solve it (even though I agree with Paul that it is true). I talk about this in this comment.

Historically, I think a lot of causes have been hurt by a sort of purity-testing where scientists are forced to endorse the most extreme policy, even if it's not the best policy, on the idea that it would solve the problem in theory if you had a magic button that enacted it. Consider, for example, the idea that climate scientists should all have to endorse the claim that ending capitalism would solve climate change. Though true, I do not think that would help the cause of climate change! Even if climate change were enough of an existential risk that it was worth sacrificing our entire economy for it (as is maybe true of AI risk), it would still not be the case that advocating for that would be at all helpful, because there are much more effective ways of addressing climate change than starting a communist revolution.

I think everyone should be clear about what they think the risks are, but I think forcing people to publicly endorse policies that they don't endorse in practice just because they would solve the problem in theory is not a recipe for policy success.

I agree that it is important to be clear about the potential for catastrophic AI risk, and I am somewhat disappointed in the answer above (though I think calling "I don't know" lying is a bit of a stretch). But on the whole, I think people have been pretty upfront about catastrophic risk, e.g. Dario has given an explicit P(doom) publicly, all the lab heads have signed the CAIS letter, etc.

Notably, though, that's not what the original post is primarily asking for: it's asking for people to clearly state that they agree that we should pause/stop AI development, not to clearly state that they think AI poses a catastrophic risk. I agree that people should clearly state that they think there's a catastrophic risk, but I disagree that people should clearly state that they think we should pause.

Primarily, that's because I don't actually think trying to get governments to enact some sort of a generic pause would make good policy. Analogizing to climate change, I think getting scientists to say publicly that they think climate change is a real risk helped the cause, but putting pressure on scientists to publicly say that environmentalism/degrowth/etc. would solve the problem has substantially hurt the cause (despite the fact that a magic button that halved consumption would probably solve climate change).
