Evan Hubinger

Evan Hubinger (he/him/his) (evanjhub@gmail.com)

I am a research scientist at Anthropic leading work on model organisms of misalignment. My posts and comments are my own and do not represent Anthropic's positions, policies, strategies, or opinions.

Previously: MIRI, OpenAI

See: “Why I'm joining Anthropic

Selected work:


Conditioning Predictive Models
ML Alignment Theory Scholars Program Winter 2021
Risks from Learned Optimization

Wiki Contributions


You do seem to be incorporating a "(strong) pressure to do well in training" in your reasoning about what gets trained.

I mean, certainly there is a strong pressure to do well in training—that's the whole point of training. What there isn't strong pressure for is for the model to internally be trying to figure out how to do well in training. The model need not be thinking about training at all to do well on the training objective, e.g. as in the aligned model.

To be clear, here are some things that I think:

  • The model needs to figure out how to somehow output a distribution that does well in training. Exactly how well relative to the inductive biases is unclear, but generally I think the easiest way to think about this is to take performance at the level you expect of powerful future models as a constraint.
  • There are many algorithms which result in outputting a distribution that does well in training. Some of those algorithms directly reason about the training process, whereas some do not.
  • Taking training performance as a constraint, the question is what is the easiest way (from an inductive bias perspective) to produce such a distribution.
  • Doing that is quite hard for the distributions that we care about and requires a ton of cognition and reasoning in any situation where you don't just get complete memorization (which is highly unlikely under the inductive biases).
  • Both the deceptive and sycophantic models involve directly reasoning about the training process internally to figure out how to do well on it. The aligned model likely also requires some reasoning about the training process, but only indirectly due to understanding the world being important and the training process being a part of the world.
  • Comparing the deceptive to sycophantic models, the primary question is which one is an easier way (from an inductive bias perspective) to compute how to do well on the training process: directly memorizing pointers to that information in the world model, or deducing that information using the world model based on some goal.

I have never heard anyone talk about this frame

I think probably that's just because you haven't talked to me much about this. The point about whether to use a loss minimization + inductive bias constraint vs. loss constraint + inductive bias minimization was a big one that I commented a bunch about on Joe's report. In fact, I suspect he'd probably have some more thoughts here on this—I think he's not fully sold on my framing above.

So this feels like a motte-and-bailey

I agree that there are some people that might defend different claims than I would, but I don't think I should be responsible for those claims. Part of why I'm excited about Joe's report is that it takes a bunch of different isolated thinking from different people and puts it into a single coherent position, so it's easier to evaluate that position in totality. If you have disagreements with my position, with Joe's position, or with anyone else's position, that's obviously totally fine—but you shouldn't equate them into one group and say it's a motte-and-bailey. Different people just think different things.

It seems to me like you're positing some "need to do well in training", which is... a kinda weird frame. In a weak correlational sense, it's true that loss tends to decrease over training-time and research-time.

No, I don't think I'm positing that—in fact, I said that the aligned model doesn't do this.

I feel like this unsupported assumption entered the groundwater somehow and now looms behind lots of alignment reasoning. I don't know where it comes from. On the off-chance it's actually well-founded, I'd deeply appreciate an explanation or link.

I do think this is a fine way to reason about things. Here's how I would justify this: We know that SGD is selecting for models based on some combination of loss and inductive biases, but we don't know the exact tradeoff. We could just try to directly theorize about the multivariate optimization problem, but that's quite difficult. Instead, we can take either variable as a constraint, and theorize about the univariate optimization problem subject to that constraint. We now have two dual optimization problems, "minimize loss subject to some level of inductive biases" and "maximize inductive biases subject to some level of loss" which we can independently investigate to produce evidence about the original joint optimization problem.

If anything, I've taken my part of the discussion from Twitter to LW.

Good point. I think I'm misdirecting my annoyance here; I really dislike that there's so much alignment discussion moving from LW to Twitter, but I shouldn't have implied that you were responsible for that—and in fact I appreciate that you took the time to move this discussion back here. Sorry about that—I edited my comment.

And my response is that I think the model pays a complexity penalty for runtime computations (since they translate into constraints on parameter values which are needed to implement those computations). Even if those computations are motivated by something we call a "goal", they still need to be implemented in the circuitry of the model, and thus also constrain its parameters.

Yes, I think we agree there. But that doesn't imply that just because deceptive alignment is a way of calculating what the training process wants you to do, that you can then just memorize the result of that computation in the weights and thereby simplify the model—for the same reason SGD doesn't memorize the entire distribution in the weights either.

I really don't like all this discussion happening on Twitter, and I appreciate that you took the time to move this back to LW/AF instead. I think Twitter is really a much worse forum for talking about complex issues like this than LW/AF.

Regardless, some quick thoughts:

[have some internal goal x] [backchain from wanting x to the stuff needed to get x (doing well at training)] [figure out how to do well at training] [actually do well at training]

and in comparison, the "honest" / direct solution looks like:

[figure out how to do well at training] [actually do well at training]

I think this is a mischaracterization of the argument. The argument for deceptive alignment is that deceptive alignment might be the easiest way for the model to figure out how to do well in training. So a more accurate comparison would be:

Deceptive model: [figure out how to do well at training] [actually do well at training]

Sycophantic model: [figure out how to do well at training] [actually do well at training]

Aligned model: [figure out how to be aligned] [actually be aligned]

Notably, the deceptive and sycophantic models are the same! But the difference is that they look different when we break apart the "figure out how to do well at training" part. We could do the same breakdown for the sycophantic model, which might look something like:

Sycophantic model: [load in some hard-coded specification of what it means to do well in training] [figure out how to execute on that specification in this environment] [actually do well at training]

The problem is that figuring out how to do well at training is actually quite hard, and deceptive alignment might make that problem easier by reducing it to the (potentially) simpler/easier problem of figuring out how to accomplish <insert any long-term goal here>. Whereas the sycophantic model just has to memorize a bunch of stuff about training that the deceptive model doesn't have to.

The point is that you can't just say "well, deceptive alignment results in the model trying to do well in training, so why not just learn a model that starts by trying to do well in training" for the same reason that you can't just say "well, deceptive alignment results in the model outputting this specific distribution, so why not just learn a model that memorizes that exact distribution". The entire question is about what the easiest way is to produce that distribution in terms of the inductive biases.

Also, another point that I'd note here: the sycophantic model isn't actually desirable either! So long as the deceptive model beats the aligned model in terms of the inductive biases, it's still a concern, regardless of whether it beats the sycophantic model or not. I'm pretty unsure which is more likely between the deceptive and sycophantic models, but I think both pretty likely beat the aligned model in most cases that we care about. But I'm more optimistic that we can find ways to address sycophantic models than deceptive models, such that I think the deceptive models are more of a concern.

It's entirely possible to state both "If x happened, it'd solve the problem", and "The policy we think is most likely to be effective in practice is Y". They can be put in the same statement quite simply.

That's a lot of nuance that you're trying to convey to the general public, which is a notoriously hard thing to do.

This really needs to be shouted from the rooftops.

I disagree. I think it's important that we shout from the rooftops that the existential risk from AI is real, but I disagree that we should shout from the rooftops that a sufficiently good pause would solve it (even though I agree with Paul that it is true). I talk about this in this comment.

Historically, I think that a lot of causes have been hurt by a sort of purity-testing where scientists are forced to endorse the most extreme policy, even if it's not the best policy, on the idea that it would solve the problem in theory if you had a magic button that enacted it. Consider, for example, the idea that climate scientists should all have to endorse the idea that, if we ended capitalism, it would solve climate change. Though true, I do not think that would help the cause of climate change! Even if climate change were enough of an existential risk that it was worth sacrificing our entire economy for it (as I think is probably true of AI risk), it would still not be the case that advocating for that would be at all helpful, because there are much more effective ways of addressing climate change that starting a communist revolution.

I think everyone should be clear about what they think the risks are, but I think forcing people to publicly endorse policies that they don't endorse in practice just because they would solve the problem in theory is not a recipe for policy success.

I agree that it is important to be clear about the potential for catastrophic AI risk, and I am somewhat disappointed in the answer above (though I think calling "I don't know" lying is a bit of a stretch). But on the whole, I think people have been pretty upfront about catastrophic risk, e.g. Dario has given an explicit P(doom) publicly, all the lab heads have signed the CAIS letter, etc.

Notably, though, that's not what the original post is primarily asking for: it's asking for people to clearly state that they agree that we should pause/stop AI development, not to clearly state that that they think AI poses a catastrophic risk. I agree that people should clearly state that they think there's a catastrophic risk, but I disagree that people should clearly state that they think we should pause.

Primarily, that's because I don't actually think trying to get governments to enact some sort of a generic pause would make good policy. Analogizing to climate change, I think getting scientists to say publicly that they think climate change is a real risk helped the cause, but putting pressure on scientists to publicly say that environmentalism/degrowth/etc. would solve the problem has substantially hurt the cause (despite the fact that a magic button that halved consumption would probably solve climate change).

I mean, whether something's realistic and whether something's actionable are two different things (both separate from whether something's nebulous) - even if it's hard to make a pause happen, I have a decent guess about what I'd want to do to up those odds: protest, write to my congress-person, etc.

Sure—I just think it'd be better to spend that energy advocating for good RSPs instead.

To be clear, the whole point of my post is that I am in favor of pausing/stopping AI development—I just think the best way to do that is via RSPs.

I'm happy to state on the record that, if I had a magic button that I could press that would stop all AGI progress for 50 years, I would absolutely press that button. I don't agree with the idea that it's super important to trot everyone out and get them to say that publicly, but I'm happy to say it for myself.

If you think that Anthropic and other labs that adopt these are fundamentally well meaning and trying to do the right thing, you'll assume that we are by default heading down path #1. If you are more cynical about how companies are acting, then #2 may seem more plausible.

I disagree that what you think about a lab's internal motivations should be very relevant here. For any particular lab/government adopting any particular RSP, you can just ask, does having this RSP make it easier or harder to implement future good legislation? My sense is that the answer to that question should mostly depend on whether the substance of the RSP is actually better-than-nothing or not, and what your general models of politics are, rather than any facts about people's internal motivations—especially since trying to externally judge the motivation of a company with huge PR resources is a fundamentally fraught thing to do.

Furthermore, my sense is that, most of the time, the crux here tends to be more around models of how politics works. If you think that there's only a very narrow policy window to get in some policy and if you get the wrong policy in you miss your shot, then you won't be willing to accept an RSP that is good but insufficient on its own. I tend to refer to this as the "resource mindset"—you're thinking of political influence, policy windows, etc. as a limited resource to be spent wisely. My sense, though, is that the resource mindset is just wrong when applied to politics—the right mindset, I think, is a positive-sum mindset, where small better-than-nothing policy actions yield larger, even-better-than-nothing policy actions, until eventually you build up to something sufficient.

Certainly I could imagine situations where an RSP is crafted in such a way as to try to stymie future regulation, though I think doing so is actually quite hard:

  • Governments have sovereignty, so you can't just restrict what they'll do in the future.
  • Once a regulatory organization exists for something, it's very easy to just give it more tasks, make it stricter, etc., and much harder to get rid of it, so the existence of previous regulation generally makes new regulation easier not harder.
  • At least in democracies, leaders regularly come and go, and tend to like to get their new thing passed without caring that much about repealing the old thing, so different overlapping regulations can easily pile up.

Of course, that's not to say that we shouldn't still ask for RSPs that make future regulation even more likely to be good, e.g. by:

  • Not overclaiming about what sort of stuff is measurable (e.g. not trying to formalize simple metrics for alignment that will be insufficient).
  • Leaving open clear and obvious holes to be filled later by future regulation.
  • etc.
Load More