Evan Hubinger

I am a Research Fellow at MIRI working on inner alignment for amplification.

See: "What I'll be doing at MIRI."

Pronouns: he/him/his

Evan Hubinger's Comments

Exploring safe exploration

Hey Aray!

Given this, I think the "within-episode exploration" and "across-episode exploration" relax into each other, and (as the distinction of episode boundaries fades) turn into the same thing, which I think is fine to call "safe exploration".

I agree with this. I jumped the gun a bit in not really making the distinction clear in my earlier post “Safe exploration and corrigibility,” and I think that made it a bit confusing, so I went heavy on the distinction here, though perhaps heavier than I actually think is warranted.

The problem I have with relaxing within-episode and across-episode exploration into each other, though, is precisely the problem I describe in “Safe exploration and corrigibility,” which is that by default you only end up with capability exploration, not objective exploration. That is, an agent with a goal (i.e. a mesa-optimizer) is only going to explore to the extent that doing so helps its current goal, not to the extent that doing so helps change its goal to be more like the desired goal. Thus, you need to do something else (something that possibly looks somewhat like corrigibility) to get the agent to explore in a way that helps you collect data on what its goal is and how to change it.

Malign generalization without internal search

I don't feel like you're really understanding what I'm trying to say here. I'm happy to chat with you about this more over video call or something if you're interested.

Malign generalization without internal search

I think that piecewise objectives are quite reasonable and natural—and I don't think they'll make transparency that much harder. I don't think there's any reason that we should expect objectives to be continuous in some nice way, so I fully expect you'll get these sorts of piecewise jumps. Nevertheless, the resulting objective in the piecewise case is still quite simple such that you should be able to use interpretability tools to understand it pretty effectively—a switch statement is not that complicated or hard to interpret—with most of the real hard work still primarily being done in the optimization.

I do think there are a lot of possible ways in which the interpretability for mesa-optimizers story could break down—which is why I'm still pretty uncertain about it—but I don't think that a switch-case agent is such an example. Probably the case that I'm most concerned about right now is if you get an agent which has an objective which changes in a feedback loop with its optimization. If the objective and the optimization are highly dependent on each other, then I think that would make the problem a lot more difficult—and is the sort of thing that humans seem to do, which suggests that it's the sort of thing we might see in AI systems as well. On the other hand, a fixed switch-case objective is pretty easy to interpret, since you just need to understand the simple, fixed heuristics being used in the switch statement and then you can get a pretty good grasp on what your agent's objective is. Where I start to get concerned is when those switch statements themselves depend upon the agent's own optimization—a recursion which could possibly be many layers deep and quite difficult to disentangle. That being said, even in such a situation you're still using search to get your robust capabilities.

Malign generalization without internal search

Consider an agent that could, during its operation, call upon a vast array of subroutines. Some of these subroutines can accomplish extremely complicated actions, such as "Prove this theorem: [...]" or "Compute the fastest route to Paris." We then imagine that this agent still shares the basic superstructure of the pseudocode I gave initially above.

I feel like what you're describing here is just optimization where the objective is determined by a switch statement, which certainly seems quite plausible to me but also pretty neatly fits into the mesa-optimization framework.
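As a rough illustration of what "optimization where the objective is determined by a switch statement" could look like, here is a minimal Python sketch. Everything in it (the names `reach_goal`, `stay_home`, `select_objective`, `search`, `act`, and the integer state space) is hypothetical and chosen only for illustration; it is not the pseudocode from the original post.

```python
from typing import Callable, Iterable

Objective = Callable[[int], float]


def reach_goal(state: int) -> float:
    """Hypothetical objective: prefer states close to 10."""
    return -abs(state - 10)


def stay_home(state: int) -> float:
    """Hypothetical objective: prefer states close to 0."""
    return -abs(state)


def select_objective(observation: str) -> Objective:
    """The fixed 'switch statement': a simple heuristic picks the objective."""
    if observation == "task":
        return reach_goal
    return stay_home


def search(objective: Objective, candidates: Iterable[int]) -> int:
    """The optimization step: brute-force search for the best candidate."""
    return max(candidates, key=objective)


def act(observation: str, candidates: Iterable[int]) -> int:
    """Pick an objective with the switch, then optimize for it."""
    return search(select_objective(observation), candidates)
```

In this toy setup, `act("task", range(-5, 16))` searches under `reach_goal` while any other observation searches under `stay_home`. The switch itself is a few lines of fixed logic, and the search is where the capability lives.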

More generally, while I certainly buy that you can produce simple examples of things that look kinda like capability generalization without objective generalization on environments like the lunar lander or my maze example, it still seems to me like you need optimization to actually get capabilities that are robust enough to pose a serious risk, though I remain pretty uncertain about that.

Outer alignment and imitative amplification

Is "outer alignment" meant to be applicable in the general case?

I'm not exactly sure what you're asking here.

Do you think it also makes sense to talk about outer alignment of the training process as a whole, so that for example if there is a security hole in the hardware or software environment and the model takes advantage of the security hole to hack its loss/reward, then we'd call that an "outer alignment failure"?

I would call that an outer alignment failure, but only because I would say that the ways in which your loss function can be hacked are part of the specification of your loss function. However, I wouldn't consider an entire training process to be outer aligned—rather, I would just say that an entire training process is aligned. I generally use outer and inner alignment to refer to different components of aligning the training process—namely the objective/loss function/environment in the case of outer alignment and the inductive biases/architecture/optimization procedure in the case of inner alignment (though note that this is a more general definition than the one used in “Risks from Learned Optimization,” as it makes no mention of mesa-optimizers, though I would still say that mesa-optimization is my primary example of how you could get an inner alignment failure).

So technically, one should say that a loss function is outer aligned at optimum with respect to some model class, right?

Yes, though in the definition I gave here I just used the model class of all functions, which is obviously too large but has the nice property of being a fully general definition.

Also, related to Ofer's comment, can you clarify whether it's intended for this definition that the loss function only looks at the model's input/output behavior, or can it also take into account other information about the model?

I would include all possible input/output channels in the domain/codomain of the model when interpreted as a function.

I'm also curious whether you have HBO or LBO in mind for this post.

I generally think you need HBO and am skeptical that LBO can actually do very much.

Outer alignment and imitative amplification

I think I'm quite happy even if the optimal model is just trying to do what we want. With imitative amplification, the true optimum—HCH—still has benign failures, but I nevertheless want to argue that it's aligned. In fact, I think this post really only makes sense if you adopt a definition of alignment that excludes benign failures, since otherwise you can't really consider HCH aligned (and thus can't consider imitative amplification outer aligned at optimum).

Exploring safe exploration

Like I said in the post, I'm skeptical that “preventing the agent from making an accidental mistake” is actually a meaningful concept (or at least, it's a concept with many possible conflicting definitions), so I'm not sure how to give an example of it.

Exploring safe exploration

I definitely was not arguing that. I was arguing that safe exploration is currently defined in ML as the agent making an accidental mistake, and that we should really not be having terminology collisions with ML. (I may have left that second part implicit.)

Ah, I see—thanks for the correction. I changed “best” to “current.”

I assume that the difference you see is that you could try to make across-episode exploration less detrimental from the agent's perspective

No, that's not what I was saying. When I said “reward acquisition” I meant the actual reward function (that is, the base objective).


That being said, it's a little bit tricky in some of these safe exploration setups to draw the line between what's part of the base objective and what's not. For example, I would generally include the constraints in constrained optimization setups as just being part of the base objective, only specified slightly differently. In that context, constrained optimization is less of a safe exploration technique and more of a reward-engineering-y/outer alignment sort of thing, though it also has a safe exploration component to the extent that it constrains across-episode exploration.

Note that when across-episode exploration is learned, the distinction between safe exploration and outer alignment becomes even more muddled, since then all the other terms in the loss will implicitly serve to check the across-episode exploration term, as the agent has to figure out how to trade off between them.[1]

  1. This is another one of the points I was trying to make in “Safe exploration and corrigibility” but didn't do a great job of conveying properly.

Safe exploration and corrigibility

I completely agree with the distinction between across-episode vs. within-episode exploration, and I agree I should have been clearer about that. Mostly I want to talk about across-episode exploration here, though when I was writing this post I was mostly motivated by the online learning case where the distinction is in fact somewhat blurred, since in an online learning setting you do in fact need the deployment policy to balance between within-episode exploration and across-episode exploration.

Usually (in ML) "safe exploration" means "the agent doesn't make a mistake, even by accident"; ϵ-greedy exploration wouldn't be safe in that sense, since it can fall into traps. I'm assuming that by "safe exploration" you mean "when the agent explores, it is not trying to deceive us / hurt us / etc".

Agreed. My point is that “If you assume that the policy without exploration is safe, then for ϵ-greedy exploration to be safe on average, it just needs to be the case that the environment is safe on average, which is just a standard engineering question.” That is, even though it seems like it's hard for ϵ-greedy exploration to be safe, it's actually quite easy for it to be safe on average: you just need to be in a safe environment. That's not true for learned exploration, though.
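One way to make the "safe on average" claim concrete is a minimal sketch under some loud assumptions: a single state, a numeric per-action "harm" score, and standard ϵ-greedy that picks a uniformly random action with probability ϵ. The function name `expected_step_harm` and the harm numbers are hypothetical.

```python
from typing import Sequence


def expected_step_harm(
    epsilon: float, greedy_harm: float, action_harms: Sequence[float]
) -> float:
    """Expected harm of one epsilon-greedy step in a single state.

    Assumes harm is a number per action (an illustrative modelling choice).
    With probability (1 - epsilon) the greedy action is taken; with
    probability epsilon an action is drawn uniformly from all actions.
    """
    if not 0.0 <= epsilon <= 1.0:
        raise ValueError("epsilon must be in [0, 1]")
    if not action_harms:
        raise ValueError("action_harms must be non-empty")
    uniform_harm = sum(action_harms) / len(action_harms)
    return (1.0 - epsilon) * greedy_harm + epsilon * uniform_harm
```

If the greedy policy is safe (`greedy_harm == 0.0`), the expected harm reduces to `epsilon` times the average harm of a random action, which is a property of the environment alone. No analogous decomposition is available when the exploration itself is learned.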

Since by default policies can't affect across-episode exploration, I assume you're talking about within-episode exploration. But this happens all the time with current RL methods

Yeah, I agree that was confusing—I'll rephrase it. The point I was trying to make was that across-episode exploration should arise naturally, since an agent with a fixed objective should want to be modified to better pursue that objective, but not want to be modified to pursue a different objective.

This sounds to me like reward uncertainty, assistance games / CIRL, and more generally Stuart Russell's agenda, except applied to mesa optimization now. Should I take away something other than "we should have our mesa optimizers behave like the AIs in assistance games"? I feel like you are trying to say something else but I don't know what.

Agreed that there's a similarity there—that's the motivation for calling it “cooperative.” But I'm not trying to advocate for that agenda here—I'm just trying to better classify the different types of corrigibility and understand how they work. In fact, I think it's quite plausible that you could get away without cooperative corrigibility, though I don't really want to take a stand on that right now.

I thought we were talking about "the agent doesn't try to deceive us / hurt us by exploring", which wouldn't tell us anything about the problem of "the agent doesn't make an accidental mistake".

If your definition of “safe exploration” is “not making accidental mistakes” then I agree that what I'm pointing at doesn't fall under that heading. What I'm trying to point at is that I think there are other problems that we need to figure out regarding how models explore than just the “not making accidental mistakes” problem, though I have no strong feelings about whether or not to call those other problems “safe exploration” problems.

The same way as capability exploration; based on value of information (VoI). (I assume you have a well-specified distribution over objectives; if you don't, then there is no proper way to do it, in the same way there's no proper way to do capability exploration without a prior over what you might see when you take the new action.)

Agreed, though I don't think that's the end of the story. In particular, I don't think it's at all obvious what an agent that cares about the value of information that its actions produce relative to some objective distribution will look like, how you could get such an agent, or how you could verify when you had such an agent. And, even if you could do those things, it still seems pretty unclear to me what the right distribution over objectives should be and how you should learn it.
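For readers who want the value-of-information framing spelled out, here is a minimal sketch of VoI over a finite distribution of objectives. It assumes a well-specified prior, a shared finite action set, and perfect information (learning exactly which objective is the true one), all of which are simplifying assumptions; the names `expected_utility` and `value_of_information` are hypothetical.

```python
from typing import Dict

Prior = Dict[str, float]                  # objective name -> probability
Utilities = Dict[str, Dict[str, float]]   # objective name -> action -> utility


def expected_utility(prior: Prior, utilities: Utilities, action: str) -> float:
    """Expected utility of an action, averaged over the prior on objectives."""
    return sum(p * utilities[obj][action] for obj, p in prior.items())


def value_of_information(prior: Prior, utilities: Utilities) -> float:
    """Gain from learning the true objective before choosing an action."""
    actions = next(iter(utilities.values())).keys()
    # Best the agent can do if it must commit to one action under uncertainty.
    best_without_info = max(expected_utility(prior, utilities, a) for a in actions)
    # Best the agent can do if it first learns which objective is correct.
    best_with_info = sum(p * max(utilities[obj].values()) for obj, p in prior.items())
    return best_with_info - best_without_info
```

With two equally likely objectives that disagree about which of two actions is best, the computation is easy. The sketch says nothing about where the prior comes from or how to check that a trained agent is actually doing this calculation, which are the open questions raised above.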

The algorithms used are not putting dampers on exploration; they are trying to get the agent to do better exploration (e.g. if you crashed into the wall and saw that that violated a constraint, don't crash into the wall again just because you forgot about that experience).

Well, what does “better exploration” mean? Better across-episode exploration or better within-episode exploration? Better relative to the base objective or better relative to the mesa-objective? I think it tends to be “better within-episode exploration relative to the base objective,” which I would call putting a damper on instrumental exploration, since instrumental exploration does across-episode and within-episode exploration only for the mesa-objective, not the base objective.

If you have the right uncertainty, then acting optimally to maximize that is the "right" thing to do.

Sure, but as you note getting the right uncertainty could be quite difficult, so for practical purposes my question is still unanswered.

Inductive biases stick around

I just edited the last sentence to be clearer in terms of what I actually mean by it.
