Edouard Harris

Independent researcher.


Clarifying inner alignment terminology

Sure, makes sense! Though to be clear, I believe what I'm describing should apply to optimizers other than just gradient descent — including optimizers one might think of as reward-maximizing agents.

Clarifying inner alignment terminology

Great post! Thanks for writing this — it feels quite clarifying. I'm finding the diagram especially helpful in resolving the sources of my confusion.

I believe everything here is consistent with the definitions I proposed recently in this post (though please do point out any inconsistencies if you see them!), with the exception of one point.

This may be a fundamental confusion on my part — but I don't see objective robustness, as defined here, as being a separate concept at all from inner alignment. The crucial point, I would argue, is that we ought to be treating the human who designed our agent as the base optimizer for the entire system. 

Zooming in on the "inner alignment  objective robustness" part of the diagram, I think what's actually going on is something like:

  1. A human AI researcher wishes to optimize for some base objective, .
  2. It would take too much work for our researcher to optimize for  manually. So our researcher builds an agent to do the work instead, and sets  to be the agent's loss function.
  3. Depending on how it's built, the agent could end up optimizing for , or it could end up optimizing for something different. The thing the agent ends up truly optimizing for is the agent's behavioral objective — let's call it . If  is aligned with , then the agent satisfies objective robustness by the above definition: its behavioral objective is aligned with the base. So far, so good.

    But here's the key point: from the point of view of the human researcher who built the agent, the agent is actually a mesa-optimizer, and the agent's "behavioral objective" is really just the mesa-objective of that mesa-optimizer.
  4. And now, we've got an agent that wishes to optimize for some mesa-objective . (Its "behavioral objective" by the above definition.)
  5. And then our agent builds a sub-agent to do the work instead, and sets  to be the sub-agent's loss function.
  6. I'm sure you can see where I'm going with this by now, but the sub-agent the agent builds will have its own objective  which may or may not be aligned with , which may or may not in turn be aligned with . From the point of view of the agent, that sub-agent is a mesa-optimizer. But from the point of view of the researcher, it's actually a "mesa-mesa-optimizer".

That is to say, I think there are three levels of optimizers being invoked implicitly here, not just two. Through that lens, "intent alignment", as defined here, is what I'd call "inner alignment between the researcher and the agent"; and "inner alignment", as defined here, is what I'd call "inner alignment between the agent and the mesa-optimizer it may give rise to".

In other words, humans live in this hierarchy too, and we should analyze ourselves in the same terms — and using the same language — as we'd use to analyze any other optimizer. (I do, for what it's worth, make this point in my earlier post — though perhaps not clearly enough.)

Incidentally, this is one of the reasons I consider the concepts of inner alignment and mesa-optimization to be so compelling. When a conceptual tool we use to look inside our machines can be turned outward and aimed back at ourselves, that's a promising sign that it may be pointing to something fundamental.

A final caveat: there may well be a big conceptual piece that I'm missing here, or a deep confusion that I have around one or more of these concepts that I'm still unaware of. But I wanted to lay out my thinking as clearly as I could, to make it as easy as possible for folks to point out any mistakes — would enormously appreciate any corrections!

Biextensional Equivalence

Really interesting!

I think there might be a minor typo in Section 2.2:

For transitivity, assume that for 

I think this should be  based on the indexing in the rest of the paragraph.

Defining capability and alignment in gradient descent

Thanks for the kind words, Adam! I'll follow up over DM about early drafts — I'm interested in getting feedback that's as broad as possible and really appreciate the kind offer here.

Typo is fixed — thanks for pointing it out!

At first I wondered why you were taking the sum instead of just , but after thinking about it, the latter would probably converge to 0 almost all the time, because even with amazing optimization, the loss will stop being improved by a factor linear in T at some point. That might be interesting to put in the post itself.

Yes, the problem with that definition would indeed be that if your optimizer converges to some limiting loss function value like , then you'd get  for any .

Thanks again!

Defining capability and alignment in gradient descent

Thanks for the comment!

Not sure if I agree with your interpretation of the "real objective" - might be better served by looking for stable equilibria and just calling them as such.

I think this is a reasonable objection. I don't make this very clear in the post, but the "true objective" I've written down in the example indeed isn't unique: like any measure of utility or loss, it's only unique up to affine transformations with positive coefficients. And that could definitely damage the usefulness of these definitions, since it means that alignment factors, for example, aren't uniquely defined either. (I'll be doing a few experiments soon to investigate this, and a few other questions, in a couple of real systems.)

Don't we already have weak alignment to arbitrary functions using annealing (basically, jump at random, but jump around more/further on average when the loss is higher and lower the jumping rate over time)? The reason we don't add small annealing terms to gradient descent is entirely because of we expect them to be worse in the short term (a "strong alignment" question).

Interesting question! To try to interpret in light of the definitions I'm proposing: adding annealing changes the true objective (or mesa-objective) of the optimizer, which is no longer solely trying to minimize its gradients — it now has this new annealing term that it's also trying to optimize for. Whether this improves alignment or not depends on the effect annealing has on 1) the long-term performance of the mesa-optimizer on its new (gradient + annealing) objective; and 2) the long-term performance this induces on the base objective.


Hope that's somewhat helpful, but please let me know if it's unclear and I can try to unpack things a bit more!