Edouard Harris

Independent researcher.

The alignment problem in different capability regimes

But in the context of superhuman systems, I think we need to be more concerned about the possibility that it's performance-uncompetitive to restrict your system to only taking actions that can be justified entirely with human-understandable reasoning.

Interestingly, this is already a well-known phenomenon in the hedge fund world. In fact, quant funds discovered about 25 years ago that the most consistently profitable trading signals are reliably the ones that are the *least* human-interpretable. It makes intuitive sense: any signal that can be understood by a human is at risk of being copied by a human, so if you insist that your trading decisions have to be interpretable, you'll pay for that insistence in alpha.

I'd imagine this kind of issue is already top-of-mind for folks who are working on the various transparency agendas, but it does imply that there's a very strong optimization pressure directly *against* interpretability in many economically relevant contexts. In fact, it could hardly be stronger: your forcing function is literally "Want to be a billionaire? Then you'll have to trade exclusively on the most incomprehensible signals you can find."

(Of course this isn't currently true of all hedge funds, only a few specialized ones.)

The alignment problem in different capability regimes

One reason to favor such a definition of alignment might be that we ultimately need a definition that gives us guarantees that hold at human-level capability or greater, and humans are probably near the bottom of the absolute scale of capabilities that can be physically realized in our world. It would (imo) be surprising to discover a useful alignment definition that held across capability levels way beyond us, but that *didn't* hold below our own modest level of intelligence.

When Most VNM-Coherent Preference Orderings Have Convergent Instrumental Incentives

No problem! Glad it was helpful. I think your fix makes sense.

> I'm not quite sure what the error was in the original proof of Lemma 3; I think it may be how I converted to and interpreted the vector representation.

Yeah, I figured maybe it was because the dummy variable was being used in the EV to sum over outcomes, while the vector was being used to represent the probabilities associated with those outcomes. Because the outcome indices and the probability indices look similar, it's easy to conflate their meanings, and if you apply the permutation to the wrong one by accident that has the same effect as applying its inverse to the other one. In any case though, the main result seems unaffected.

Cheers!

When Most VNM-Coherent Preference Orderings Have Convergent Instrumental Incentives

Thanks for writing this.

I have one point of confusion about some of the notation that's being used to prove Lemma 3. Apologies for the detail, but the mistake could very well be on my end so I want to make sure I lay out everything clearly.

First, $\phi$ is being defined here as an *outcome* permutation. Presumably this means that 1) $\phi(o_i) = o_j$ for some $i, j$; and 2) $\phi$ admits a unique inverse $\phi^{-1}$. That makes sense.

We also define lotteries over outcomes, presumably as, e.g., $L = \sum_{i=1}^n p_i o_i$, where $p_i$ is the probability of outcome $o_i$. Of course we can interpret the $o_i$ geometrically as mutually orthogonal unit vectors, so this lottery defines a point on the $(n-1)$-simplex. So far, so good.

But the thing that's confusing me is what this implies for the definition of $\phi(L)$. Because $\phi$ is defined as a permutation over *outcomes* (and not over *probabilities* of outcomes), we should expect this to be $\phi(L) = \sum_i p_i \, \phi(o_i)$.

The problem is that this seems to give a different EV from the lemma:

$$\mathbb{E}_{\phi(L)}[u] = \sum_i p_i \, u(\phi(o_i)) = \sum_j p_{\phi^{-1}(j)} \, u(o_j)$$

(Note that I'm using $j$ as the dummy variable rather than $i$, but the LHS above should correspond to line 2 of the proof.) Doing the same thing for the other lottery gives an analogous result. And then looking at the inequality that results suggests that lemma 3 should actually be "$\phi^{-1}$ induces the correspondence" as opposed to "$\phi$ induces the correspondence".

(As a concrete example, suppose we have a lottery $L = p_1 o_1 + p_2 o_2 + p_3 o_3$ with the permutation $\phi(o_1) = o_2$, $\phi(o_2) = o_3$, $\phi(o_3) = o_1$. Then $\phi(L) = p_3 o_1 + p_1 o_2 + p_2 o_3$ and our EV is

$$\mathbb{E}_{\phi(L)}[u] = p_1 u(o_2) + p_2 u(o_3) + p_3 u(o_1).$$

Yet line 2 of the proof would instead give $p_2 u(o_1) + p_3 u(o_2) + p_1 u(o_3)$, which appears to contradict the lemma as stated.)
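For what it's worth, here's a quick numerical version of that check. This is a toy of my own: the probabilities and utilities are arbitrary, and I'm encoding $\phi$ as an index map.

```python
# Toy check of the phi-vs-phi-inverse reindexing discussed above.
p = [0.5, 0.3, 0.2]      # lottery: probabilities of o1, o2, o3
u = [1.0, 10.0, 100.0]   # an arbitrary utility function over outcomes

phi = {0: 1, 1: 2, 2: 0}                  # the 3-cycle o1 -> o2 -> o3 -> o1
phi_inv = {v: k for k, v in phi.items()}  # its inverse

# EV of phi(L), computed directly: mass p[i] now sits on outcome phi(i).
ev_direct = sum(p[i] * u[phi[i]] for i in range(3))

# The same sum, reindexed by the target outcome j = phi(i), i.e. i = phi_inv(j).
ev_reindexed = sum(p[phi_inv[j]] * u[j] for j in range(3))

# The other reading: permuting the probability indices by phi itself.
ev_other = sum(p[phi[j]] * u[j] for j in range(3))

print(ev_direct, ev_reindexed, ev_other)
```

The first two quantities agree, since they are the same sum reindexed; the third, which permutes the probability indices by $\phi$ instead of $\phi^{-1}$, differs. That is the discrepancy described above.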

Note that even if this analysis is correct, it doesn't invalidate your main claim. You only really care about the *existence* of a bijection rather than *what* that bijection is — the fact that your outcome space is finite ensures that the proportion of orbit elements that incentivize power seeking remains the same either way. (It could have implications if you try to extend this to a metric space, though.)

Again, it's also possible I've just misunderstood something here — please let me know if that's the case!

Re-Define Intent Alignment?

Update: having now thought more deeply about this, I no longer endorse my above comment.

While I think the reasoning was right, I got the definitions exactly backwards. To be clear, what I would now claim is:

- The **behavioral objective** is the thing the agent is revealed to be pursuing *under arbitrary distributional shifts*.
- The **mesa-objective** is something the agent is revealed to be pursuing under *some subset of possible distributional shifts*.

Everything in the above comment then still goes through, except with these definitions reversed.

On the one hand, the "perfect IRL" definition of the behavioral objective seems more naturally consistent with the omnipotent experimenter setting in the IRL unidentifiability paper cited downthread. As far as I know, perfect IRL isn't defined anywhere other than by reference to this reward modelling paper, which introduces the term but doesn't define it either. But the omnipotent experimenter setting seems to capture all the properties implied by perfect IRL, and does so precisely enough that one can use it to make rigorous statements about the behavioral objective of a system in various contexts.

On the other hand, it's actually perfectly possible for a mesa-optimizer to have a mesa-objective that is inconsistent with its own actions under some subset of conditions (the key conceptual error I was making was in thinking this was not possible). For example, a human being is a mesa-optimizer from the point of view of evolution. A human being may have something like "maximize happiness" as their mesa-objective. And a human being may, and frequently does, do things that do not maximize for their happiness.

A few consequences of the above:

- Under an "omnipotent experimenter" definition, the behavioral objective (and *not* the mesa-objective) is a reliable invariant of the agent.
- It's entirely possible for the behavioral objective to be overdetermined in certain situations. i.e., if we run every possible experiment on an agent, we may find that the only reward / utility function consistent with its behavior across *all* those experiments is the trivial utility function that's constant across all states.
- If the behavioral objective of a system is overdetermined, that *might* mean the system never pursues anything coherently. But it might *also* mean that there exist *subsets* of distributions on which the system pursues an objective *very* coherently, but that different distributions induce different coherent objectives.
- The natural way to use the mesa-objective concept is to attach it to one of these subsets of distributions on which we hypothesize our system is pursuing a goal coherently. If we apply a restricted version of the omnipotent experimenter definition — that is, run every experiment on our agent *that's consistent with the subset of distributions we're conditioning on* — then we will in general recover a set of mesa-objective candidates consistent with the system's actions on that subset.
- It is strictly incorrect to refer to "the" mesa-objective of any agent or optimizer. Any reference to a mesa-objective has to be conditioned on the subset of distributions it applies on, otherwise it's underdetermined. (I believe Jack refers to this as a "perturbation set" downthread.)

This seems like it puts these definitions on a more rigorous footing. It also starts to clarify in my mind the connection with the "generalization-focused approach" to inner alignment, since it suggests a procedure one might use in principle to find out whether a system is pursuing coherent utilities on some subset of distributions. ("When we do every experiment allowed by this subset of distributions, do we recover a nontrivial utility function or not?")
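To sketch what that procedure could look like in the simplest possible case (this is my own toy construction, not anything from the thread: three outcomes, pairwise-choice experiments, and a coarse grid of candidate utilities):

```python
import itertools

# Toy "restricted omnipotent experimenter": run every pairwise-choice
# experiment allowed by some perturbation set, then keep exactly those
# candidate utility functions consistent with all the observed choices.

outcomes = [0, 1, 2]

def consistent_utilities(observed_choices, grid=(0, 1, 2)):
    """All utility assignments (from a coarse grid) consistent with the data.

    observed_choices: list of pairs (a, b) meaning the agent chose a over b.
    """
    result = []
    for u in itertools.product(grid, repeat=len(outcomes)):
        if all(u[a] >= u[b] for a, b in observed_choices):
            result.append(u)
    return result

# A coherent agent on this perturbation set: always prefers higher index.
coherent = [(2, 1), (1, 0), (2, 0)]
# An incoherent agent: exhibits a strict preference cycle.
cyclic = [(0, 1), (1, 2), (2, 0)]

print(consistent_utilities(coherent))
print(consistent_utilities(cyclic))
```

The coherent choice data leaves nontrivial candidates such as (0, 1, 2), while the cyclic data leaves only the constant utility functions. That is the overdetermined case described above.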

Would definitely be interested in getting feedback on these thoughts!

Re-Define Intent Alignment?

I'm with you on this, and I suspect we'd agree on most questions of fact around this topic. Of course demarcation is an operation on maps and not on territories.

But as a practical matter, the moment one starts talking about the definition of something such as a mesa-objective, one has already unfolded one's map and started pointing to features on it. And frankly, that seems fine! Because historically, a great way to make forward progress on a conceptual question has been to work out a sequence of maps that give you successive degrees of approximation to the territory.

I'm not suggesting actually trying to imbue an AI with such concepts — that would be dangerous (for the reasons you alluded to) even if it wasn't pointless (because prosaic systems will just learn the representations they need anyway). All I'm saying is that the moment we started playing the game of definitions, we'd already started playing the game of maps. So using an arbitrary demarcation to *construct* our definitions might be bad for any number of legitimate reasons, but it can't be bad just because it caused us to start using maps: our earlier decision to talk about definitions already did that.

(I'm not 100% sure if I've interpreted your objection correctly, so please let me know if I haven't.)

Re-Define Intent Alignment?

Yeah I agree this is a legitimate concern, though it seems like it is definitely possible to make such a demarcation in toy universes (like in the example I gave above). And therefore it ought to be possible *in principle* to do so in our universe.

To try to understand a bit better: does your pessimism about this come from the hardness of the *technical* challenge of querying a zillion-particle entity for its objective function? Or does it come from the hardness of the *definitional* challenge of exhaustively labeling every one of those zillion particles to make sure the demarcation is fully specified? Or is there a reason you think constructing any such demarcation is impossible even in principle? Or something else?

Re-Define Intent Alignment?

> I'm not sure what would constitute a clearly-worked counterexample. To me, a high reliance on an agent/world boundary constitutes a "non-naturalistic" assumption, which simply makes me think a framework is more artificial/fragile.

Oh for sure. I wouldn't recommend having a Cartesian boundary assumption as the fulcrum of your alignment strategy, for example. But what *could* be interesting would be to look at an isolated dynamical system, draw one boundary, investigate possible objective functions in the context of that boundary; then erase that first boundary, draw a *second* boundary, investigate that; etc. And then see whether any patterns emerge that might fit an intuitive notion of agency. But the only fundamentally *real* object here is always going to be the whole system, absolutely.

As I understand, something like AIXI forces you to draw one *particular* boundary because of the way the setting is constructed (infinite on one side, finite on the other). So I'd agree that sort of thing is more fragile.

The multiagent setting is interesting though, because it gets you into the game of carving up your universe into more than 2 pieces. Again it would be neat to investigate a setting like this with different choices of boundaries and see if some choices have more interesting properties than others.
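As a sketch of the kind of experiment I have in mind (entirely a toy of my own: the three-variable dynamics, the choice of boundaries as variable subsets, and the crude monotonicity test for "pursuing an objective" are all assumptions):

```python
import itertools

def step(state):
    # Fixed toy dynamics over three variables: x chases a target of 1.0,
    # y decays geometrically, z accumulates |x|.
    x, y, z = state
    return (x + 0.1 * (1.0 - x), 0.9 * y, z + abs(x))

def trajectory(state, n=20):
    traj = [state]
    for _ in range(n):
        state = step(state)
        traj.append(state)
    return traj

def monotone_objectives(traj, boundary):
    """Indices in `boundary` whose values strictly increase along traj."""
    coherent = []
    for i in boundary:
        vals = [s[i] for s in traj]
        if all(b > a for a, b in zip(vals, vals[1:])):
            coherent.append(i)
    return coherent

traj = trajectory((0.5, 1.0, 0.0))

# Enumerate candidate agent/environment boundaries (subsets of variables)
# and see which ones contain a coherently "pursued" quantity.
for r in (1, 2):
    for boundary in itertools.combinations(range(3), r):
        print(boundary, monotone_objectives(traj, boundary))
```

Under this (very crude) test, boundaries containing variables 0 or 2 look goal-directed while variable 1 does not. The point is only that the same system can look more or less agentic depending on where you draw the line.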

Re-Define Intent Alignment?

> I would further add that looking for difficulties created by the simplification seems very intellectually productive.

Yep, strongly agree. And a good first step to doing this is to *actually build* as robust a simplification as you can, and then see where it breaks. (Working on it.)

I agree with pretty much this whole comment, but do have one question:

Given that this is conditioned on us getting to AGI, wouldn't the intuition here be that pretty much all the most valuable things such a system would do would fall under "exotic circumstances" with respect to any realistic training distribution? I might be assuming too much in saying that — e.g., I'm taking it for granted that anything we'd call an AGI could self-improve to the point of accessing states of the world that we wouldn't be able to train it on; and also I'm assuming that the highest-reward states would probably be these exotic / hard-to-access ones. But both of those do seem (to me) like they'd be the default expectation.

Or maybe you mean it seems plausible that, even under those exotic circumstances, an AGI may still be able to correctly infer our intent, and be incentivized to act in alignment with it?