I agree with pretty much this whole comment, but do have one question:
> But it still seems plausible that in practice we never hit those exotic circumstances (because those exotic circumstances never happen, or because we've retrained the model before we get to the exotic circumstances, etc), and it's intent aligned in all the circumstances the model actually encounters.
Given that this is conditioned on us getting to AGI, wouldn't the intuition here be that pretty much all the most valuable things such a system would do would fall under "exotic circumstances" with respect to any realistic training distribution? I might be assuming too much in saying that — e.g., I'm taking it for granted that anything we'd call an AGI could self-improve to the point of accessing states of the world that we wouldn't be able to train it on; and also I'm assuming that the highest-reward states would probably be these exotic / hard-to-access ones. But both of those do seem (to me) like they'd be the default expectation.
Or maybe you mean it seems plausible that, even under those exotic circumstances, an AGI may still be able to correctly infer our intent, and be incentivized to act in alignment with it?
> But in the context of superhuman systems, I think we need to be more concerned by the possibility that it’s performance-uncompetitive to restrict your system to only take actions that can be justified entirely with human-understandable reasoning.
Interestingly, this is already a well-known phenomenon in the hedge fund world. In fact, quant funds discovered about 25 years ago that the most consistently profitable trading signals are reliably the ones that are the least human-interpretable. It makes intuitive sense: any signal that can be understood by a human is at risk of being copied by a human, so if you insist that your trading decisions have to be interpretable, you'll pay for that insistence in alpha.
I'd imagine this kind of issue is already top-of-mind for folks who are working on the various transparency agendas, but it does imply that there's a very strong optimization pressure directly against interpretability in many economically relevant contexts. In fact, it could hardly be stronger: your forcing function is literally "Want to be a billionaire? Then you'll have to trade exclusively on the most incomprehensible signals you can find."
(Of course this isn't currently true of all hedge funds, only a few specialized ones.)
One reason to favor such a definition of alignment might be that we ultimately need a definition that gives us guarantees that hold at human-level capability or greater, and humans are probably near the bottom of the absolute scale of capabilities that can be physically realized in our world. It would (imo) be surprising to discover a useful alignment definition that held across capability levels way beyond us, but that didn't hold below our own modest level of intelligence.
No problem! Glad it was helpful. I think your fix makes sense.
I'm not quite sure what the error was in the original proof of Lemma 3; I think it may be how I converted to and interpreted the vector representation.
Yeah, I figured maybe it was because the dummy variable $\ell$ was being used in the EV to sum over outcomes, while the vector $l$ was being used to represent the probabilities associated with those outcomes. Because $\ell$ and $l$ look similar it's easy to conflate their meanings, and if you apply $\phi$ to the wrong one by accident that has the same effect as applying $\phi^{-1}$ to the other one. In any case though, the main result seems unaffected.
Thanks for writing this.
I have one point of confusion about some of the notation that's being used to prove Lemma 3. Apologies for the detail, but the mistake could very well be on my end so I want to make sure I lay out everything clearly.
First, $\phi$ is being defined here as an outcome permutation. Presumably this means that 1) $\phi(o_i) = o_j$ for some $o_i$, $o_j$; and 2) $\phi$ admits a unique inverse $\phi^{-1}(o_j) = o_i$. That makes sense.
We also define lotteries over outcomes, presumably as, e.g., $L = \sum_{i=1}^n \ell_i o_i$, where $\ell_i$ is the probability of outcome $o_i$. Of course we can interpret the $o_i$ geometrically as mutually orthogonal unit vectors, so this lottery defines a point on the $n$-simplex. So far, so good.
But the thing that's confusing me is what this implies for the definition of $\phi^{-1}(L)$. Because $\phi$ is defined as a permutation over outcomes (and not over probabilities of outcomes), we should expect this to be

$$\phi^{-1}(L) = \sum_{i=1}^n \ell_i \, \phi^{-1}(o_i).$$

The problem is that this seems to give a different EV from the lemma:

$$\mathbb{E}_{o \sim \phi^{-1}(L)}[u(o)] = \sum_{i=1}^n \ell_i \, u(\phi^{-1}(o_i)) = \mathbb{E}_{o \sim L}[u(\phi^{-1}(o))].$$
(Note that I'm using $o$ as the dummy variable rather than $\ell$, but the LHS above should correspond to line 2 of the proof.) Doing the same thing for the $M$ lottery gives an analogous result. And then looking at the inequality that results suggests that Lemma 3 should actually be "$\prec_\phi$ induces $u(\phi^{-1}(o_i))$" as opposed to "$\prec_\phi$ induces $u(\phi(o_i))$".
(As a concrete example, suppose we have a lottery $L = \ell_1 o_1 + \ell_2 o_2 + \ell_3 o_3$ with the permutation $\phi^{-1}(o_1) = o_2$, $\phi^{-1}(o_2) = o_3$, $\phi^{-1}(o_3) = o_1$. Then $\phi^{-1}(L) = \ell_1 o_2 + \ell_2 o_3 + \ell_3 o_1$ and our EV is

$$\mathbb{E}_{o \sim \phi^{-1}(L)}[u(o)] = \ell_1 u(o_2) + \ell_2 u(o_3) + \ell_3 u(o_1).$$

Yet $\mathbb{E}_{o \sim L}[u(\phi(o))] = \ell_1 u(o_3) + \ell_2 u(o_1) + \ell_3 u(o_2) \neq \mathbb{E}_{o \sim \phi^{-1}(L)}[u(o)]$, which appears to contradict the lemma as stated.)
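If it helps, the concrete example can be checked numerically. This is a quick sketch with an arbitrary choice of probabilities and utilities (the specific numbers are mine, not from the lemma):

```python
# Numerical check of the concrete example above (pure Python, no dependencies).
# Outcomes o1, o2, o3 are indices 0, 1, 2; u is an arbitrary injective utility.
l = [0.2, 0.3, 0.5]      # lottery probabilities l1, l2, l3
u = [1.0, 10.0, 100.0]   # u(o1), u(o2), u(o3)

# The permutation phi^{-1}: o1 -> o2, o2 -> o3, o3 -> o1 (as an index map).
phi_inv = {0: 1, 1: 2, 2: 0}
phi = {v: k for k, v in phi_inv.items()}   # its inverse phi

# phi^{-1}(L) = l1*o2 + l2*o3 + l3*o1: outcome phi_inv[i] gets probability l[i].
l_perm = [0.0, 0.0, 0.0]
for i in range(3):
    l_perm[phi_inv[i]] = l[i]

ev_perm_lottery = sum(l_perm[i] * u[i] for i in range(3))      # E_{o~phi^{-1}(L)}[u(o)]
ev_u_phi_inv = sum(l[i] * u[phi_inv[i]] for i in range(3))     # E_{o~L}[u(phi^{-1}(o))]
ev_u_phi = sum(l[i] * u[phi[i]] for i in range(3))             # E_{o~L}[u(phi(o))]

# The first two agree with each other, and differ from the third.
print(ev_perm_lottery, ev_u_phi_inv, ev_u_phi)
```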
Note that even if this analysis is correct, it doesn't invalidate your main claim. You only really care about the existence of a bijection rather than what that bijection is — the fact that your outcome space is finite ensures that the proportion of orbit elements that incentivize power seeking remains the same either way. (It could have implications if you try to extend this to a metric space, though.)
Again, it's also possible I've just misunderstood something here — please let me know if that's the case!
Update: having now thought more deeply about this, I no longer endorse my above comment.
While I think the reasoning was right, I got the definitions exactly backwards. To be clear, what I would now claim is:
Everything in the above comment then still goes through, except with these definitions reversed.
On the one hand, the "perfect IRL" definition of the behavioral objective seems more naturally consistent with the omnipotent experimenter setting in the IRL unidentifiability paper cited downthread. As far as I know, the term "perfect IRL" appears only in the reward modelling paper that introduces it, and that paper doesn't formally define it either. But the omnipotent experimenter setting seems to capture all the properties implied by perfect IRL, and does so precisely enough that one can use it to make rigorous statements about the behavioral objective of a system in various contexts.
On the other hand, it's actually perfectly possible for a mesa-optimizer to have a mesa-objective that is inconsistent with its own actions under some subset of conditions (the key conceptual error I was making was in thinking this was not possible). For example, a human being is a mesa-optimizer from the point of view of evolution. A human being may have something like "maximize happiness" as their mesa-objective. And a human being may, and frequently does, do things that do not maximize their happiness.
A few consequences of the above:
This seems like it puts these definitions on a more rigorous footing. It also starts to clarify in my mind the connection with the "generalization-focused approach" to inner alignment, since it suggests a procedure one might use in principle to find out whether a system is pursuing coherent utilities on some subset of distributions. ("When we do every experiment allowed by this subset of distributions, do we recover a nontrivial utility function or not?")
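As a toy illustration of that procedure (entirely my own construction, under the hypothetical assumption that experiments reveal strict pairwise preferences over a finite outcome set): the revealed preferences are rationalizable by some utility function iff the preference graph is acyclic, in which case a topological sort recovers one such function.

```python
# Sketch: "do every experiment this subset of distributions allows, and see
# whether a nontrivial utility function falls out." Here an "experiment"
# reveals a strict pairwise preference between two outcomes (hypothetical
# setup, not from the original paper).
from graphlib import TopologicalSorter, CycleError

def recover_utility(outcomes, preferences):
    """preferences: iterable of (a, b) pairs meaning 'a is preferred to b'.
    Returns a dict outcome -> utility if the preferences are coherent,
    else None."""
    graph = {o: set() for o in outcomes}
    for a, b in preferences:
        graph[a].add(b)   # b is a 'predecessor' of a: less preferred
    try:
        order = list(TopologicalSorter(graph).static_order())
    except CycleError:
        return None       # incoherent: no utility function rationalizes them
    # Less-preferred outcomes come earlier, so rank works as a utility.
    return {o: rank for rank, o in enumerate(order)}

coherent = recover_utility("abc", [("a", "b"), ("b", "c")])
cyclic = recover_utility("abc", [("a", "b"), ("b", "c"), ("c", "a")])
print(coherent, cyclic)
```

The interesting question is then how this degenerates as the allowed set of experiments shrinks: with too few revealed preferences, many trivially different utility functions become consistent with the data.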
Would definitely be interested in getting feedback on these thoughts!
I'm with you on this, and I suspect we'd agree on most questions of fact around this topic. Of course demarcation is an operation on maps and not on territories.
But as a practical matter, the moment one starts talking about the definition of something such as a mesa-objective, one has already unfolded one's map and started pointing to features on it. And frankly, that seems fine! Because historically, a great way to make forward progress on a conceptual question has been to work out a sequence of maps that give you successive degrees of approximation to the territory.
I'm not suggesting actually trying to imbue an AI with such concepts — that would be dangerous (for the reasons you alluded to) even if it weren't pointless (because prosaic systems will just learn the representations they need anyway). All I'm saying is that the moment we started playing the game of definitions, we'd already started playing the game of maps. So using an arbitrary demarcation to construct our definitions might be bad for any number of legitimate reasons, but it can't be bad just because it caused us to start using maps: our earlier decision to talk about definitions already did that.
(I'm not 100% sure if I've interpreted your objection correctly, so please let me know if I haven't.)
Yeah I agree this is a legitimate concern, though it seems like it is definitely possible to make such a demarcation in toy universes (like in the example I gave above). And therefore it ought to be possible in principle to do so in our universe.
To try to understand a bit better: does your pessimism about this come from the hardness of the technical challenge of querying a zillion-particle entity for its objective function? Or does it come from the hardness of the definitional challenge of exhaustively labeling every one of those zillion particles to make sure the demarcation is fully specified? Or is there a reason you think constructing any such demarcation is impossible even in principle? Or something else?
> I'm not sure what would constitute a clearly-worked counterexample. To me, a high reliance on an agent/world boundary constitutes a "non-naturalistic" assumption, which simply makes me think a framework is more artificial/fragile.
Oh for sure. I wouldn't recommend having a Cartesian boundary assumption as the fulcrum of your alignment strategy, for example. But what could be interesting would be to look at an isolated dynamical system, draw one boundary, investigate possible objective functions in the context of that boundary; then erase that first boundary, draw a second boundary, investigate that; etc. And then see whether any patterns emerge that might fit an intuitive notion of agency. But the only fundamentally real object here is always going to be the whole system, absolutely.
As I understand it, something like AIXI forces you to draw one particular boundary because of the way the setting is constructed (infinite on one side, finite on the other). So I'd agree that sort of thing is more fragile.
The multiagent setting is interesting though, because it gets you into the game of carving up your universe into more than 2 pieces. Again it would be neat to investigate a setting like this with different choices of boundaries and see if some choices have more interesting properties than others.
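To make the kind of experiment I have in mind concrete, here's a minimal sketch (entirely my own toy construction): fix a small linear dynamical system, enumerate candidate agent/environment boundaries, and score each boundary by how much coupling crosses it. Boundaries that cut through little coupling are the ones whose "agent" side is most self-contained.

```python
# Toy boundary-drawing experiment: for each 2-variable "agent" subset S of a
# fixed 4-variable linear system x' = A x, score the boundary by the total
# coupling weight that crosses it (lower = more self-contained agent).
from itertools import combinations

# Variables 0,1 are tightly coupled to each other and only weakly to 2,3
# (a deliberately "agent-shaped" block structure).
A = [[0.90, 0.40, 0.01, 0.00],
     [0.30, 0.80, 0.00, 0.02],
     [0.05, 0.00, 0.70, 0.50],
     [0.00, 0.03, 0.60, 0.60]]

def leakage(S):
    """Total coupling weight crossing the boundary of S, in both directions."""
    env = [j for j in range(4) if j not in S]
    return (sum(abs(A[i][j]) for i in S for j in env)
            + sum(abs(A[j][i]) for i in S for j in env))

scores = {S: leakage(S) for S in combinations(range(4), 2)}
best = min(scores, key=scores.get)
print(best, scores[best])   # a block-structured split should score lowest
```

The same loop works with more than two pieces (the multiagent case) by enumerating partitions instead of bipartitions; the whole system remains the only fundamentally real object, and the scores just rank how natural each carving is.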
> I would further add that looking for difficulties created by the simplification seems very intellectually productive.
Yep, strongly agree. And a good first step to doing this is to actually build as robust a simplification as you can, and then see where it breaks. (Working on it.)