I think we basically agree. I would also prefer people to think more about the middle case. Indeed, when I use the term mesa-optimiser, I usually intend to talk about the middle picture, though strictly that’s sinful as the term is tied to Optimisers.
Re: inner alignment
I think it’s basically the right term. I guess in my mind I want to say something like, “Inner Alignment is the problem of aligning objectives across the Mesa≠Base gap”, which shows how the two have slightly different shapes. But the difference isn’t really important.
Inner alignment gap? Inner objective gap?
I’m not talking about finding on optimiser-less definition of goal-directedness that would support the distinction. As you say, that is easy. I am interested in a term that would just point to the distinction without taking a view on the nature of the underlying goals.
As a side note I think the role of the intentional stance here is more subtle than I see it discussed. The nature of goals and motivation in an agent isn’t just a question of applying the intentional stance. We can study how goals and motivation work in the brain neuroscientifically (or at least, the processes in the brain that resemble the role played by goals in the intentional stance picture), and we experience goals and motivations directly in ourselves. So, there is more to the concepts than just taking an interpretative stance, though of course to the extent that the concepts (even when refined by neuroscience) are pieces of a model being used to understand the world, they will form part of an interpretative stance.
I understand that, and I agree with that general principle. My comment was intended to be about where to draw the line between incorrect theory, acceptable theory, and pre-theory.
In particular, I think that while optimisation is too much theory, goal-directedness talk is not, despite being more in theory-land than empirical malign generalisation talk. We should keep thinking of worries on the level of goals, even as we’re still figuring out how to characterise goals precisely. We should also be thinking of worries on the level of what we could observe empirically.
We’re probably in agreement, but I’m not sure what exactly you mean by “retreat to malign generalisation”.
For me, mesa-optimisation’s primary claim isn’t (call it Optimisers) that agents are well-described as optimisers, which I’m happy to drop. It is the claim (call it Mesa≠Base) that whatever the right way to describe them is, in general their intrinsic goals are distinct from the reward.
That’s a specific (if informal) claim about a possible source of malign generalisation. Namely, that when intrinsic goals differ arbitrarily from the reward, then systems that competently pursue them may lead to outcomes that are arbitrarily bad according to the reward. Humans don’t pose a counterexample to that, and it seems prima facie conceptually clarifying, so I wouldn’t throw it away. I’m not sure if you propose to do that, but strictly, that’s what “retreating to malign generalisation” could mean, as malign generalisation itself makes no reference to goals.
One might argue that until we have a good model of goal-directedness, Mesa≠Base reifies goals more than is warranted, so we should drop it. But I don’t think so – so long as one accepts goals as meaningful at all, the underlying model need only admit a distinction between the goal of a system and the criterion according to which a system was selected. I find it hard to imagine a model or view that wouldn’t allow this – this makes sense even in the intentional stance, whose metaphysics for goals is pretty minimal.
It’s a shame that Mesa≠Base is so entangled with Optimisers. When I think of mesa-optimisation, I tend to think more about the former than about the latter. I wish there was a term that felt like it pointed directly to Mesa≠Base without pointing to Optimisers. The Inner Alignment Problem might be it, though it feels like it’s not quite specific enough.
I’m sympathetic to what I see as the message of this post: that talk of mesa-optimisation is too specific given that the practical worry is something like malign generalisation. I agree that it makes extra assumptions on top of that basic worry, which we might not want to make. I would like to see more focus on inner alignment than on mesa-optimisation as such. I’d also like to see a broader view of possible causes for malign generalisation, which doesn’t stick so closely to the analysis in our paper. (In hindsight our analysis could also have benefitted from taking a broader view, but that wasn’t very visible at the time.)
At the same time, speaking only in terms of malign generalisation (and dropping the extra theoretical assumptions of a more specific framework) is too limiting. I suspect that solutions to inner alignment will come from taking an opinionated view on the structure of agents, clarifying its assumptions and concepts, explaining why it actually applies to real-world agents, and offering concrete ways in which the extra structure of the view can be exploited for alignment. I’m not sure that mesa-optimisation is the right view for that, but I do think that the right view will have something to do with goal-directedness.
By that I didn’t mean to imply that we care about mesa-optimisation in particular. I think that this demo working “as intended” is a good demo of an inner alignment failure, which is exciting enough as it is. I just also want to flag that the inner alignment failure doesn’t automatically provide an example of a mesa-optimiser.
I have now seen a few suggestions for environments that demonstrate misaligned mesa-optimisation, and this is one of the best so far. It combines being simple and extensible with being compelling as a demonstration of pseudo-alignment if it works (fails?) as predicted. I think that we will want to explore more sophisticated environments with more possible proxies later, but as a first working demo this seems very promising. Perhaps one could start even without the maze, just a gridworld with keys and boxes.
I don’t know whether observing key-collection behaviour here would be sufficient evidence to count for mesa-optimisation, if the agent has too simple a policy. There is room for philosophical disagreement there. Even with that, a working demo of this environment would in my opinion be a good thing, as we would have a concrete agent to disagree about.
Ah; this does seem to be an unfortunate confusion.
I didn’t intend to make ‘utility’ and ‘reward’ terminology – that’s what ‘mesa-‘ and ‘base’ objectives are for. I wasn’t aware of the terms being used in the technical sense as in your comment, so I wanted to use utility and reward as friendlier and familiar words for this intuition-building post. I am not currently inclined to rewrite the whole thing using different words because of this clash, but could add a footnote to clear this up. If the utility/reward distinction in your sense becomes accepted terminology, I’ll think about rewriting this.
That said, the distinctions we’re drawing appear to be similar. In your terminology, a utility-maximising agent is an agent which has an internal representation of a goal which it pursues. Whereas a reward-maximising agent does not have a rich internal goal representation but instead a kind of pointer to the external reward signal. To me this suggests your utility/reward tracks a very similar, if not the same, distinction between internal/external that I want to track, but with a difference in emphasis. When either of us says ‘utility ≠ reward’, I think we mean the same distinction, but what we want to draw from that distinction is different. Would you disagree?
You’re completely right; I don’t think we meant to have ‘more formally’ there.
I’ve been meaning for a while to read Dennett with reference to this, and actually have a copy of Bacteria to Bach. Can you recommend some choice passages, or is it significantly better to read the entire book?
P.S. I am quite confused about DQN’s status here and don’t wish to suggest that I’m confident it’s an optimiser. Just to point out that it’s plausible we might want to call it one without calling PPO an optimiser.
P.P.S.: I forgot to mention in my previous comment that I enjoyed the objective graph stuff. I think there might be fruitful overlap between that work and the idea we’ve sketched out in our third post on a general way of understanding pseudo-alignment. Our objective graph framework is less developed than yours, so perhaps your machinery could be applied there to get a more precise analysis?