Update March 2023: This is too confusing, skip it.
In my last post I asked if there were more natural terms for inner alignment, outer alignment, and mesa optimizers. I'm looking for a different way to slice those problems that leads to clearer understanding, communication, and research directions. This is my attempt. I aim to use terms that are older or have apparent meaning without needing a definition.
I do not aim to raise important new questions or otherwise be especially novel; I just want to be less confused.
Whether a model optimizes its environment depends on both its ability and inclination.
We want to know the model's planning capacity in the kinds of environments it might end up in, almost with indifference to the method of planning. If an Atari agent can also somehow plan a route through town or make a coherent dinner recipe, without training on those tasks, then it is obviously quite capable. We should not disregard the ability of a system that doesn't appear to be using 'search'.
(What differentiates humans from other animals is not that one is an optimizer and the others are not; it's a question of degree. It makes more sense to ask and measure "how good is this thing at planning, and in what contexts?" than "how much of an optimizer is it?" And the word 'intelligence' has been empirically shown to be very confusing, so substituting something more specific when possible would be good.)
This capability is mediated by the model's internal preferences (or utility function) over the relevant ontology. GPT has significant capacity to plan a diamond heist and a strong preference for coherence (state space = words), but it is indifferent, for example, as to whether the reader executes the plan (state space = real-world outcomes).
We often gesture at internal preferences indirectly and prefer more objective terminology. Internal preferences are difficult to measure and specify, but they are the fundamental determinant of a model's behavior as its abilities become very strong. There is an argument that goes: "Speaking of the utility function of this 'agent' makes no sense because it's just trying to classify pictures. Utility functions must be irrelevant." But sometimes models are simply indifferent over anything outside their own output space. In some ways this is quite a good property to have. The notion hasn't fallen apart; it's actually quite a clear case. We often use observed preferences as a proxy for internal preferences. What kind of preferences you estimate from behavior is determined by your preference estimation method.
It is difficult to measure the planning capacity of e.g. GLM-130B because we have trouble designing tests that elicit the full capabilities of the system. We could call a test which shows the full ability of the system a generous test. The test that allows the model to accomplish a given task most effectively could be called the most generous test. A test that causes the model to perform below its real abilities might be a prejudiced test.
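To make the generous/prejudiced vocabulary concrete, here is a minimal sketch in Python. Everything in it is hypothetical and for illustration only: `toy_model`, `make_test`, the prompt templates, and the scoring rule are stand-ins, not a real evaluation harness. It scores one model under several tests, names the highest-scoring one the most generous test among those tried, and flags the rest as prejudiced relative to it.

```python
from typing import Callable, Dict, List

Model = Callable[[str], str]
Test = Callable[[Model], float]  # returns a task score in [0, 1]


def toy_model(prompt: str) -> str:
    """Hypothetical model: it can add, but only shows it for 'Q: a+b' prompts."""
    if prompt.startswith("Q: "):
        a, b = prompt[3:].split("+")
        return str(int(a) + int(b))
    return "?"


def make_test(template: str) -> Test:
    """Build a test that wraps the same questions in a given prompt template."""
    cases = [("2+3", "5"), ("10+7", "17")]

    def test(model: Model) -> float:
        hits = sum(model(template.format(q=q)) == ans for q, ans in cases)
        return hits / len(cases)

    return test


def score_tests(model: Model, tests: Dict[str, Test]) -> Dict[str, float]:
    """Run every test against the model and collect the scores."""
    return {name: test(model) for name, test in tests.items()}


def most_generous(scores: Dict[str, float]) -> str:
    """Name of the test under which the model performed best."""
    return max(scores, key=scores.get)


def prejudiced(scores: Dict[str, float], tolerance: float = 0.0) -> List[str]:
    """Tests that score below the best observed score by more than `tolerance`."""
    best = max(scores.values())
    return sorted(name for name, s in scores.items() if s < best - tolerance)


tests = {"bare": make_test("{q}"), "qa_format": make_test("Q: {q}")}
scores = score_tests(toy_model, tests)
print(scores, most_generous(scores), prejudiced(scores))
```

Note that the best score over any finite set of tests is only a lower bound on the model's ability: a test nobody has written yet might be more generous still.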
Less-vague things you can say with these words:
Research questions in this language:
As for mesa-optimizers, one should not be surprised when one's Atari agent is optimizing the Atari game. It was trained to do exactly this. (We are instead interested in exactly how general that agent's planning capability is.) In contrast, a spontaneous optimization daemon is an agent that forms spontaneously inside any dynamic system proceeding through time. Reproduction, sexual selection, and predator-prey dynamics are not hard-coded into the universe; they all spontaneously occurred inside a vibrating box of sand. (It kind of stuns me that the universe is not really an optimizer at all, or at least no more than a cellular automaton is, yet produced humans.)
But it's worth naming the intermediate process between physics and humans. I'd say natural selection is a selection process. Seeing humans among the atoms is quite weird but the daemon looks a bit less spontaneous once the more-sophisticated selection process comes into view. (More in appendix.)
Research question random thought: Has anyone tried running a noisy Game of Life on a supercomputer for a month and watching what happens?
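For anyone who wants to poke at this on a laptop first, here is a small sketch of what "noisy Game of Life" could mean. The noise model is my assumption, not an established definition: after each standard Conway update on a wrap-around grid, every cell flips with independent probability `noise`. The function names, grid size, and defaults are all illustrative.

```python
import random


def step(grid, noise, rng):
    """One Conway update on a torus, then flip each cell with probability `noise`."""
    rows, cols = len(grid), len(grid[0])
    new = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Count the eight neighbours, wrapping around the edges.
            n = sum(
                grid[(r + dr) % rows][(c + dc) % cols]
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
                if (dr, dc) != (0, 0)
            )
            # Standard rules: birth on 3 neighbours, survival on 2 or 3.
            alive = 1 if (n == 3 or (grid[r][c] == 1 and n == 2)) else 0
            # Assumed noise model: independent bit flip per cell.
            if rng.random() < noise:
                alive = 1 - alive
            new[r][c] = alive
    return new


def run(rows=32, cols=32, steps=100, noise=0.001, seed=0):
    """Start from a random grid and record the live-cell count after each step."""
    rng = random.Random(seed)
    grid = [[rng.randint(0, 1) for _ in range(cols)] for _ in range(rows)]
    history = []
    for _ in range(steps):
        grid = step(grid, noise, rng)
        history.append(sum(map(sum, grid)))
    return grid, history


if __name__ == "__main__":
    _, history = run()
    print(history[-10:])
```

The live-cell count is only the crudest possible statistic. Detecting anything like replicators or selection would need a much better measure, and this pure-Python loop is far too slow for the supercomputer-month version of the question.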
The problem of taking what you want the model to do and turning it into code is already known as (reward or loss or value or task or objective) specification. Whether you are giving the model examples to imitate, a loss function to minimize, or something else, the intention and risks are largely the same.
We know the goal of maintaining desired behavior under distribution shift as robustness. Of course an RL agent will not typically be estimating true reward perfectly; this is nothing new. Its error is, even more mundanely, known as test loss. Perhaps call it production loss, since everyone is always looking at their test loss and it's not much of a test. The distinction between capability robustness and objective robustness is a good one, and clear as is.
We know a failure of objective robustness as goal misgeneralization and capability robustness
Things you can say with these words:
(Obvious) research questions:
Are these the true names? Are we calling the problems what they are? Are the research questions above pointing at real things? Do they lead to direct investigations of the territory? Is it actually any easier to assess the general planning capacity of a model than to answer whether it's a mesa-optimizer?
My real motivation is just to understand what people are even talking about when they write about inner alignment and mesa-optimizers.
Very curious if I've hit the mark. Happy to receive feedback about better/clearer/older terms for these concepts, alternative breakdowns, or important gaps.
I can think of some ways to accomplish objectives without having any planning capacity per se.
One might call these memorizers, direct implementations, an embedded selection process, and dynamic systems (not an optimizer).
So is the term 'planning capacity' too restrictive? Are we excluding important cases from our field of view? Well, an ideal memorizer does effectively behave as an ideal planner. Already knowing what happens next is not a lot different from being able to imagine it. So one can still speak of GPT-3 as having instantiated a text planning system, even if it is implemented as compressed memorization under the hood. It makes sense to ask how well or how generally it makes plans, or how far ahead it looks. And planning capacity feels more natural and important to me than memorization.
Selection processes are certainly quite different from either of these. I'd say that when a system is running a selection process, we'll probably know it, see it, or have specified it. It's such a unique case that control and analysis methods from planners and memorizers will transfer very poorly.
So I think the term planning capacity is a bit narrow but feels about right for where we are and the questions we're asking right now.
I say model instead of agent because not all models act much like agents; for many purposes we don't need agents. If you refer to models as learned algorithms then nobody will know what you're talking about.
Of course excluding inputs and the operating environment.
Apologies if there are existing terms for these.