Conditions for Mesa-Optimization

Chris van Merwijk; Vlad Mikulik; Joar Skalse; Scott Garrabrant

The Risks from Learned Optimization paper and this sequence don't seem to talk about the possibility of mesa-optimizers developing from supervised learning and the resulting inner alignment problem. The part that gets closest is

First, though we largely focus on reinforcement learning in this sequence, RL is not necessarily the only type of machine learning where mesa-optimizers could appear. For example, it seems plausible that mesa-optimizers could appear in generative adversarial networks.

I wonder if this was intentional, and if not maybe it would be worth making a note somewhere in the paper/posts that an oracle/predictor that is trained on sufficiently diverse data using SL could also become a mesa-optimizer (especially since this seems counterintuitive and might be overlooked by AI researchers/builders). See related discussion here.

Davidmanheim (7y)

I really like this formulation, and it greatly clarifies something I was trying to note in my recent paper on multiparty dynamics and failure modes - Link here. The discussion about the likelihood of mesa-optimization due to human modeling is close to the more general points I tried to make in the discussion section of that paper. As argued here about humans, other systems are optimizers (even if they are themselves only base-optimizers), and therefore any successful machine-learning system in a multiparty environment is implicitly forced to model the other parties. I called these "opponent models" and argued that they are dangerous because they are always approximate, arguing directly from that point that there is great potential for misalignment. The implication from this work, however, is that they are also dangerous because they encourage machine-learning systems in multi-agent settings to become mesa-optimizers, and mesa-optimization is a critical enabler of misalignment even when the base optimizer is well aligned.

I would add to the discussion here that multiparty systems can display the same dynamics, and therefore have risks similar to those of systems which require human models. I also think, less closely connected to the current discussion but directly related to my paper, that mesa-optimizer misalignments pose new and harder-to-understand risks when they interact with one another.

I also strongly agree with the point that current examples are not really representative of the full risk. Unfortunately, peer-reviewers strongly suggested that I have more concrete examples of failures. But as I said in the paper, "the failures seen so far are minimally disruptive. At the same time, many of the outlined failures are more problematic for agents with a higher degree of sophistication, so they should be expected not to lead to catastrophic failures given the types of fairly rudimentary agents currently being deployed. For this reason, specification gaming currently appears to be a mitigable problem, or as Stuart Russell claimed, be thought of as “errors in specifying the objective, period.”"

As a final aside, I think that the concept of mesa-optimizers is very helpful in laying out the argument against that last claim - misalignment is more than just misspecification. I think that this paper will be very helpful in showing why.

DanielFilan (7y)

To see this, we can think of optimization power as being measured in terms of the number of times the optimizer is able to divide the search space in half—that is, the number of bits of information provided.

This is pretty confusing for me: If I'm doing gradient descent, how many times am I halving the entire search space? (although I appreciate that it's hard to come up with a better measure of optimisation)

Rohin Shah (7y)

You could imagine that, if you use gradient descent to reach a loss value of $L$, then the amount of optimization applied in bits is $-\log \frac{|\{\theta \in \mathbb{R}^{d} : L(\theta) \leq L\}|}{|\mathbb{R}^{d}|}$. (Yes, I know I shouldn't be taking sizes of continuous vector spaces, but you know what I mean.)
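As a rough illustration of this measure, here is a minimal Monte Carlo sketch. It is not from the paper or the thread: the quadratic `loss`, the function name `optimization_bits`, and the choice to sample parameters uniformly from the box $[-1, 1]^d$ (so that the ratio of sizes is well defined) are all assumptions made for the example, and the logarithm is taken base 2 so the answer is in bits.

```python
import math
import random


def loss(theta):
    """Toy quadratic loss, a stand-in for a real training loss (assumption)."""
    return sum(t * t for t in theta)


def optimization_bits(loss_fn, achieved_loss, dim, n_samples=100_000, seed=0):
    """Estimate -log2 of the fraction of parameters with loss <= achieved_loss.

    Parameters are sampled uniformly from the box [-1, 1]^dim, which stands in
    for the unbounded space R^d (an assumption, not part of the original formula).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        theta = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        if loss_fn(theta) <= achieved_loss:
            hits += 1
    if hits == 0:
        # No sample did this well: too few samples to resolve the estimate.
        return float("inf")
    return -math.log2(hits / n_samples)


if __name__ == "__main__":
    for target in (1.0, 0.25, 0.05):
        print(target, optimization_bits(loss, target, dim=2))
```

Lower achieved loss corresponds to a smaller surviving fraction of parameter space and therefore more bits. The estimate depends heavily on the sampling region, which is one concrete face of the "sizes of continuous vector spaces" caveat.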

adamShimi (5y)

I think there is a typo in your formula, because the number of bits you get is negative. Going back to Yudkowsky's post, I think the correct formula (using your approximations of sizes) is $\log \frac{|\mathbb{R}^{d}|}{|\{\theta \in \mathbb{R}^{d} \mid L(\theta) \leq L\}|}$, or $-\log \frac{|\{\theta \in \mathbb{R}^{d} \mid L(\theta) \leq L\}|}{|\mathbb{R}^{d}|}$ to be closer to the entropy notation.

Rohin Shah (5y)

Yeah, you're right, fixed.

Rohin Shah (7y)

Minor nitpicks:

The justification for this is that optimization is a general algorithm that looks the same regardless of what environment it is applied to, so the amount of optimization required to find a x-bit optimizer should be independent of the environment.

This sounds blatantly false to me unless the only things you count as optimizers are things like AIXI. (In particular, humans could not count as optimizers.) Especially if you've already put in some base-level optimization to narrow down the search space. You then need to "encode" the knowledge you already have into the optimizer, so that it doesn't redo the work already done by the base-level optimization. In practice this looks like inductive biases and embedding of domain knowledge.

As a particular example, many NP-hard problems become easy if they have particular structure. (Horn-SAT and 2-SAT are easy while SAT is hard; longest path is easy in DAGs but hard with general graphs.)
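To make the longest-path example concrete, here is a minimal sketch of why the problem is easy on DAGs: processing nodes in topological order lets a simple dynamic program find the longest path in linear time. The function name and the edge-list representation are choices made for this illustration only.

```python
from collections import defaultdict, deque


def longest_path_length(num_nodes, edges):
    """Length (in edges) of the longest path in a DAG, via topological order.

    Raises ValueError if the graph has a cycle, where this method does not apply.
    """
    successors = defaultdict(list)
    indegree = [0] * num_nodes
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1

    # Kahn's algorithm: a node is ready once all its predecessors are processed.
    ready = deque(n for n in range(num_nodes) if indegree[n] == 0)
    best = [0] * num_nodes  # best[v] = longest path ending at v
    processed = 0
    while ready:
        u = ready.popleft()
        processed += 1
        for v in successors[u]:
            best[v] = max(best[v], best[u] + 1)
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)

    if processed != num_nodes:
        raise ValueError("graph contains a cycle")
    return max(best, default=0)


if __name__ == "__main__":
    print(longest_path_length(4, [(0, 1), (1, 2), (0, 2), (2, 3)]))
```

The structure (acyclicity) is doing the work here: the same dynamic program has no valid processing order on a graph with cycles, which is where the general problem becomes hard.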

I get that this is a footnote and that it's a toy model that doesn't claim to mimic reality, but it seems like a very false statement to me.

$x = \operatorname{argmax}_{x} \frac{P - f(x)}{N} + x$

Can you rename the LHS variable $x$ to something else, like $x^{*}$, to avoid confusion with the $x$ on the RHS?
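For readers who want to play with the toy model, here is a small numeric sketch of the argmax, including the constraint $P - f(x) \geq 0$ from the footnotes. The grid search, the function names, and in particular the quadratic cost $f(x) = x^2$ are assumptions chosen for illustration; the post does not specify $f$.

```python
def best_mesa_bits(P, N, f, x_max, step=0.01):
    """Grid-search the x that maximizes (P - f(x)) / N + x, subject to P - f(x) >= 0.

    P: total optimization power of the base optimizer (bits)
    N: number of distinct instances the remaining power is spread across
    f: cost, in bits, of finding a mesa-optimizer that performs x bits
    """
    best_x, best_value = 0.0, None
    for i in range(int(round(x_max / step)) + 1):
        x = i * step
        if P - f(x) < 0:
            continue  # infeasible: the base optimizer cannot afford this x
        value = (P - f(x)) / N + x
        if best_value is None or value > best_value:
            best_x, best_value = x, value
    return best_x, best_value


def quadratic_cost(x):
    """Hypothetical cost function f(x) = x^2, chosen only for illustration."""
    return x * x


if __name__ == "__main__":
    for n in (1, 4, 40):
        print(n, best_mesa_bits(100, n, quadratic_cost, x_max=20))
```

With this particular $f$, the unconstrained optimum is $x = N/2$, so a larger $N$ pushes more of the optimization into the mesa-optimizer until the constraint $x \leq \sqrt{P}$ binds. That qualitative behavior depends on the assumed $f$.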

Algorithmic range.

Is this different from expressivity, or the size of the model's hypothesis class? I can't tell how this is different from the inductive biases section, especially its third point.

For example, architectures that explicitly give the algorithm access to a wide range of possible computations, such as recurrent neural networks or neural Turing machines,(14) seem more likely to produce mesa-optimizers.

I (and I suspect other ML researchers) would call these inductive biases, not algorithmic range / model expressivity.

DanielFilan (7y)

AFAICT, algorithmic range isn't the same thing as model capacity: I think that tabular learners have low algorithmic range, as the terms are used in this post, but high model capacity.

evhub (7y)

It definitely will vary with the environment, though the question is degree. I suspect most of the variation will be in how much optimization power you need, as opposed to how difficult it is to get some degree of optimization power, which motivates the model presented here—though certainly there will be some deviation in both. The footnote should probably be rephrased so as not to assert that it is completely independent, as I agree that it obviously isn't, but just that it needs to be relatively independent, with the amount of optimization power dominating for the model to make sense.

Renamed $x$ to $x^{*}$ —good catch (though editing doesn't appear to be working for me right now—it should show up in a bit)!

Algorithmic range is very similar to model capacity, except that we're thinking slightly more broadly as we're more interested in the different sorts of general procedures your model can learn to implement than how many layers of convolutions you can do. That being said, they're basically the same thing.

evhub (7y)

I actually just updated the paper to use model capacity instead of algorithmic range to avoid needlessly confusing machine learning researchers, though I'm keeping algorithmic range here.

Rohin Shah (7y)

I suspect most of the variation will be in how much optimization power you need, as opposed to how difficult it is to get some degree of optimization power, which motivates the model presented here—though certainly there will be some deviation in both.

Fwiw, I have the opposite intuition quite strongly, but not sure it's worth debating that here.

Rohin Shah (7y)

Better generalization through search.

Search only generalizes well when you are able to accurately determine the available options, the consequences of selecting those options, and the utility of those consequences. It's extremely unclear whether a mesa-optimizer would be able to do all three of these things well enough for search to actually generalize. Selection and Control makes some similar points.

We are already encountering some problems, however—Go, Chess, and Shogi, for example—for which this approach does not scale.

There should be a lot of caveats to this:

I'm pretty sure that even if you remove the MCTS at test time, AlphaZero will be very good at the game. (I'm pretty sure we could find numbers for this somewhere if it was a crux. I spent two minutes looking and didn't find them.)
I'd also bet that with more compute and a larger model, AlphaZero (even without MCTS at test time) would continue improving.
AlphaZero assumes access to a perfect simulator of the environment (i.e. the rules of the game), which is why hardcoded search generalizes correctly. It's not clear what would happen if you forced AlphaZero to also learn the rules of the game. That's the setting used in Dota and StarCraft, and notably in both of those environments we did not use the AlphaZero approach, and we did see that it was "generally favorable for most of the optimization work to be done by the base optimizer". (Unless you think that OpenAI Five / AlphaStar did in fact have mesa-optimizers, and we can't tell because the neural nets are opaque.)

Arguably, this sort of task is only adequately solvable this way—if it were possible to train a straightforward DQN agent to perform well at Chess, it plausibly would have to learn to internally perform something like a tree search, producing a mesa-optimizer.

Given enough time and space, a DQN agent turns into a lookup table, which could encode the optimal policy for chess. I'd appreciate a rewrite of the sentence, or a footnote that says something to the effect of "assuming reasonable time and space limitations on the agent".

I also disagree with the spirit of the sentence. My intuition is that with sufficient model capacity and training time (say, all of the computing resources in the world today), you could get a very large bundle of learned heuristics that plays chess well. (Depending on your threshold for "plays chess well", AlphaZero without MCTS at test time might already reach it.) Of course, you can always amplify any such agent by throwing a small hardcoded MCTS or alpha-beta tree search on top of it, and so I'd expect the best agent at any given level of compute to be something of that form.

evhub (7y)

I believe AlphaZero without MCTS is still very good but not superhuman—International Master level, I believe. That being said, it's unclear how much optimization/search is currently going on inside of AlphaZero's policy network. My suspicion would be that currently it does some, and that to perform at the same level as the full AlphaZero it would have to perform more.

I added a footnote regarding capacity limitations (though editing doesn't appear to be working for me right now—it should show up in a bit). As for the broader point, I think it's just a question of degree—for a sufficiently diverse environment, you can do pretty well with just heuristics, you do better introducing optimization, and you keep getting better as you keep doing more optimization. So the question is just what does "perform well" mean and what threshold are you drawing for "internally performs something like a tree search."

Rohin Shah (7y)

you can do pretty well with just heuristics, you do better introducing optimization, and you keep getting better as you keep doing more optimization.

I agree with this, but I don't think it's the point that I'm making; my claim is more that "just heuristics" is enough for arbitrary levels of performance (even if you could improve that by adding hardcoded optimization).

So the question is just what does "perform well" mean and what threshold are you drawing for "internally performs something like a tree search."

I don't think my claim depends much on the threshold of "perform well", and I suspect that if you do think the current model is performing something like a tree search, you could make the model larger and run the same training process and it would no longer perform something like a tree search.

Ofer (7y)

my claim is more that "just heuristics" is enough for arbitrary levels of performance (even if you could improve that by adding hardcoded optimization).

This claim seems incorrect for at least some tasks (if you already think that, skip the rest of this comment).

Consider the following 2-player turn-based zero-sum game as an example of a task in which "heuristics" seemingly can't replace a tree search.

The game starts with an empty string. In each turn the following things happen:

(1) the player adds to the end of the string either "A" or "B".

(2) the string is replaced with its SHA256 hash.

Player 1 wins iff after 10 turns the first bit in the binary representation of the string is 1.

(Alternatively, consider the 1-player version of this game, starting with a random string.)
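The game above is small enough to solve exactly by tree search, which makes the contrast with heuristics easy to see. Below is a minimal sketch under some assumptions the description leaves open: the string is held as raw bytes, the "first bit" is the most significant bit of the first digest byte, and Player 1 moves first with the players alternating. The function names are made up for this example.

```python
import hashlib
from functools import lru_cache

TURNS = 10
MOVES = (b"A", b"B")


def step(state: bytes, move: bytes) -> bytes:
    """Append the move, then replace the string with its SHA-256 hash."""
    return hashlib.sha256(state + move).digest()


def first_bit(state: bytes) -> int:
    """Most significant bit of the first byte (assumed reading of 'first bit')."""
    return state[0] >> 7


@lru_cache(maxsize=None)
def player1_wins(state: bytes = b"", turn: int = 0) -> bool:
    """True iff Player 1 can force a win from this position under perfect play."""
    if turn == TURNS:
        return first_bit(state) == 1
    outcomes = [player1_wins(step(state, m), turn + 1) for m in MOVES]
    # Player 1 (even turns) needs one winning move; on Player 2's turns
    # (odd turns) every reply must still be winning for Player 1.
    return any(outcomes) if turn % 2 == 0 else all(outcomes)


if __name__ == "__main__":
    print("Player 1 can force a win:", player1_wins())
```

Exhaustive search visits only $2^{10}$ leaves here, whereas a heuristic evaluation of intermediate positions would have to predict bits of an iterated SHA-256 output, which is the point of the example.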

Rohin Shah (7y)

Yeah, agreed; I meant that claim to apply to "realistic" tasks (which I don't yet know how to define).

Wei Dai (6y)

I meant that claim to apply to "realistic" tasks (which I don't yet know how to define).

Machine learning seems hard to do without search, if that counts as a "realistic" task. :)

I wonder if you can say something about what your motivation is to talk about this, i.e., are there larger implications if "just heuristics" is enough for arbitrary levels of performance on "realistic" tasks?

Rohin Shah (6y)

Machine learning seems hard to do without search, if that counts as a "realistic" task. :)

Humans and systems produced by meta learning both do reasonably well at learning, and don't do "search" (depending on how loose you are with your definition of "search").

I wonder if you can say something about what your motivation is to talk about this, i.e., are there larger implications if "just heuristics" is enough for arbitrary levels of performance on "realistic" tasks?

It's plausible to me that for tasks that we actually train on, we end up creating systems that are like mesa optimizers in the sense that they have broad capabilities that they can use on relatively new domains that they haven't had much experience on before, but nonetheless, because they aren't made up of two clean parts (mesa objective + capabilities), there isn't a single obvious mesa objective that the AI system is optimizing for off distribution. I'm not sure what happens in this regime, but it seems like it undercuts the mesa optimization story as told in this sequence.

Fwiw, on the original point, even standard machine learning algorithms (not the resulting models) don't seem like "search" to me, though they also aren't just a bag of heuristics and they do have a clearly delineated objective, so they fit well enough in the mesa optimization story.

(Also, reading back through this comment thread, I'm no longer sure whether or not a neural net could learn to play at least the 1-player random version of the SHA game. Certainly in the limit it can just memorize the input-output table, but I wouldn't be surprised if it could get some accuracy even without that.)

Wei Dai (6y)

It’s plausible to me that for tasks that we actually train on, we end up creating systems that are like mesa optimizers in the sense that they have broad capabilities that they can use on relatively new domains that they haven’t had much experience on before, but nonetheless, because they aren’t made up of two clean parts (mesa objective + capabilities), there isn’t a single obvious mesa objective that the AI system is optimizing for off distribution.

Coming back to this, can you give an example of the kind of thing you're thinking of (in humans, animals, current ML systems)? Or another reason you think this could be the case in the future?

Also, do you think this will be significantly more efficient than "two clean parts (mesa objective + capabilities)"? (If not, it seems like we can use inner alignment techniques, e.g., transparency and verification, to force the model to be "two clean parts" if that's better for safety.)

Rohin Shah (6y)

Coming back to this, can you give an example of the kind of thing you're thinking of (in humans, animals, current ML systems)?

Humans don't seem to have one mesa objective that we're optimizing for. Even in this community, we tend to be uncertain about what our actual goal is, and most other people don't even think about it. Humans do lots of things that look like "changing their objective", e.g. maybe someone initially wants to have a family but then realizes they want to devote their life to public service because it's more fulfilling.

Also, do you think this will be significantly more efficient than "two clean parts (mesa objective + capabilities)"?

I suspect it would be more efficient, but I'm not sure. (Mostly this is because humans and animals don't seem to have two clean parts, but quite plausibly we'll do something more interpretable than evolution and that will push towards a clean separation.) I also don't know whether it would be better for safety to have it split into two clean parts.

Wei Dai (6y)

Humans do lots of things that look like “changing their objective” [...]

That's true, but unless the AI is doing something like human imitation or metaphilosophy (in other words, we have some reason to think that the AI will converge to the "right" values), it seems dangerous to let it "change its objective" on its own. Unless, I guess, it's doing something like mild optimization or following norms, so that it can't do much damage even if it switches to a wrong objective, and we can just shut it down and start over. But if it's as messy as humans are, how would we know that it's strictly following norms or doing mild optimization, and won't "change its mind" about that too at some point (kind of like a human who isn't very strategic suddenly has an insight or reads something on the Internet and decides to become strategic)?

I think overall I'm still confused about your perspective here. Do you think this kind of "messy" AI is something we should try to harness and turn into a safety success story (if so how), or do you think it's a danger that we should try to avoid (which may for example have to involve global coordination because it might be more efficient than safer AIs that do have clean separation)?

Oh, going back to an earlier comment, I guess you're suggesting some of each: try to harness at lower capability levels, and coordinate to avoid at higher capability levels.

Rohin Shah (6y)

In this entire comment thread I'm not arguing that mesa optimizers are safe, or proposing courses of action we should take to make mesa optimization safe. I'm simply trying to forecast what mesa optimizers will look like if we follow the default path. As I said earlier,

I'm not sure what happens in this regime, but it seems like it undercuts the mesa optimization story as told in this sequence.

It's very plausible that the mesa optimizers I have in mind are even more dangerous, e.g. because they "change their objective". It's also plausible that they're safer, e.g. because they are full-blown explicit EU maximizers and we can "convince" them to adopt goals similar to ours.

Mostly I'm saying these things because I think the picture presented in this sequence is not fully accurate, and I would like it to be more accurate. Having an accurate view of what problems will arise in the future tends to help with figuring out solutions to those problems.

Wei Dai (6y)

Humans and systems produced by meta learning both do reasonably well at learning, and don’t do “search” (depending on how loose you are with your definition of “search”).

Part of what inspired me to write my comment was watching my kid play logic puzzles. When she starts a new game, she has to do a lot of random trial-and-error with backtracking, much like MCTS. (She does the trial-and-error on the physical game board, but when I play I often just do it in my head.) Then her intuition builds up and she can start to recognize solutions earlier and earlier in the search tree, sometimes even immediately upon starting a new puzzle level. Then the game gets harder (the puzzle levels slowly increase in difficulty) or moves to a new regime where her intuitions don't work, and she has to do more trial-and-error again, and so on. This sure seems like "search" to me.

Fwiw, on the original point, even standard machine learning algorithms (not the resulting models) don’t seem like “search” to me, though they also aren’t just a bag of heuristics and they do have a clearly delineated objective, so they fit well enough in the mesa optimization story.

This really confuses me. Maybe with some forms of supervised learning you can either calculate the solution directly, or just follow a gradient (which may be arguable whether that's search or not), but with RL, surely the "explore" steps have to count as "search"? Do you have a different kind of thing in mind when you think of "search"?

Rohin Shah (6y)

This sure seems like "search" to me.

I agree that if you have a model of the system (as you do when you know the rules of the game), you can simulate potential actions and consequences, and that seems like search.

Usually, you don't have a good model of the system, and then you need something else.

Maybe with some forms of supervised learning you can either calculate the solution directly, or just follow a gradient (which may be arguable whether that's search or not), but with RL, surely the "explore" steps have to count as "search"?

I was thinking of following a gradient in supervised learning.

I agree that pure reinforcement learning with a sparse reward looks like search. I doubt that pure RL with sparse reward is going to get you very far.

Reinforcement learning with demonstrations or a very dense reward doesn't really look like search, it looks more like someone telling you what to do and you following the instructions faithfully.

Rohin Shah (7y)

A system capable of reasoning about optimization is likely also capable of reusing that same machinery to do optimization itself, resulting in a mesa-optimizer.

In this case it seems like you'd have a policy that uses "optimization machinery" to:

Predict what other agents are going to do
Create plans to achieve some form of inner objective

So, the model-outputted-by-the-base-optimization is a policy that chooses how to use the optimization machinery, not the optimization machinery itself. This seems substantially different from your initial concept of a mesa-optimizer

Mesa-optimization occurs when a base optimizer (in searching for algorithms to solve some problem) finds a model that is itself an optimizer, which we will call a mesa-optimizer.

and seems more like a subagent. But perhaps I've misunderstood what you meant by a mesa-optimizer.

Vlad Mikulik (7y)

The section on human modelling annoyingly conflates two senses of human modelling. One is the sense you talk about, the other is seen in the example:

For example, it might be the case that predicting human behavior requires instantiating a process similar to human judgment, complete with internal motives for making one decision over another.

The idea there isn't that the algorithm simulates human judgement as an external source of information for itself, but that the actual algorithm learns to be a human-like reasoner, with human-like goals (because that's a good way of approximating the output of human-like reasoning). In that case, the agent really is a mesa-optimiser, to the degree that a goal-directed human-like reasoner is an optimiser.

(I'm not sure to what degree it's actually likely that a good way to approximate the behaviour of human-like reasoning is to instantiate human-like reasoning)

Rohin Shah (7y)

Just to make sure I understand, this example assumes that the base objective is "predict human behavior", and doesn't apply to most base objectives, right?

Vlad Mikulik (7y)

Yes, it probably doesn’t apply to most objectives. Though it seems to me that the closer the task is to something distinctly human, the more probable it is that this kind of consideration can apply. E.g., making judgements in criminal court cases and writing fiction are domains where it’s not implausible to me that this could apply.

I do think this is a pretty speculative argument, even for this sequence.

evhub (7y)

The idea would be that all of this would be learned—if the optimization machinery is entirely internal to the system, it can choose how to use that optimization machinery arbitrarily. We talk briefly about systems where the optimization is hard-coded, but those aren't mesa-optimizers. Rather, we're interested in situations where your learned algorithm itself performs optimization internal to its own workings—optimization it could re-use to do prediction or vice versa.

Rohin Shah (7y)

It sounds like there was a misunderstanding somewhere -- I'm aware that all of this would be learned; my point is that the learned policy contains an optimizer rather than being an optimizer, which seems like a significant point, and your original definition sounded like you wanted the learned policy to be an optimizer.

Rohin Shah (7y)

One possible means of alleviating some of these issues might be to include hard-coded optimization where the learned algorithm provides only the objective function and not the optimization algorithm.

Given that the risk comes from the inner objective being misaligned, how does this help?

evhub (7y)

The argument in this post is just that it might help prevent mesa-optimization from happening at all, not that it would make it more aligned. The next post will be about how to align mesa-optimizers.

Rohin Shah (7y)

Is it a requirement of mesa-optimization that the optimization algorithm must be learned? I would have expected that it would only be a requirement that the objective be learned. Are there considerations that apply to learned optimization algorithms that don't apply to hardcoded optimization algorithms?

Vlad Mikulik (7y)

The main benefit I see of hardcoding optimisation is that, assuming the system's pieces learn as intended (without any mesa-optimisation happening in addition to the hardcoded optimisation) you get more access and control as a programmer over what the learned objective actually is. You could attempt to regress the learned objective directly to a goal you want, or attempt to enforce a certain form on it, etc. When the optimisation itself is learned*, the optimiser is more opaque, and you have fewer ways to affect what goal is learned: which weights of your enormous LSTM-based mesa-optimiser represent the objective?

This doesn't solve the problem completely (you might still learn an objective that is very incorrect off-distribution, etc.), but could offer more control and insight into the system to the programmer.

*Of course, you can have learned optimisation where you keep track of the objective which is being optimised (like in Learning to Learn by Gradient Descent), but I'd class that more under hard-coded optimisation for the purposes of this discussion. Here I mean the kind of learned optimisation that happens where you're not building the architecture explicitly around optimising or learning to optimise.
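To illustrate the split being discussed, here is a minimal sketch in which only the objective is learned and the optimization is a fixed, hand-written search. Everything in it is an assumption made for the example: the linear objective, the per-example gradient-descent fit, the exhaustive argmax over candidates, and the function names. It is not a description of any system mentioned in the post.

```python
def fit_linear_objective(examples, lr=0.1, epochs=500):
    """Learn weights w so that dot(w, features) approximates the given scores.

    Uses per-example gradient descent on squared error. The learned objective
    is the explicit, inspectable vector w.
    """
    dim = len(examples[0][0])
    w = [0.0] * dim
    for _ in range(epochs):
        for features, score in examples:
            error = sum(wi * xi for wi, xi in zip(w, features)) - score
            w = [wi - lr * error * xi for wi, xi in zip(w, features)]
    return w


def hardcoded_optimizer(objective_weights, candidates):
    """Fixed search procedure: return the candidate with the highest learned score."""
    def score(candidate):
        return sum(wi * xi for wi, xi in zip(objective_weights, candidate))
    return max(candidates, key=score)


if __name__ == "__main__":
    # Hypothetical training data: feature vectors paired with target scores.
    data = [((1.0, 0.0), 1.0), ((0.0, 1.0), -2.0), ((1.0, 1.0), -1.0)]
    weights = fit_linear_objective(data)
    print("learned objective:", weights)
    print("chosen action:", hardcoded_optimizer(weights, [(0, 0), (1, 0), (0, 1), (1, 1)]))
```

Because the objective lives in `weights` rather than being entangled with a learned search procedure, the programmer can read it, regularize it, or regress it toward a desired goal, which is the kind of access described above. It can still be wrong off-distribution.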

Rohin Shah (7y)

That makes sense, thanks.

As of the date of this post. Note that we do examine some existing machine learning systems that we believe are close to producing mesa-optimization in post 5. ↩︎
It is worth noting that the same argument also holds for achieving an average-case guarantee. ↩︎
Assuming reasonable computational constraints. ↩︎
This definition of $N$ is somewhat vague, as there are multiple different levels at which one can chunk an environment into instances. For example, one environment could always have the same high-level features but completely random low-level features, whereas another could have two different categories of instances that are broadly self-similar but different from each other, in which case it's unclear which has a larger $N$. However, one can simply imagine holding $N$ constant for all levels but one and just considering how environment diversity changes on that level. ↩︎
Note that this makes the implicit assumption that the amount of optimization power required to find a mesa-optimizer capable of performing $x$ bits of optimization is independent of $N$. The justification for this is that optimization is a general algorithm that looks the same regardless of what environment it is applied to, so the amount of optimization required to find an $x$-bit optimizer should be relatively independent of the environment. That being said, it won't be completely independent, but as long as the primary difference between environments is how much optimization they need, rather than how hard it is to do optimization, the model presented here should hold. ↩︎
Note, however, that there will be some maximum $x$ simply because the learned algorithm generally only has access to so much computational power. ↩︎
Subject to the constraint that $P - f(x) \geq 0$. ↩︎

AI ALIGNMENT FORUM