[AN #58] Mesa optimization: what it is, and why we should care

Rohin Shah

Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter. I'm always happy to hear feedback; you can send it to me by replying to this email.

Highlights

Risks from Learned Optimization in Advanced Machine Learning Systems (Evan Hubinger et al): Suppose you search over a space of programs, looking for one that plays TicTacToe well. Initially, you might find some good heuristics, e.g. go for the center square, if you have two along a row then place the third one, etc. But eventually you might find the minimax algorithm, which plays optimally by searching for the best action to take. Notably, your outer optimization over the space of programs found a program that was itself an optimizer that searches over possible moves. In the language of this paper, the minimax algorithm is a mesa optimizer: an optimizer that is found autonomously by a base optimizer, in this case the search over programs.

Why is this relevant to AI? Well, gradient descent is an optimization algorithm that searches over the space of neural net parameters to find a set that performs well on some objective. It seems plausible that the same thing could occur: gradient descent could find a model that is itself performing optimization. That model would then be a mesa optimizer, and the objective that it optimizes is the mesa objective. Note that while the mesa objective should lead to similar behavior as the base objective on the training distribution, it need not do so off distribution. This means the mesa objective is pseudo aligned; if it also leads to similar behavior off distribution it is robustly aligned.

A central worry with AI alignment is that if powerful AI agents optimize the wrong objective, it could lead to catastrophic outcomes for humanity. With the possibility of mesa optimizers, this worry is doubled: we need to ensure both that the base objective is aligned with humans (called outer alignment) and that the mesa objective is aligned with the base objective (called inner alignment). A particularly worrying aspect is deceptive alignment: the mesa optimizer has a long-term mesa objective, but knows that it is being optimized for a base objective. So, it optimizes the base objective during training to avoid being modified, but at deployment when the threat of modification is gone, it pursues only the mesa objective.

As a motivating example, if someone wanted to create the best biological replicators, they could have reasonably used natural selection / evolution as an optimization algorithm for this goal. However, this then would lead to the creation of humans, who would be mesa optimizers that optimize for other goals, and don't optimize for replication (e.g. by using birth control).

The paper has a lot more detail and analysis of what factors make mesa-optimization more likely, more dangerous, etc. You'll have to read the paper for all of these details. One general pattern is that, when using machine learning for some task X, there are a bunch of properties that affect the likelihood of learning heuristics or proxies rather than actually learning the optimal algorithm for X. For any such property, making heuristics/proxies more likely would result in a lower chance of mesa-optimization (since optimizers are less like heuristics/proxies), but conditional on mesa-optimization arising, makes it more likely that it is pseudo aligned instead of robustly aligned (because now the pressure for heuristics/proxies leads to learning a proxy mesa-objective instead of the true base objective).

Rohin's opinion: I'm glad this paper has finally come out. The concepts of mesa optimization and the inner alignment problem seem quite important, and currently I am most worried about x-risk caused by a misaligned mesa optimizer. Unfortunately, it is not yet clear whether mesa optimizers will actually arise in practice, though I think conditional on us developing AGI it is quite likely. Gradient descent is a relatively weak optimizer; it seems like AGI would have to be much more powerful, and so would require a learned optimizer (in the same way that humans can be thought of as "optimizers learned by evolution").

There still is a lot of confusion and uncertainty around the concept, especially because we don't have a good definition of "optimization". It also doesn't help that it's hard to get an example of this in an existing ML system -- today's systems are likely not powerful enough to have a mesa optimizer (though even if they had a mesa optimizer, we might not be able to tell because of how uninterpretable the models are).

Read more: Alignment Forum version

Technical AI alignment

Agent foundations

Selection vs Control (Abram Demski): The previous paper focuses on mesa optimizers that are explicitly searching across a space of possibilities for an option that performs well on some objective. This post argues that in addition to this "selection" model of optimization, there is a "control" model of optimization, where the model cannot evaluate all of the options separately (as in e.g. a heat-seeking missile, which can't try all of the possible paths to the target separately). However, these are not cleanly separated categories -- for example, a search process could have control-based optimization inside of it, in the form of heuristics that guide the search towards more likely regions of the search space.

Rohin's opinion: This is an important distinction, and I'm of the opinion that most of what we call "intelligence" is actually more like the "control" side of these two options.

Learning human intent

Imitation Learning as f-Divergence Minimization (Liyiming Ke et al) (summarized by Cody): This paper frames imitation learning through the lens of matching your model's distribution over trajectories (or conditional actions) to the distribution of an expert policy. This framing of distribution comparison naturally leads to the discussion of f-divergences, a broad set of measures including KL and Jenson-Shannon Divergences. The paper argues that existing imitation learning methods have implicitly chosen divergence measures that incentivize "mode covering" (making sure to have support anywhere the expert does) vs mode collapsing (making sure to only have support where the expert does), and that the latter is more appropriate for safety reasons, since the average between two modes of an expert policy may not itself be a safe policy. They demonstrate this by using a variational approximation of the reverse-KL distance as the divergence underlying their imitation learner.

Cody's opinion: I appreciate papers like these that connect peoples intuitions between different areas (like imitation learning and distributional difference measures). It does seem like this would even more strongly lead to lack of ability to outperform the demonstrator, but that's honestly more a critique of imitation learning more generally than this paper in particular.

Handling groups of agents

Social Influence as Intrinsic Motivation for Multi-Agent Deep RL (Natasha Jaques et al) (summarized by Cody): An emerging field of common-sum multi-agent research asks how to induce groups of agents to perform complex coordination behavior to increase general reward, and many existing approaches involve centralized training or hardcoding altruistic behavior into the agents. This paper suggests a new technique that rewards agents for having a causal influence over the actions of other agents, in the sense that the actions of the pair of agents agents have high mutual information. The authors empirically find that having even a small number of agents who act as "influencers" can help avoid coordination failures in partial information settings and lead to higher collective reward. In one sub-experiment, they only add this influence reward to the agents' communication channels, so agents are incentivized to provide information that will impact other agents' actions (this information is presumed to be truthful and beneficial since otherwise it would subsequently be ignored).

Cody's opinion: I'm interested by this paper's finding that you can generate apparently altruistic behavior by incentivizing agents to influence others, rather than necessarily help others. I also appreciate the point that was made to train in a decentralized way. I'd love to see more work on a less asymmetric version of influence reward; currently influencers and influencees are separate groups due to worries about causal feedback loops, and this implicitly means there's a constructed group of quasi-altruistic agents who are getting less concrete reward because they're being incentivized by this auxiliary reward.

Uncertainty

ICML Uncertainty and Robustness Workshop Accepted Papers (summarized by Dan H): The Uncertainty and Robustness Workshop accepted papers are available. Topics include out-of-distribution detection, generalization to stochastic corruptions, label corruption robustness, and so on.

Miscellaneous (Alignment)

To first order, moral realism and moral anti-realism are the same thing (Stuart Armstrong)

AI strategy and policy

Grover: A State-of-the-Art Defense against Neural Fake News (Rowan Zellers et al): Could we use ML to detect fake news generated by other ML models? This paper suggests that models that are used to generate fake news will also be able to be used to detect that same fake news. In particular, they train a GAN-like language model on news articles, that they dub GROVER, and show that the generated articles are better propaganda than those generated by humans, but they can at least be detected by GROVER itself.

Notably, they do plan to release their models, so that other researchers can also work on the problem of detecting fake news. They are following a similar release strategy as with GPT-2 (AN #46): they are making the 117M and 345M parameter models public, and releasing their 1.5B parameter model to researchers who sign a release form.

Rohin's opinion: It's interesting to see that this group went with a very similar release strategy, and I wish they had written more about why they chose to do what they did. I do like that they are on the face of it "cooperating" with OpenAI, but eventually we need norms for how to make publication decisions, rather than always following the precedent set by someone prior. Though I suppose there could be a bit more risk with their models -- while they are the same size as the released GPT-2 models, they are better tuned for generating propaganda than GPT-2 is.

Read more: Defending Against Neural Fake News

The Hacker Learns to Trust (Connor Leahy): An independent researcher attempted to replicate GPT-2 (AN #46) and was planning to release the model. However, he has now decided not to release, because releasing would set a bad precedent. Regardless of whether or not GPT-2 is dangerous, at some point in the future, we will develop AI systems that really are dangerous, and we need to have adequate norms then that allow researchers to take their time and evaluate the potential issues and then make an informed decision about what to do. Key quote: "sending a message that it is ok, even celebrated, for a lone individual to unilaterally go against reasonable safety concerns of other researchers is not a good message to send".

Rohin's opinion: I quite strongly agree that the most important impact of the GPT-2 decision was that it has started a discussion about what appropriate safety norms should be, whereas before there were no such norms at all. I don't know whether or not GPT-2 is dangerous, but I am glad that AI researchers have started thinking about whether and how publication norms should change.

Other progress in AI

Reinforcement learning

A Survey of Reinforcement Learning Informed by Natural Language (Jelena Luketina et al) (summarized by Cody): Humans use language as a way of efficiently storing knowledge of the world and instructions for handling new scenarios; this paper is written from the perspective that it would be potentially hugely valuable if RL agents could leverage information stored in language in similar ways. They look at both the case where language is an inherent part of the task (example: the goal is parameterized by a language instruction) and where language is used to give auxiliary information (example: parts of the environment are described using language). Overall, the authors push for more work in this area, and, in particular, more work using external-corpus-pretrained language models and with research designs that use human-generated rather than synthetically-generated language; the latter is typically preferred for the sake of speed, but the former has particular challenges we'll need to tackle to actually use existing sources of human language data.

Cody's opinion: This article is a solid and useful version of what I would expect out of a review article: mostly useful as a way to get thinking in the direction of the intersection of RL and language, and makes me more interested in digging more into some of the mentioned techniques, since by design this review didn't go very deep into any of them.

Deep learning

the transformer … “explained”? (nostalgebraist) (H/T Daniel Filan): This is an excellent explanation of the intuitions and ideas behind self-attention and the Transformer architecture (AN #44).

Ray Interference: a Source of Plateaus in Deep Reinforcement Learning (Tom Schaul et al) (summarized by Cody): The authors argue that Deep RL is subject to a particular kind of training pathology called "ray interference", caused by situations where (1) there are multiple sub-tasks within a task, and the gradient update of one can decrease performance on the others, and (2) the ability to learn on a given sub-task is a function of its current performance. Performance interference can happen whenever there are shared components between notional subcomponents or subtasks, and the fact that many RL algorithms learn on-policy means that low performance might lead to little data collection in a region of parameter space, and make it harder to increase performance there in future.

Cody's opinion: This seems like a useful mental concept, but it seems quite difficult to effectively remedy, except through preferring off-policy methods to on-policy ones, since there isn't really a way to decompose real RL tasks into separable components the way they do in their toy example

Meta learning

Alpha MAML: Adaptive Model-Agnostic Meta-Learning (Harkirat Singh Behl et al)

22