Clarifying mesa-optimization

Pierre Peigné

Overall, strong upvote. I like this post a lot, and these seem like good updates you've made.

we think that mesa-optimizers will primarily use a complicated stack of heuristics that takes elements from different clean optimization procedures. In the future, these internal heuristics might be combined with external optimization procedures like calculators or physics engines. This is similar to how humans that play chess don't actually run a tree-search of depth n with alpha-beta pruning in their heads.

I agree. Heuristic-free search seems very inefficient and inappropriate for real-world intelligence.
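To make the contrast concrete, here is a minimal sketch (my own illustration, not something from the post) of a "clean" procedure next to a heuristic, on a toy game. The game is assumed to be single-pile Nim: players alternate taking 1 to 3 stones, and whoever takes the last stone wins. The names `alphabeta`, `search_move`, and `heuristic_move` are hypothetical.

```python
def alphabeta(pile, depth, alpha=-1, beta=1):
    """Depth-limited negamax with alpha-beta pruning.

    Returns the value for the player to move:
    +1 = forced win, -1 = forced loss, 0 = unknown at the depth cutoff.
    """
    if pile == 0:
        return -1  # the opponent just took the last stone, so the mover lost
    if depth == 0:
        return 0   # search budget exhausted, no evaluation available
    best = -1
    for take in (1, 2, 3):
        if take > pile:
            break
        value = -alphabeta(pile - take, depth - 1, -beta, -alpha)
        best = max(best, value)
        alpha = max(alpha, value)
        if alpha >= beta:
            break  # prune: the opponent would never allow this line
    return best


def search_move(pile, depth):
    """'Clean' procedure: pick a move by explicit tree search."""
    best_move, best_value = None, -2
    for take in (1, 2, 3):
        if take > pile:
            break
        value = -alphabeta(pile - take, depth - 1)
        if value > best_value:
            best_move, best_value = take, value
    return best_move


def heuristic_move(pile):
    """Heuristic stand-in: 'leave the opponent a multiple of 4', no search."""
    take = pile % 4
    return take if take else 1  # no winning move exists, so just take one


if __name__ == "__main__":
    for pile in range(1, 13):
        print(pile, search_move(pile, depth=pile), heuristic_move(pile))
```

In this toy game the one-line heuristic picks the same moves as full-depth search at a tiny fraction of the cost. That is the flavor of shortcut I would expect learned systems to find, although real heuristics will be far messier than a single modular-arithmetic rule.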

we think it will be much harder to learn something about search in a toy model and transfer that to a larger model because the kind of mesa-optimization is much more messy and diverse than this hypothesis assumes.

I agree, but only as an argument against direct insight transfer from toy to real-world models. If you don't yet know how to do anything at all toward understanding how, e.g., an adult would plan a real-world takeover, then start simple, IMO.

Second, we expect that when general-purpose models like GPT-3 are playing chess, they do not call an internal optimizer. Instead, they might apply heuristics that either have small components of optimization procedures or are approximations of aspects of explicit optimization. We expect that most of the decisions will come from highly refined heuristics learned from the training data.

First, thanks for making falsifiable predictions. Strong upvote for that. Second, I agree with this point. See also my made-up account of what might happen in a kid's brain when he decides to wander away from his distracting friends. (It isn't explicit search.)

However, I expect there to be something like generally useful predictive and behavior-modifying circuits (aliased to "general-purpose problem-solving module", perhaps), which get subroutine-called by many different value shards, even though I think those subroutines are not going to be MCTS.

On a more personal note, thinking about this post made us more hopeful that mesa-optimization increases gradually and we thus get a bit of time to study it before it is too powerful but it made us more pessimistic about finding general tools that can tell us whether the model is currently doing mesa-optimization.

I feel only somewhat interested in "how much mesa-optimization is happening?", and more interested in "what kinds of cognitive work are being done, and how, and towards what ends?" (i.e., what are the agent's values, and how well are they being worked towards?)

By "clean" we mean something like "could easily be implemented in a Python program" or "applies the same simple step over and over again", as opposed to an "unclean" heuristic that is hard to put into a formal framework.

AI ALIGNMENT FORUM
AF

