Risks from Learned Optimization: Introduction

Chris van Merwijk; Vlad Mikulik; Joar Skalse; Scott Garrabrant

[-]adamShimi5y150Review for 2019 Review

In “Why Read The Classics?”, Italo Calvino proposes many different definitions of a classic work of literature, including this one:

A classic is a book which has never exhausted all it has to say to its readers.

For me, this captures what makes this sequence and corresponding paper a classic in the AI Alignment literature: it keeps on giving, readthrough after readthrough. That doesn’t mean I agree with everything in it, or that I don’t think it could have been improved in terms of structure. But when pushed to reread it, I found again and again that I had missed or forgotten some nice argument, some interesting takeaway.

With that, a caveat: I’m collaborating with Evan Hubinger (one of the authors) on projects related to ideas introduced in this sequence, especially to Deceptive Alignment. I am thus probably biased positively about this work. That being said, I have no problem saying I disagree with collaborators, so I don’t think I’m too biased to write this review.

(Small point: I among other people tend to describe this sequence/paper as mainly Evan’s work, but he repeatedly told me that everyone participated equally, and that the names are in alphabetic order, not contribution order. So let’s keep that in mind)

Summary

Let’s start the review proper with a post by post summary (except for the conclusion):

(Introduction) This first post introduces the idea of mesa-optimizers, the learned optimizers from the title. A mesa-optimizer is an optimizer which is the result of a learning process, and it comes with the issue of inner alignment: how aligned is the objective of the mesa-optimizer (on which we don’t have direct control) with the objective of the base-optimizer that produced this mesa-optimizer?
The post then split the safety questions related to mesa-optimizer in two categories: understanding which conditions make mesa-optimizer appear; and understanding how aligned is the mesa-objective with the base-objective.
(Conditions for Mesa-Optimization) This post tackles the first category outlined in the introduction: how can mesa-optimizers be learned? The task can push towards mesa-optimization by asking for more generalization (which is probably easier to deliver through search), by requiring a compressed complex policy, or by requiring human modeling (which probably entails understanding optimization and search in some sense). The base-optimiser can push towards mesa-optimization if it is reachable (not surrounded by high-loss solutions), if the models considered have enough algorithmic range, or more generally through details of the inductive bias like statefulness and simplicity bias.
(The Inner Alignment Problem) This post tackles the second category outlined in the introduction: if a mesa-optimizer does appear, how aligned will it be with the base objective? The misalignment considered here is called pseudo-alignment: being aligned on the training distribution but not at deployment. The authors propose to split pseudo-alignment in three subcategories:
- Proxy alignment, where the mesa-objective is a proxy for the base-objective on the training distribution, but not necessarily elsewhere.
- Approximate alignment, where the difference comes from the inability of the mesa-optimizer to represent the base-objective, and thus it learns an approximation.
- Suboptimality alignment, where the mesa-objective is not the base-objective at all, but the mesa-optimizer makes decisions (through mistakes or deception) on the training distribution that fit with the base-objective even if it contradicts the mesa-objective.
The post also explores how the task and the base-optimizer can influence the apparition of pseudo-alignment assuming mesa-optimizers, and on which subcategory it falls.
(Deceptive Alignment) This post focuses on a specific instance of suboptimality alignment: deceptive alignment, where the mesa-optimizer is trying to deceive the base-optimizer in order to not be deployed and then change its behavior to pursue the mesa-objective.
Among other ideas, the discussion examines necessary conditions for deceptive alignment (objective across parameter updates, self-modeling as learned model, expect eventual deployment without modification), how training can reinforce deception, and whether making the deceptive system thinks it’s still in training might deal with the problem.

Value

What is new in this? After all, the idea that training on a reward/objective might result in a model that doesn’t generalize correctly is hardly newsworthy, and wasn’t in 2019.

What’s missing here is replacing this idea in the context of safety. I’m always worried about saying “This is the first place some concept has been defined/mentioned”. But it’s safe to say that a lot of AI Alignment resources prior to this sequence centered around finding the right objective. The big catastrophic scenarios came from issues like the Orthogonality Thesis and the fragility of value, for which the solution seems obviously to find the right objective, and maybe add/train for good properties like corrigibility. Yet both ML theory and practice already knew that issues didn’t stop there.

So the value of this sequence comes in recasting the known generalization problems from classic ML in the context of alignment, in a public and easily readable form. Remember, I’m hardly saying nobody knew about it in the AI Alignment community before that sequence. But it is hard to find well-read and cited posts and discussions about the subject predating this sequence. I for one didn’t really think about such issues before reading this sequence and starting to work with Evan.

The other big contribution of this sequence is the introduction of deceptive alignment. Considering deception from within the trained model during its training is similar to some other previous ideas about deception (for example boxed AI getting out), but to my knowledge this is the first full fledged argument for how this could appear from local search, and even be maintained and reinforced. So deceptive alignment can be seen as recasting a traditional AI risk in the more recent context of prosaic AGI.

Criticisms

One potential issue with the sequence is its use of optimizers (programs doing explicit internal search over policies) as the problematic learned models. It makes sense from the formal point of view, since this assumption simplifies the analysis of the corresponding mesa-optimizers, and allows a relatively straightforward definition of notions like mesa-objective and inner alignment.

Yet this assumption has been criticized by multiple researchers in the community. For example, RIchard Ngo argues that for the kind of models trained through local search (like neural networks), it’s not obvious what “doing internal search means. Others, like Tom Everitt, defend that systems not doing internal search should be included in the discussion of inner alignment.

I’m sympathetic to both criticisms and would like to see someone attempt a similar take without this assumption -- see the directions for further research below.

Another slight issue I have with this sequence comes from its density: some very interesting ideas end up getting lost in it. As one example, take the tradeoff from reducing time complexity, which helps to not create mesa-optimizer but increase the risk of pseudo-alignment if mesa-optimizers do appear. The first part is discussed in Conditions for Mesa-Optimization, and the second in The Inner Alignment Problem. But it’s deep inside the text -- there’s no way for a casual reader or a quick rereader to know it is here. I think this could have been improved, even if it’s almost nitpicking at this point.

Follow-up research

What was the influence of this sequence? Google scholar returns only 8 citations, but this is misguided -- most of the impact is on researchers who don’t publish papers that often. It seems more relevant to look at pingbacks from Alignment Forum posts. I count 62 such AF posts, not including the ones from the sequence itself (and without accounting for redundancy). That’s quite impressive.

Here is a choice of the most interesting from my perspective:

Abram Demski’s Selection vs Control, which crystallized an important dichotomy in how we think about optimizers
Adam Scholl’s Matt Botvinick on the spontaneous emergence of learning algorithms, who attempted to present an example of mesa-optimization, and sparked a big discussion about the meaning of the term, how surprising it should be, and even the need for more RL education in the AI Alignment community (see this comment thread for the “gist”).
Evan Hubinger’s Gradient Hacking, which expanded on the case with deceptive alignment where the trained system can influence only through its behavior what happens next in training. I think this is a big potential issue, which is why I’m investigating it with Evan.
Evan Hubinger’s Clarifying Inner alignment terminology, which replaced the term inner alignment in the context of mesa-optimizers (as defined initially in the sequence), and proposed a decomposition of the alignment problem.

Directions for further research

Mostly, I would be excited with two axes of research around this sequence:

Trying to break the arguments from this sequence. Either poking hole in them and showing why they might not work, or find reasonable assumptions for which they don’t work. Whether holes are found or no attacks breaks the reasoning, I think we will have learned quite a lot.
Trying to make the arguments in this sequence work without the optimization assumption for the learned models. I’m thinking either by assuming that the system will be well predicted by thinking of it as optimizing something, or through a more general idea of goal-directedness. (Evan is also quite interested in this project, so if it excites you, feel free to contact him!)

[-]tom4everitt7y*120

Thanks for the interesting post! I find the possibility of a gap between the base optimization objective and the mesa/behavioral objective convincing, and well worth exploring.

However, I'm less convinced that the distinction between the mesa-objective and the behavioral objective is real/important. You write:

Informally, the behavioral objective is the objective which appears to be optimized by the system’s behavior. More formally, we can operationalize the behavioral objective as the objective recovered from perfect inverse reinforcement learning (IRL).[4] This is in contrast to the mesa-objective, which is the objective actively being used by the mesa-optimizer in its optimization algorithm.

According to Dennett, many systems behave as if they are optimizing some objective. For example, a tree may behave as if optimizes the amount of sun that it can soak up with its leaves. This is a useful description of the tree, offering real predictive power. Whether there is some actual search process going on in the tree is not that important, the intentional stance is useful in either case.

Similarly, a fully trained DQN algorithm will behave as if it optimizes the score of the game, even though there is no active search process going on at a given time step (especially not if the network parameters are frozen). In neither of these example is it necessary to distinguish between mesa and behavior objectives.

At this point, you may object that the mesa objective will be more predictive "off training distribution". Perhaps, but I'm not so sure.

First, the behavioral objective may be predictive "off training distribution": For example, the DQN agent will strive to optimize reward as long as the Q-function generalizes.

Second, the mesa-objective may easily fail to be predictive off distribution. Consider a model-based RL agent with a learned model of the environment, that uses MCTS to predict the return of different policies. The mesa-objective is then the expected return. However, this objective may not be particularly predictive outside the training distribution, because the learned model may only make sense on the distribution.

So the behavioral objective may easily be predictive outside the training distribution, and the mesa-objective easily fail to be predictive.

While I haven't read the follow-up posts yet, I would guess that most of your further analysis would go through without the distinction between mesa and behavior objective. One possible difference is that you may need to be even more paranoid about the emergence of behavior objectives, since they can emerge even in systems that are not mesa-optimizing.

I would also like to emphasize that I really welcome this type of analysis of the emergence of objectives, not the least because it nicely complements my own research on how incentives emerge from a given objective.

[-]Vlad Mikulik7y70

Thanks for an insightful comment. I think your points are good to bring up, and though I will offer a rebuttal I’m not convinced that I am correct about this.

What’s at stake here is: describing basically any system as an agent optimising some objective is going to be a leaky abstraction. The question is, how do we define the conditions of calling something an agent with an objective in such a way to minimise the leaks?

Distinguishing the “this system looks like it optimises for X” from “this system internally uses an evaluation of X to make decisions” is useful from the point of view of making the abstraction more robust. The former doesn’t make clear what makes the abstraction “work”, and so when to expect it to fail. The latter will at least tell you what kind of failures to expect in the abstraction: places where the evaluation of X doesn’t connect to the rest of the system like it’s supposed to. In particular, you’re right that if the learned environment model doesn’t generalise, the mesa-objective won’t be predictive of behaviour. But that’s actually a prediction of taking this view. On the other hand, it is unclear if taking the behavioural view would predict that the system will change its behaviour off-distribution (partially, because it’s unclear what exactly grounds the similarities in behaviour on-distribution).

I think it definitely is useful to also think about the behavioural objective in the way you describe, because the later concerns we raise basically do also translate to coherent behavioural objectives. And I welcome more work trying to untangle these concepts from one another, or trying to dissolve any of them as unnecessary. I am just wary of throwing away seemingly relevant assumptions about internal structure before we can show they’re unhelpful.

Re: DQN

You’re also right to point out DQN as an interesting edge case. But I am actually unsure that DQN agents should be considered non-optimisers, in the sense that they do perform rudimentary optimisation: they take an argmax of the Q function. The Q function is regressed to the episode returns. If the learning goes well, the Q function is literally representing the agent’s objective (indeed, it’s not really selected to maximise return; its selected to be accurate at predicting return). Contrast this with e.g. policy optimisation trained agents, which are not supposed to directly represent an objective, but are supposed to score well on it. (Someone good at running RL experiments maybe should look into comparing the coherence of revealed preferences of DQN agents with PPO agents. I’d read that paper.)

[-]tom4everitt7y60

What’s at stake here is: describing basically any system as an agent optimising some objective is going to be a leaky abstraction. The question is, how do we define the conditions of calling something an agent with an objective in such a way to minimise the leaks?

Indeed, this is a super slippery question. And I think this is a good reason to stand on the shoulders of a giant like Dennett. Some of the questions he has been tackling are actually quite similar to yours, around the emergence of agency and the emergence of consciousness.

For example, does it make sense to say that a tree is *trying to* soak up sun, even though it doesn't have any mental representation itself? Many biologists would hesitate to use such language other than metaphorically.

In contrast, Dennett's answer is yes: Basically, it doesn't matter if the computation is done by the tree, or by the evolution that produced the tree. In either case, it is right to think of the tree as an agent. (Same goes for DQN, I'd say.)

There are other situations where the location of the computation matters, such as for consciousness, and for some "self-reflective" skills that may be hard to pre-compute.

Basically, I would recommend looking closer at Dennett to

avoid reinventing the wheel (more than necessary), and
connect to his terminology (since he's so influential).

He's a very lucid writer, so quite a joy to read him really. His most recent book Bacteria to Bach summarizes and references a lot of his earlier work.

I am just wary of throwing away seemingly relevant assumptions about internal structure before we can show they’re unhelpful.

Yes, starting with more assumptions is often a good strategy, because it makes the questions more concrete. As you say, the results may potentially generalize.

But I am actually unsure that DQN agents should be considered non-optimisers, in the sense that they do perform rudimentary optimisation: they take an argmax of the Q function.

I see, maybe PPO would have been a better example.

[-]Vlad Mikulik7y30

I’ve been meaning for a while to read Dennett with reference to this, and actually have a copy of Bacteria to Bach. Can you recommend some choice passages, or is it significantly better to read the entire book?

P.S. I am quite confused about DQN’s status here and don’t wish to suggest that I’m confident it’s an optimiser. Just to point out that it’s plausible we might want to call it one without calling PPO an optimiser.

P.P.S.: I forgot to mention in my previous comment that I enjoyed the objective graph stuff. I think there might be fruitful overlap between that work and the idea we’ve sketched out in our third post on a general way of understanding pseudo-alignment. Our objective graph framework is less developed than yours, so perhaps your machinery could be applied there to get a more precise analysis?

[-]tom4everitt7y40

Chapter 4 in Bacteria to Bach is probably most relevant to what we discussed here (with preceding chapters providing a bit of context).

Yes, it would interesting to see if causal influence diagrams (and the inference of incentives) could be useful here. Maybe there's a way to infer the CID of the mesa-optimizer from the CID of the base-optimizer? I don't have any concrete ideas at the moment -- I can be in touch if I think of something suitable for collaboration!

[-]Rohin Shah6y110

More formally, we can operationalize the behavioral objective as the objective recovered from perfect inverse reinforcement learning (IRL).

Just want to note that I think this is extremely far from a formal definition. I don't know what perfect IRL would be. Does perfect IRL assume that the agent is perfectly optimal, or can it have biases? How do you determine what the action space is? How do you break ties between reward functions that are equally good on the training data?

I get that definitions are hard -- the main thing bothering me here is the "more formally" phrase, not the definition itself. This gives it a veneer of precision that it really doesn't have.

(I'm pedantic about this because similar implied false precision about the importance of utility functions confused me for half a year.)

[-]Vlad Mikulik6y60

You’re completely right; I don’t think we meant to have ‘more formally’ there.

[-]johnswentworth5y100Nomination for 2019 Review

So, this was apparently in 2019. Given how central the ideas have become, it definitely belongs in the review.

[-]abramdemski7y100

I wrote something which is sort of a reply to this post (although I'm not really making a critique or any solid point about this post, just exploring some ideas which I see as related).

[-]Rohin Shah5y90Nomination for 2019 Review

I struggled a bit on deciding whether to nominate this sequence.

On the one hand, it brought a lot more prominence to the inner alignment problem by making an argument for it in a lot more detail than had been done before.

On the other hand, on my beliefs, the framework it presents has an overly narrow view of what counts as inner alignment, relies on a model of AI development that I do not think is accurate, causes people to say "but what about mesa optimization" in response to any advance that doesn't involve mesa optimization even if the advance is useful for other reasons, has led to significant confusion over what exactly does and does not count as mesa optimization, and tends to cause people to take worse steps in choosing future research topics. (I expect all of these claims will be controversial.)

Still, that the conversation is happening at all is a vast improvement over the previous situation of relative (public) silence on the problem. Saying a bunch of confused thoughts is often the precursor to an actual good understanding of a topic. As such I've decided to nominate it for that contribution.

[-]evhub5y80

I think I can guess what your disagreements are regarding too narrow a conception of inner alignment/mesa-optimization (that the paper overly focuses on models mechanistically implementing optimization), though I'm not sure what model of AI development it relies that you don't think is accurate and would be curious for details there. I'd also be interested in what sorts of worse research topics you think it has tended to encourage (on my view, I think this paper should make you more excited about directions like transparency and robustness and less excited about directions involving careful incentive/environment design). Also, for the paper giving people a “but what about mesa-optimization” response, I'm imagining you're referring to things like this post, though I'd appreciate some clarification there as well.

[-]Rohin Shah5y70

As a preamble, I should note that I'm putting on my "critical reviewer" hat here. I'm not intentionally being negative -- I am reporting my inside-view beliefs on each question -- but as a general rule, I expect these to be biased negatively; someone looking at research from the outside doesn't have the same intuitions for its utility and so will usually inside-view underestimate its value.

This is also all things I'm saying with the benefit of hindsight, idk what I would have said at the time the sequence was published. I'm not trying to be "fair" to the sequence here, that is, I'm not considering what it would have been reasonable to believe at the time.

the paper overly focuses on models mechanistically implementing optimization

Yup, that's right.

I'm not sure what model of AI development it relies that you don't think is accurate

There seems to be an implicit model that when you do machine learning you get out a complicated mess of a neural net that is hard to interpret, but at its core it still is learning something akin to a program, and hence concepts like "explicit (mechanistic) search algorithm" are reasonable to expect. (Or at least, that this will be true for sufficiently intelligent AI systems.)

I don't think this model (implicit claim?) is correct. (For comparison, I also don't think this model would be correct if applied to human cognition.)

worse research topics you think it has tended to encourage

A couple of examples:

Attempting to create an example of a learned mechanistic search algorithm (I know of at least one proposal that was trying to do this)
Of your concrete experiments, I don't expect to learn anything of interest from the first two (they aren't the sort of thing that would generalize from small environments to large environments); I like the third; the fourth and fifth seem like interesting AI research but I don't think they'd shed light on mesa-optimization / inner alignment or its solutions.

I think this paper should make you more excited about directions like transparency and robustness and less excited about directions involving careful incentive/environment design

I agree with this. Maybe people have gotten more interested in transparency as a result of this paper? That seems plausible.

I'm imagining you're referring to things like this post,

Actually, not that one. This is more like "why are you working on reward learning -- even if you solved it we'd still be worried about mesa optimization". Possibly no one believes this, but I often feel like this implication is present. I don't have any concrete examples at the moment; it's possible that I'm imagining it where it doesn't exist, or that this is only a fact about how I interpret other people rather than what they actually believe.

[-]ESRogs7y80

Very clear presentation! As someone outside the field who likes to follow along, I very much appreciate these clear conceptual frameworks and explanations.

I did however get slightly lost in section 1.2. At first reading I was expecting this part:

which we will contrast with the outer alignment problem of eliminating the gap between the base objective and the intended goal of the programmers.

to say, "... gap between the behavioral objective and the intended goal of the programmers." (In which case the inner alignment problem would be a subcomponent of the outer alignment problem.)

On second thought, I can see why you'd want to have a term just for the problem of making sure the base objective is aligned. But to help myself (and others who think similarly) keep this all straight, do you have a pithy term for "the intended goal of the programmers" that's analogous to base objective, mesa objective, and behavioral objective?

Would meta objective be appropriate?

(Apologies if my question rests on a misunderstanding or if you've defined the term I'm looking for somewhere and I've missed it.)

[-]evhub7y*60

I don't have a good term for that, unfortunately—if you're trying to build an aligned AI, "human values" could be the right term, though in most cases you really just want "move one strawberry onto a plate without killing everyone," which is quite a lot less than "optimize for all human values." I could see how meta-objective might make sense if you're thinking about the human as an outside optimizer acting on the system, though I would shy away from using that term like that, as anyone familiar with meta-learning will assume you mean the objective of a meta-learner instead.

Also, the motivation for choosing outer alignment as the alignment problem between the base objective and the goals of the programmers was to capture the "classical" alignment problem as it has sometimes previously been envisioned, wherein you just need to specify an aligned set of goals and then you're good. As we argue, though, mesa-optimization means that you need more than just outer alignment—if you have mesa-optimizers, you also need inner alignment, as even if your base objective is perfectly aligned, the resulting mesa-objective (and thus the resulting behavioral objective) might not be.

[-]ESRogs7y30

Got it, that's helpful. Thank you!

[-]Rohin Shah7y50

Phrases I've used: [intended/desired/designer's] [objective/goal]

I think "designer's objective" would fit in best with the rest of the terminology in this post, though "desired objective" is also good.

[-]DanielFilan7y80

Another example of trained optimisers that is imo worth checking out is Value Iteration Networks.

[-]Vika6y70

I'm confused about the difference between a mesa-optimizer and an emergent subagent. A "particular type of algorithm that the base optimizer might find to solve its task" or a "neural network that is implementing some optimization process" inside the base optimizer seem like emergent subagents to me. What is your definition of an emergent subagent?

[-]evhub6y*50

I think my concern with describing mesa-optimizers as emergent subagents is that they're not really "sub" in a very meaningful sense, since we're thinking of the mesa-optimizer as the entire trained model, not some portion of it. One could describe a mesa-optimizer as a subagent in the sense that it is "sub" to gradient descent, but I don't think that's the right relationship—it's not like the mesa-optimizer is some subcomponent of gradient descent; it's just the trained model produced by it.

The reason we opted for "mesa" is that I think it reflects more of the right relationship between the base optimizer and the mesa-optimizer, wherein the base optimizer is "meta" to the mesa-optimizer rather than the mesa-optimizer being "sub" to the base optimizer.

Furthermore, in my experience, when many people encounter "emergent subagents" they think of some portion of the model turning into an agent and (correctly) infer that something like that seems very unlikely, as it's unclear why such a thing would actually be advantageous for getting a model selected by something like gradient descent (unlike mesa-optimization, which I think has a very clear story for why it would be selected for). Thus, we want to be very clear that something like that is not the concern being presented in the paper.

[-]Jan Kulveit6y40

I don't see why portion of a system turning into an agent would be "very unlikely". In a different perspective, if the system lives in something like an evolutionary landscape, there can be various basins of attraction which lead to sub-agent emergence, not just mesa-optimisation.

[-]Raemon7y70

Pedagogical-comment – I find it much easier to fit a new term into my vocabulary and models when I have an explanation of why that term was chosen (even if it was sort of idiosyncratic or arbitrary). Why "mesa-optimization"?

[-]evhub7y80

The word mesa is Greek meaning into/inside/within, and has been proposed as a good opposite word to meta, which is Greek meaning about/above/beyond. Thus, we chose mesa based on thinking about mesa-optimization as conceptually dual to meta-optimization—whereas meta is one level above, mesa is one level below.

[-]Raemon7y20

Thanks!

[-]DanielFilan5y60Review for 2019 Review

[NB: this is a review of the paper, which I have recently read, not of the post series, which I have not]

For a while before this paper was published, several people in AI alignment had discussed things like mesa-optimization as serious concerns. That being said, these concerns had not been published in their most convincing form in great details. The two counterexamples that I’m aware of are the posts What does the universal prior actually look like? by Paul Christiano, and Optimization daemons on Arbital. However, the first post only discussed the issue in the context of Solomonoff induction, where the dynamics are somewhat different, and the second is short and hard to discover.

I see the value in this paper as taking these concerns, laying out (a) a better (altho still imperfectly precise) concretization of what the object of concern is and (b) how it could happen, and putting it in a discoverable and citable format. By doing so, it moves the discussion forward by giving people something concrete to actually reason and argue about.

I am relatively convinced that mesa-optimization (somewhat more broadly construed than in the paper, see below) is a problem for AI alignment, and I think the arguments in the paper are persuasive enough to be concerning. I think the weakest argument is in the deceptive alignment section: it is not really made clear why mesa-optimizers would have objectives that extend across parameter updates.

As I see it, the two biggest flaws with the paper are: Its heuristic nature. The arguments given do not reach the certainty of proofs, and no experimental evidence is provided. This means that one can have at most provisional confidence that the arguments are correct and that the concerns are real (which is not to imply that certainty is required to warrant concern and further research). Premature formalization. I do not believe that we have a great characterization of optimization, and as adamShimi points out, it’s not at all clear that search is the right abstraction to use.

Overall, I see the paper as sketching out a research paradigm that I hope to see fleshed out.

[-]Ben Pace5y50Nomination for 2019 Review

I know it’s already been nominated twice, but I still want to nominate it again. This sequence (I’m nominating the sequence) helped me think clearly about optimization, and how delegation works between an optimizer and mesa-optimizer, and what constraints lie between them (e.g. when does an optimizer want a system it’s developing to do optimization?). Changed a lot of the basic ways in which I think about optimization and AI.

[-]Richard_Ngo7y50

We will say that a system is an optimizer if it is internally searching through a search space (consisting of possible outputs, policies, plans, strategies, or similar) looking for those elements that score high according to some objective function that is explicitly represented within the system.

I appreciate the difficulty of actually defining optimizers, and so don't want to quibble with this definition, but am interested in whether you think humans are a central example of optimizers under this definition, and if so whether you think that most mesa-optimizers will "explicitly represent" their objective functions to a similar degree that humans do.

[-]Vlad Mikulik7y40

I think humans are fairly weird because we were selected for an objective that is unlikely to be what we select for in our AIs.

That said, if we model AI success as driven by model size and compute (with maybe innovations in low-level architecture), then I think that the way humans represent objectives is probably fairly close to what we ought to expect.

If we model AI success as mainly innovative high-level architecture, then I think we will see more explicitly represented objectives.

My tentative sense is that for AI to be interpretable (and safer) we want it to be the latter kind, but given enough compute the former kind of AI will give better results, other things being equal.

Here, what I mean by low-level architecture is something like “we’ll use lots of LSTMs instead of lots of plain RNNs, but keep the model structure simple: plug in the inputs, pass it through some layers, and read out the action probabilities”, and high-level is something like “let’s organise the model using this enormous flowchart with all of these various pieces that each are designed to take a particular role; here’s the observation embedding, here’s the search in latent model space, here’s the ...”

[-]Steven Byrnes7y50

This paper replaces a normal feedforward image classifier with a mesa-optimizing one (build generative models of different possibilities and pick the one that best matches the data). The result was better and far more human-like than a traditional image classifier, e.g. the same examples are ambiguous to the model that are ambiguous to humans and vice-versa. I also understand that the human brain is very big into generative modeling of everything. So I expect that ML systems of the future will approach 100% mesa-optimizers, while non-optimizing feedforward NN's will become rare. This post is a good framework and I'm looking forward to follow-ups!

[-]Rohin Shah7y70

I would not call that mesa-optimization and would not take it as evidence that mesa-optimization is the "default" for powerful ML systems. That paper has a model with subagents where each subagent does optimization. Ways in which this is a different thing:

Given an input, a mesa-optimizer would only run on that input once; in the case of this model there are 10 different optimizations happening in order to classify each digit.
The base objective is "correctly map an image of a digit to its label"; the objective of the dth optimizer in the model is "Evidence Lower Bound (ELBO) on the log likelihood of the image as evaluated by a generative model for the digit d". The model optimizers' objectives are not of the right type signature and don't agree with the base objective on the training distribution, as would be the case with a mesa optimizer.

Note that I do think mesa-optimization will be common; I just don't think that that paper is evidence for the claim.

[-]Ben Pace5y30Review for 2019 Review

For me, this is the paper where I learned to connect ideas about delegation to machine learning. The paper sets up simple ideas of mesa-optimizers, and shows a number of constraints and variables that will determine how the mesa-optimizers will be developed – in some environments you want to do a lot of thinking in advance then delegate execution of a very simple algorithm to do your work (e.g. this simple algorithm Critch developed that my group house uses to decide on the rent for each room), and in some environments you want to do a little thinking and then delegate a very complex algorithm to figure out what to do (e.g. evolution is very stupid and then makes very complex brains to figure out what to do in lots of situations that humans encountered in the EEA).

Seeing this more clearly in ML shocked me with the level of inadequacy that ML has for being able to do this with much direction whatsoever. It just doesn't seem like something that we have much control of. Of course I may be wrong, and there are some simple proposals (though that have not worked so far). Nonetheless, it's a substantial step forward in discussing delegation in modern ML systems. It discusses lots of related ideas very clearly.

Definitely should be included in the review. I expect to vote on this with something like +5 to +8.

I don't do research in this area, I expect others like Daniel Filan and Adam Shimi will have more detailed opinions of the sequence's strengths and weaknesses. (Nonetheless I stand by my assessment and will vote accordingly.)

[-]Sam F. Brown4y10

This might be unwelcome nit-picking, but I find it kind of jarring to read "meta is Greek for above, mesa is Greek for below." That's not quite right, μετα is more like 'after' in "turn right after the bridge" and μεσα is more like 'within' (μεσο is like 'middle', as in 'mesoscale'). Above/below could be something like άνω/κάτω (like anode/cathode).

I think the meta/mesa has nice symmetry, and the name is now well-known, but maybe this particular sentence could be made less wrong :p

Also the bibliography link #7 for "What is the opposite of meta?" seems broken for me.

[-]evhub4y40

Sure—I just edited it to be maybe a bit less jarring for those who know Greek.

As a concrete example of what a neural network optimizer might look like, consider TreeQN.(2) TreeQN, as described in Farquhar et al., is a Q-learning agent that performs model-based planning (via tree search in a latent representation of the environment states) as part of its computation of the Q-function. Though their agent is an optimizer by design, one could imagine a similar algorithm being learned by a DQN agent with a sufficiently expressive approximator for the Q function. Universal Planning Networks, as described by Srinivas et al.,(3) provide another example of a learned system that performs optimization, though the optimization there is built-in in the form of SGD via automatic differentiation. However, research such as that in Andrychowicz et al.(4) and Duan et al.(5) demonstrate that optimization algorithms can be learned by RNNs, making it possible that a Universal Planning Networks-like agent could be entirely learned—assuming a very expressive model space—including the internal optimization steps. Note that while these examples are taken from reinforcement learning, optimization might in principle take place in any sufficiently expressive learned system. ↩︎
Previous work in this space has often centered around the concept of “optimization daemons,”(6) a framework that we believe is potentially misleading and hope to supplant. Notably, the term “optimization daemon” came out of discussions regarding the nature of humans and evolution, and, as a result, carries anthropomorphic connotations. ↩︎
The duality comes from thinking of meta-optimization as one layer above the base optimizer and mesa-optimization as one layer below. ↩︎
That being said, some of our considerations do still apply even in that case. ↩︎
Leike et al.(8) introduce the concept of an objective recovered from perfect IRL. ↩︎
For the formal construction of this objective, see pg. 6 in Leike et al.(8) ↩︎
This objective is by definition trivially optimal in any situation that the bottlecap finds itself in. ↩︎
Ultimately, our worry is optimization in the direction of some coherent but unsafe objective. In this sequence, we assume that search provides sufficient structure to expect coherent objectives. While we believe this is a reasonable assumption, it is unclear both whether search is necessary and whether it is sufficient. Further work examining this assumption will likely be needed. ↩︎
The situation with evolution is more complicated than is presented here and we do not expect our analogy to live up to intense scrutiny. We present it as nothing more than that: an evocative analogy (and, to some extent, an existence proof) that explains the key concepts. More careful arguments are presented later. ↩︎
Of course, it might also fail to generalize at all. ↩︎

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

58

Risks from Learned Optimization: Introduction

58

Motivation

Two questions

1.1. Base optimizers and mesa-optimizers

1.2. The inner and outer alignment problems

1.3. Robust alignment vs. pseudo-alignment

1.4. Mesa-optimization as a safety problem