Why GPT wants to mesa-optimize & how we might change this

by John Maxwell · 8 min read · 19th Sep 2020 · 17 comments

This post was inspired by orthonormal's post Developmental Stages of GPTs and the discussion that followed, so only part of it is original.

First I'll aim to provide a crisper version of the argument for why GPT wants to mesa-optimize. Specifically, I'll explain a well-known optimization algorithm used in text generation, and argue that GPT can improve performance on its objective by learning to implement something like this algorithm internally.

Then I'll offer some ideas of mine about how we might change this.

Explanation of beam search

Our goal is to generate plausible text. We evaluate whether text is "plausible" by multiplying together all the individual word probabilities from our language model.
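As a minimal sketch of this scoring rule (the function names and the example probabilities below are made up for illustration, not taken from any real model), plausibility is just a product, and in practice one would usually sum log-probabilities instead to avoid numerical underflow on long texts:

```python
import math

def plausibility(word_probs):
    """Product of the individual word probabilities."""
    return math.prod(word_probs)

def log_plausibility(word_probs):
    """Sum of log-probabilities: same ranking as the product, but numerically safer."""
    return sum(math.log(p) for p in word_probs)

# Hypothetical per-word probabilities for a three-word completion.
example_probs = [0.5, 0.4, 0.9]
score = plausibility(example_probs)
log_score = log_plausibility(example_probs)
```

Since the logarithm is monotonic, ranking completions by log_plausibility gives the same order as ranking them by plausibility.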

Greedy word selection has a problem: Since it doesn't do lookahead, it's liable to get stuck in a dead end. Let's say we give our system the following poem about cheeses and ask it to generate more text:

Mozzarella is white

So you can see it at night

Cheddar is...

If our language model is decent, the word it will assign the highest probability to is "orange". But this creates a problem, because "orange" is a hard word to rhyme.

Beam search is an attempt to solve this problem. Instead of picking the next word greedily, we explore the tree of completions and try to find a multi-word completion that maximizes the product of the individual word probabilities.

Because there are so many words in the English language, the tree grows at a very fast exponential rate. So we choose an integer beam_width for the number of partial completions to track, and each time we take another step deeper into the tree, we discard all but the most plausible beam_width partial completions.

Beam search with a beam width of 2. The bold path corresponds to the maximum-plausibility completion, which would not get discovered by greedy search because "nice" has a higher probability than "dog". Image stolen from this Hugging Face blog post, which has another explanation of beam search if you didn't like mine.
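Here is a rough sketch of the procedure in code. The toy "language model" is a hypothetical lookup table whose numbers I picked to mirror the situation in the figure ("nice" is more likely than "dog" as the next word, but "dog" leads to a more plausible completion). It is not a real model, and the function names are my own.

```python
import math

# Hypothetical toy model: context (tuple of words) -> next-word probabilities.
TOY_MODEL = {
    ("The",): {"nice": 0.5, "dog": 0.4, "car": 0.1},
    ("The", "nice"): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("The", "dog"): {"has": 0.9, "and": 0.05, "runs": 0.05},
}

def next_word_probs(words):
    """Look up the toy next-word distribution (empty if the context is unknown)."""
    return TOY_MODEL.get(tuple(words), {})

def beam_search(prompt, steps, beam_width):
    """Return up to beam_width (plausibility, words) pairs, most plausible first."""
    beams = [(0.0, list(prompt))]  # (log-plausibility, words so far)
    for _ in range(steps):
        candidates = [
            (score + math.log(p), words + [word])
            for score, words in beams
            for word, p in next_word_probs(words).items()
        ]
        if not candidates:
            break
        # Discard all but the most plausible beam_width partial completions.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return [(math.exp(score), words) for score, words in beams]

greedy = beam_search(["The"], steps=2, beam_width=1)  # beam width 1 is greedy search
wide = beam_search(["The"], steps=2, beam_width=2)
```

With beam_width=1 the search commits to "nice" and never sees the more plausible "The dog has"; with beam_width=2 it keeps "dog" alive long enough to find it.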

Claim: GPT can do better on its training objective if it learns to do beam search internally

We've discussed text generation with a pretrained language model. Let's switch gears and talk about the model's training process.

Suppose GPT's training corpus has the following poem:

Mozzarella is white

So you can see it at night

Cheddar is marigold

Unless you let it get too old

GPT is trained by giving it some text and asking it to predict the next word. So eventually GPT will be given the example from above

Mozzarella is white

So you can see it at night

Cheddar is...

and be asked to predict the next word.

Let's consider the performance of two models on this task: regular "naive" GPT, and "beam search amplified" GPT. Beam search amplified GPT works by performing beam search using naive GPT, then looking at the distribution of the first words in the resulting completions, then outputting some weighted average of that distribution and the distribution from naive GPT.
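To make "beam search amplified GPT" concrete, here is one possible sketch. Several details are assumptions on my part: the first-word distribution weights each completion by its plausibility (one could also just count completions), the mixing weight is 0.5, and the toy model and its numbers are fictional stand-ins for naive GPT.

```python
import math
from collections import defaultdict

def beam_search(next_word_probs, prompt, steps, beam_width):
    """Return up to beam_width (plausibility, words) pairs, most plausible first."""
    beams = [(0.0, list(prompt))]
    for _ in range(steps):
        candidates = [
            (score + math.log(p), words + [word])
            for score, words in beams
            for word, p in next_word_probs(words).items()
        ]
        if not candidates:
            break
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return [(math.exp(score), words) for score, words in beams]

def amplified_distribution(next_word_probs, prompt, steps, beam_width, mix=0.5):
    """Weighted average of the naive next-word distribution and the
    distribution of first words among beam search completions."""
    naive = next_word_probs(prompt)
    first_word_mass = defaultdict(float)
    for plaus, words in beam_search(next_word_probs, prompt, steps, beam_width):
        if len(words) > len(prompt):
            first_word_mass[words[len(prompt)]] += plaus
    total = sum(first_word_mass.values())
    if total == 0:
        return dict(naive)
    return {w: (1 - mix) * p + mix * first_word_mass[w] / total
            for w, p in naive.items()}

def toy_poem_model(words):
    """Fictional stand-in for naive GPT on the cheese poem."""
    last = words[-1]
    if last == "is":
        return {"orange": 0.6, "marigold": 0.25, "yellow": 0.15}
    if last == "marigold":  # easy to rhyme: one continuation dominates
        return {"<rhyming line>": 0.8, "<other line>": 0.2}
    if last in ("orange", "yellow"):  # hard to rhyme: probability is spread thin
        return {f"<other line {i}>": 0.1 for i in range(10)}
    return {}

prompt = ["Cheddar", "is"]
naive = toy_poem_model(prompt)
amplified = amplified_distribution(toy_poem_model, prompt, steps=2, beam_width=2)
```

In this toy setup the amplified distribution shifts probability from "orange" towards "marigold", because the most plausible two-step completion starts with "marigold".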

Because beam search can find lots of ways to continue the poem using "marigold", but few ways using "orange", beam search amplified GPT's distribution ends up being closer to reality than that of naive GPT:

Distributions are fictional.

So when we update GPT's weights during training, we're shifting the weights towards the sort of computational structure that would make predictions like beam search amplified GPT does.

Does this actually help?

In this instance, GPT has an incentive to do internal lookahead. But it's unclear how frequently these situations actually arise. And maybe it's usually easier to do something else, like learning which words are easy to rhyme.

It would be straightforward to implement beam search amplified GPT (experimenting with different weighted averaging schemes) and check whether it can be made to assign higher plausibility to real text. (It might be best to try with GPT-2 rather than GPT-3, in case GPT-3 is already doing internal lookahead. Note that there's a risk of mesa-optimization developing if lookahead improves performance at any point during GPT's training.)
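The evaluation half of that experiment might look something like the sketch below. It assumes you have already computed, for each held-out position, the naive next-word distribution and the beam search first-word distribution. The three examples here are fabricated toy data, and linear mixing is just one of the possible weighted averaging schemes.

```python
import math

def mix(naive, beam, weight):
    """Linear interpolation between two next-word distributions."""
    words = set(naive) | set(beam)
    return {w: (1 - weight) * naive.get(w, 0.0) + weight * beam.get(w, 0.0)
            for w in words}

def mean_loss(examples, weight, eps=1e-12):
    """Average negative log-probability assigned to the true next word."""
    total = 0.0
    for naive, beam, true_word in examples:
        p = mix(naive, beam, weight).get(true_word, 0.0)
        total += -math.log(max(p, eps))  # eps guards against log(0)
    return total / len(examples)

# Fabricated (naive distribution, beam first-word distribution, true word) triples.
EXAMPLES = [
    ({"orange": 0.6, "marigold": 0.25, "yellow": 0.15},
     {"marigold": 0.77, "orange": 0.23}, "marigold"),
    ({"the": 0.7, "a": 0.3}, {"the": 0.5, "a": 0.5}, "the"),
    ({"cat": 0.5, "dog": 0.5}, {"dog": 1.0}, "cat"),
]

losses = {w: mean_loss(EXAMPLES, w) for w in (0.0, 0.25, 0.5, 0.75, 1.0)}
best_weight = min(losses, key=losses.get)
```

A weight of 0.0 recovers naive GPT, so lookahead only "helps" in this sense if some positive weight achieves lower loss on real text than weight 0.0 does.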

Is internal lookahead possible for GPT-3?

Relative to other optimization algorithms, it seems to me that beam search would be unusually easy for GPT to implement. Traditional iterative optimization algorithms like gradient descent or simulated annealing require a lot of serial computation, and the number of serial steps GPT can perform is strongly limited. Beam search is way less heavy on the number of serial steps required. The number of available serial steps would still limit the maximum lookahead horizon though.

The transformer architecture learns computations of the form "find some data from the previous step which scores highly according to particular criteria, do some computation on it, pass it on to the next step". That sounds like beam search.

In any case, the topic of what incentives arise while training a language model seems important more generally.

Is internal lookahead dangerous?

If GPT's architecture is capable of discovering lookahead internally, the worry is that GPT might modify and misuse it in creative ways after it's discovered. It might start making plans, or searching for the idea that maximizes some attribute which is correlated with harm.

Let's say there are chess problems in GPT's training corpus which describe a board state along with an objective like "black to move and win in 6 turns even with best play by white". If GPT can do lookahead internally, it can use this to search for game histories where black wins even though white is playing very well. In other words, it's doing spontaneous internal planning. And this spontaneous internal planning is incentivized because it helps predict solutions to chess problems.

Who knows what other contexts spontaneous internal planning might get used in.

Fix idea #1: Switch to BERT style training

How might we remove the incentive for mesa-optimization?

A simple idea is to stop training the task of predicting the next word, and instead train the task of predicting a masked word given the surrounding context. This is what BERT uses. The incentive for internal lookahead seems smaller with this task, but I guess you'd still see it in e.g. predicting masked chess moves of strong players.

BERT's ability to generate text is unclear. But I have a feeling BERT-style training actually offers greater potential than GPT-style training for text generation, because BERT can edit its writing—see discussion here. You could get really creative, like doing genetic algorithms with a population of texts generated on a particular topic, and using BERT to mutate and recombine texts in the population until you get one with really high plausibility. (Some words at the beginning could stay fixed as the "prompt".)

Fix idea #2: Prohibit updates towards lookahead

This fix is more interesting because the core idea seems applicable for stopping mesa-optimization in other contexts too.

v1 of the idea is: During training, before any gradient update, we make a prediction with beam search amplified GPT. And if an update to naive GPT would mean moving it towards beam search amplified GPT (because beam search amplified GPT assigns a higher probability than naive GPT to the actual next word), we just skip the update and keep moving. Nick Bostrom refers to this general approach as "stunting" in Chapter 9 of Superintelligence.

This diagram from above is an example of an update that would get skipped.

Training will take longer, obviously. But only insofar as all this mesa-optimization stuff was a problem in the first place.
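A minimal sketch of the v1 rule, with hypothetical names throughout: the real gradient step is abstracted into an apply_update callback, and the distributions are the fictional ones from the cheese example.

```python
def should_skip_update(naive_dist, amplified_dist, true_word):
    """True if amplified GPT assigns the actual next word a higher probability
    than naive GPT, i.e. the update would move naive GPT towards it."""
    return amplified_dist.get(true_word, 0.0) > naive_dist.get(true_word, 0.0)

def train_step(naive_dist, amplified_dist, true_word, apply_update):
    """Apply the update unless it points towards amplified GPT.
    Returns True if the update was applied."""
    if should_skip_update(naive_dist, amplified_dist, true_word):
        return False
    apply_update(true_word)
    return True

applied = []  # stands in for the real weight update
naive = {"orange": 0.6, "marigold": 0.25, "yellow": 0.15}
amplified = {"orange": 0.42, "marigold": 0.51, "yellow": 0.07}

updated_on_marigold = train_step(naive, amplified, "marigold", applied.append)
updated_on_orange = train_step(naive, amplified, "orange", applied.append)
```

Here the "marigold" example is skipped (lookahead would have helped), while the "orange" example is trained on as usual.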

v2 of the idea is to add a penalty term to GPT's loss function that causes it to update away from whatever beam search amplified GPT predicts.

A risk with v2 is that GPT is incentivized to emulate beam search amplified GPT so it can avoid making its prediction (i.e. an internal mesa-optimizer gets developed anyway). Skipping updates as in v1 is probably safer, but the fundamental problem seems possible either way.

So v3 is to track gradient updates towards/away from beam search amplified GPT and ensure a rough balance. We don't want systematic movement towards or away from beam search amplified GPT—we don't want incentives to develop either it or its inverse. This could be accomplished by e.g. rescaling gradients on the fly so the average movement towards beam search amplified GPT's answers is around 0. (To be safe, you might want to balance both the number of updates in each direction and their typical magnitude.) An advantage is that training can go faster than in v1 because less updating is skipped.
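One way the balancing in v3 could be sketched is below. This is my own illustrative scheme, not a worked-out proposal: it measures "movement" by a scalar update size (e.g. a gradient norm, which is an assumption), classifies each update as towards or away using the same test as v1, and shrinks an update whenever applying it in full would put one direction more than slack ahead of the other.

```python
class BalancedUpdateScaler:
    """Keeps cumulative movement towards and away from amplified GPT
    within `slack` of each other by rescaling individual updates."""

    def __init__(self, slack=1.0):
        self.slack = slack
        self.moved = {"toward": 0.0, "away": 0.0}

    def factor(self, naive_p, amplified_p, update_size):
        """Return a multiplier in [0, 1] for this update's gradient.
        naive_p / amplified_p: probability each model gives the true next word."""
        if update_size <= 0:
            return 0.0
        # Ties are arbitrarily counted as "away" in this sketch.
        own = "toward" if amplified_p > naive_p else "away"
        other = "away" if own == "toward" else "toward"
        budget = self.moved[other] + self.slack - self.moved[own]
        scale = max(0.0, min(1.0, budget / update_size))
        self.moved[own] += scale * update_size
        return scale

scaler = BalancedUpdateScaler(slack=1.0)
# Three updates in a row that all point towards amplified GPT.
toward_factors = [scaler.factor(0.25, 0.51, 0.8) for _ in range(3)]
# One update pointing away, which frees up budget for later "toward" updates.
away_factor = scaler.factor(0.6, 0.42, 0.5)
next_toward_factor = scaler.factor(0.25, 0.51, 0.8)
```

The first "toward" update goes through in full, the second is shrunk, and the third is blocked until an "away" update restores some budget. This only balances total magnitude. Balancing the number of updates in each direction as well would need an extra counter.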

Beam search isn't the only possible way to do lookahead though.

In particular, the distribution of first words in beam search completions will only assign non-zero probability to words which are the first word of a completion. That leaves room for predictive gains by doing lookahead for other words.

So v4 of the idea is to sample from the infinite beam width distribution, particle filter style, and update in such a way that your expected update towards next-word predictions from the infinite beam width distribution is 0. (Not sure how to do this concretely, but it seems like it might be possible.)

The broader issue here is the nearest unblocked strategy problem. The problem seems relatively tractable in this case, because e.g. beam search based stunting seems to remove the incentive for any sort of lookahead insofar as it helps in the cases where beam search amplified GPT outperforms. (I suspect mesa-optimization is not an all-or-nothing proposition, and it will matter how much it has been refined + how readily it is deployed.)

Hence, v5 of the idea—even more underspecified than v4 at this point—is adversarial anti-lookahead stunting. Characterize some kind of "lookahead amplifier" which takes naive GPT as input and attempts to amplify it by doing lookahead-type things in a very freeform way. Train this and make it really good at amplifying GPT to achieve better next-word prediction. Then stunt naive GPT during training by forcing its average update towards the next-word predictions of lookahead-amplified GPT to be 0. (Perhaps you'd want to be retraining the lookahead amplifier repeatedly over the course of GPT's training, in case different lookahead strategies work best at different GPT maturity levels.) In production, go ahead and make use of lookahead-amplified GPT to make performance as competitive as possible. (It wouldn't shock me if this scheme actually produced superior performance, if the lookahead amplifier learns lookahead that GPT couldn't learn for itself, due to a better inductive bias.)

Ideally, GPT's training process would be monitored to ensure that GPT + lookahead in any form consistently outperforms naive GPT. For bonus points, have the lookahead used in the monitoring be a form of lookahead which is not used for stunting (a "dev set").

Do these fixes actually help?

An objection to Fix #2 is the possibility of mesa-optimization which isn't very much like lookahead. For example, if we're training on text that describes a newly discovered animal, the system has an incentive to try & figure out the animal for itself internally so it can better predict how it will be described—and it might make use of some optimization algorithm, genetic algorithms say, to achieve this. (GPT can't really implement a genetic algorithm effectively given its architecture, but still.)

Another objection is that pulling optimization up from the mesa level, as in the "BERT + genetic algorithms" idea or the "lookahead amplifier in production" idea, isn't actually helpful. There's still optimization happening, and the system as a whole could still make devious plans or search for harmful ideas.

However, less mesa-optimization means less risk that transformer blocks develop optimization/planning capabilities and reuse them in contexts we didn't expect. It's easier to reason about searching for text which maximizes plausibility than a mysterious mesa-objective. In particular, an agent that gets instantiated internally might search for side-channel attacks in the text generation machinery and surrounding system (especially risky if GPT has read about this stuff). But it seems very unlikely that a search for plausibility-maximizing text would cause this (except maybe if those attacks somehow got activated during training). Non-mesa-optimization also has parameters that allow us to control its strength without retraining the model, and we have a better understanding of how it works.

There's still a lot of potential for misuse & accidents either way, of course.

OpenAI doesn't offer beam search? Why? Is GPT-3 already mesa-optimizing?

Up until now, I've been pretending that maximizing plausibility (product of individual word probabilities) is a good way to generate text. But beam search doesn't even seem to be an option in the GPT-3 interface. (Please correct me if I'm missing something!)

Why is beam search missing? One possibility is that GPT-3 already does internal lookahead. OpenAI tried beam search, found it didn't improve text generation, and didn't bother adding it as an option. In other words, GPT-3 is already mesa-optimizing 😲

Another possibility:

[Generated text:] "I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with my dog. I'm not sure if I'll ever be able to walk with my dog."

...

...The generated words following the context are reasonable, but the model quickly starts repeating itself! This is a very common problem in language generation in general and seems to be even more so in greedy and beam search...

...

...Recently, there has been more evidence though that the apparent flaws of greedy and beam search - mainly generating repetitive word sequences - are caused by the model (especially the way the model is trained), rather than the decoding method, cf. Welleck et al. (2019).

From the Hugging Face post (emphasis mine). OK, this thing about language models that find repetitive text plausible sounds like a problem that will eventually get solved. Anything else?

As argued in Ari Holtzman et al. (2019), high quality human language does not follow a distribution of high probability next words. In other words, as humans, we want generated text to surprise us and not to be boring/predictable. The authors show this nicely by plotting the probability, a model would give to human text vs. what beam search does.

So let's stop being boring and introduce some randomness 🤪.

This is a much deeper & more interesting issue IMO. It may be that only a superintelligent language model will find human writing so boringly predictable that every word has high likelihood based on what came before.

Will there be an intermediate stage where prompting a language model with "I just had a brilliant and highly original idea related to X" will cause it to assign higher plausibilities to completions that are actually quite brilliant & original? (Is this the case for GPT-3 already?) I have no idea.

In any case, maybe we could get the benefits of both originality and avoidance of dead ends by sampling from beam search amplified GPT's next-word distribution to generate text? (This could be especially useful if Fix #2 has been applied and the GPT's ability to do lookahead for itself has been stunted.)
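A sketch of what that sampling loop could look like, assuming some function that returns beam search amplified GPT's next-word distribution for a given context. The toy_amplified_dist function and its numbers below are placeholders I made up.

```python
import random

def sample_next_word(dist, rng):
    """Draw one word at random, in proportion to its probability."""
    words = list(dist)
    return rng.choices(words, weights=[dist[w] for w in words], k=1)[0]

def generate(next_word_dist, prompt, num_words, rng):
    """Extend the prompt by sampling from next_word_dist one word at a time."""
    words = list(prompt)
    for _ in range(num_words):
        dist = next_word_dist(words)
        if not dist:  # no known continuation
            break
        words.append(sample_next_word(dist, rng))
    return words

def toy_amplified_dist(words):
    """Placeholder for beam search amplified GPT's next-word distribution."""
    if words[-1] == "is":
        return {"orange": 0.42, "marigold": 0.51, "yellow": 0.07}
    return {}

text = generate(toy_amplified_dist, ["Cheddar", "is"], 3, random.Random(0))
```

Unlike taking the single most plausible beam, this keeps some randomness in every word choice, while the lookahead baked into the distribution still steers probability away from dead ends.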

Note also that the surprisingness of human text could be an objection to the "GPT can do better on its training objective if it learns to do beam search for itself" claim above. If human text tends to have periodic surprises, using beam search to look for predictable completions may not help performance since those predictions aren't actually very likely.

However, it also may be the case that beam search ends up improving the accuracy of next-word prediction despite the fact that it doesn't generate interesting text.

17 comments

I'm skeptical that internal beam search would help in language modeling.

Language modeling is like predicting the weather, in the sense that even if you are literally as good as possible at it, your prediction accuracy still degrades rapidly as a function of the number of steps ahead you're looking.  So a predictor which seems (and is) frighteningly powerful at some short range L will do little better than random guessing if you chain its predictions up to some small multiple of L.

Weather is like this because of chaotic dynamics.  Language modeling is like this because

(a) Text is used to communicate: the writer expects the audience to learn something from the last X% of a text that they couldn't extrapolate from reading the first (100-X)%, or else they'd just stop and not write the remaining X%.

(b) By construction, language modeling gives you nothing to work with except the text itself, so you don't know who produced it or for whom.  So even if you were smart enough to guess what any individual human would say next (!), you don't know which human produced the text you're looking at.  (Or even whether it was a human at all.)

Thus (IMO), language modeling is not really about thinking ahead to find some "objectively correct" next move as in Chess/Go.  It's more about trying to guess what the author of this text will do in the very next step.  The author and the LM are almost sure to diverge after a few more steps, so even if the LM had a beam search oracle, I expect it wouldn't find it very useful.

To make the point concrete, I don't think "orange" is necessarily a bad guess here -- among other things, it would be the correct guess if the author were trying to illustrate the point of your example!

And if we were predicting this post itself, the true next token would not be orange or any other word but an ellipsis "...", which seems bizarre from the narrow perspective of the example, but is typical of the wild world LMs operate in.  (Which also contains typos, actually-incoherent writers, mangled formatting, the list goes on . . . )

So a predictor which seems (and is) frighteningly powerful at some short range L will do little better than random guessing if you chain its predictions up to some small multiple of L.

A system which develops small-L lookahead (for L > 1) may find large-L lookahead to be nearby in programspace. If so, incentivizing the development of small-L lookahead makes it more likely that the system will try large-L lookahead and find it to be useful as well (in predicting chess moves for instance).

My intuition is that small-L lookahead could be close to large-L lookahead in programspace for something like an RNN, but not for GPT-3's transformer architecture.

Anyway, the question here isn't whether lookahead will be perfectly accurate, but whether the post-lookahead distribution of next words will allow for improvement over the pre-lookahead distribution. Lookahead is almost certainly going to do better than random guessing; even topic models can do that.

By construction, language modeling gives you nothing to work with except the text itself, so you don't know who produced it or for whom.

Are you saying that GPT-3's training corpus was preprocessed to remove information about the author, title, and publication venue? Or are you only talking about what happens when this info is outside the context window?

Are you saying that GPT-3's training corpus was preprocessed to remove information about the author, title, and publication venue? Or are you only talking about what happens when this info is outside the context window?

No, it's a more philosophical point.  Even if such things appear in the context window, they're simply more text, and convey the same kind of information: not "the denotation of these words is factually true," but "these words are part of the text."

For example, the mere appearance of something like

Title: Why GPT wants to mesa-optimize & how we might change this 

Author: John_Maxwell

does not guarantee that the text following it bears that title, or was written by that author.  (As I am illustrating right now.)

Of course, one can design datasets where information like this is provided more authoritatively -- say, always at the start of each text, curated for quality, etc.  (GPT isn't like that, but Grover and CTRL kind of are, in different ways.)

But even that can only go so far.  If the author is "Julius Caesar," does that mean the historical figure, some internet poster with that handle, or any number of other possibilities?  A passage of fiction written in a character's voice -- is the appropriate author cue the actual writer (who may have written in many different voices over their career) or the character?  (Note that the character is a much better answer to the question "who does this sound like?")  And doesn't the date matter too, so we know whether this post in the venue "Less Wrong" was on 2010's LW or 2020's?

Fundamentally, language modeling is about understanding structures in decontextualized blocks of contiguous words. You can try to hack in some sidechannels to provide context, but there's no way they will capture everything needed to locate the text fully in its social, physical, and temporal position within the broader world. And just as a definitional matter, these sidechannels are modifications to "language modeling," which in its purest sense is just about filling in an arbitrary text from substrings of it (and no other information).

My intuition is that small-L lookahead could be close to large-L lookahead in programspace for something like an RNN, but not for GPT-3's transformer architecture.

Yeah, not for transformers I think.

Anyway, the question here isn't whether lookahead will be perfectly accurate, but whether the post-lookahead distribution of next words will allow for improvement over the pre-lookahead distribution.

capybaralet's point about conservation of expected evidence applies here -- GPT is trying to be optimal at next-step prediction, and an optimal next-step predictor should not get improved by lookahead, it should already have those facts priced in to its next-step prediction.

If we then say "the mechanism for pricing them in is doing internal lookahead," then we are imagining that lookahead operating over some predictor that is otherwise good but hasn't priced in lookahead yet. But I don't know why we should imagine the computation would naturally factor this way, when the benefits of lookahead are small and beam search takes a lot of parameters to implement internally.

Your philosophical point is interesting; I have a post in the queue about that. However I don't think it really proves what you want it to.

Having John_Maxwell in the byline makes it far more likely that I'm the author of the post.

If humans can make useful judgements re: whether this is something I wrote, vs something nostalgebraist wrote to make a point about bylines, I don't see why a language model can't do the same, in principle.

GPT is trying to be optimal at next-step prediction, and an optimal next-step predictor should not get improved by lookahead, it should already have those facts priced in to its next-step prediction.

A perfectly optimal next-step predictor would not be improved by lookahead or anything else; it's perfectly optimal. I'm talking about computational structures which might be incentivized during training when the predictor is suboptimal. (It's still going to be suboptimal after training with current technology, of course.)

In orthonormal's post they wrote:

...GPT-3's ability to write fiction is impressive- unlike GPT-2, it doesn't lose track of the plot, it has sensible things happen, it just can't plan its way to a satisfying resolution.

I'd be somewhat surprised if GPT-4 shared that last problem.

I suspect that either GPT-4 will still be unable to plan its way to a satisfying resolution, or GPT-4 will develop some kind of internal lookahead (probably not beam search, but beam search could be a useful model for understanding it) which is sufficiently general to be re-used across many different writing tasks. (Generality takes fewer parameters.) I don't know what the relative likelihoods of those possibilities are. But the whole idea of AI safety is to ask what happens if we succeed.

Why is beam search missing? One possibility is that GPT-3 already does internal lookahead. OpenAI tried beam search, found it didn't improve text generation, and didn't bother adding it as an option. In other words, GPT-3 is already mesa-optimizing 😲

Beam search has never worked for likelihood-trained NNs, since at least char-RNNs back in 2015. Beam search does trigger repetition and other pathologies in GPT, see "The Curious Case of Neural Text Degeneration", Holtzman et al 2019. And while unlikelihood training seems to help, it's not a silver bullet, and is a bit ad hoc (especially if you think of it in terms of reinforcement learning).

Seq2seq used beam search and found it helped (https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43155.pdf). It was standard practice in the early days of NMT; I'm not sure when that changed.

This blog post gives some insight into why beam search might not be a good idea, and is generally very interesting: https://benanne.github.io/2020/09/01/typicality.html

It still is; it's just that beam search (or other search strategies) seems to be mostly useful for closed-end short text generation. Translating a sentence is apparently a task with enough right-or-wrong-ness to it that beam search taps into no pathologies. But they get exposed for open-ended longform generation.

In this instance, GPT has an incentive to do internal lookahead. But it's unclear how frequently these situations actually arise

I'm going with "very frequently, perhaps universally". An example I came up with here was choosing "a" vs "an" which depends on the next word.

I think writing many, maybe most, sentences, requires some idea of how the sentence structure is going to be laid out, and that "idea" extends beyond the next token. Ditto at the paragraph level etc.

So I think it already does lookahead in effect, but I don't think it does it by "beam search" per se. I think it's more like "using concepts that extend over many tokens", concepts like "this sentence has the following overall cadence..." and "this sentence conveys the following overall idea..." and "we're in the middle of writing out this particular idiomatic phrase". The training simultaneously incentivizes both finding the right extended concepts for where you're at in the text, and choosing a good word in light of that context.

This post distinguishes between mesa-optimization and learned heuristics. What you're describing sounds like learned heuristics. ("Learning which words are easy to rhyme" was an example I gave in the post.) Learned heuristics aren't nearly as worrisome as mesa-optimization because they're harder to modify and misuse to do planning in unexpected domains. When I say "lookahead" in the post I'm pretty much always referring to the mesa-optimization sort.

Suppose I said (and I actually believe something like this is true):

"GPT often considers multiple possibilities in parallel for where the text is heading—including both where it's heading in the short-term (is this sentence going to end with a prepositional phrase or is it going to turn into a question?) and where it's heading in the long-term (will the story have a happy ending or a sad ending?)—and it calculates which of those possibilities are most likely in light of the text so far. It chooses the most likely next word in light of this larger context it figured out about where the text is heading."

If that's correct, would you call GPT a mesa-optimizer?

Well I suppose mesa-optimization isn't really a binary is it? Like, maybe there's a trivial sense in which self-attention "mesa-optimizes" over its input when figuring out what to pay attention to.

But ultimately, what matters isn't the definition of the term "mesa-optimization", it's the risk of spontaneous internal planning/optimization that generalizes in unexpected ways or operates in unexpected domains. At least in my mind. So the question is whether this considering multiple possibilities about text stuff could also improve its ability to consider multiple possibilities in other domains. Which depends on whether the implementation of "considering multiple possibilities" looks more like beam search vs very domain-adapted heuristics.

I think the Transformer is successful in part because it tends to solve problems by considering multiple possibilities, processing them in parallel, and picking the one that looks best. (Selection-type optimization.) If you train it on text prediction, that's part of how it will do text prediction. If you train it on a different domain, that's part of how it will solve problems in that domain too.

I don't think GPT builds a "mesa-optimization infrastructure" and then applies that infrastructure to language modeling. I don't think it needs to. I think the Transformer architecture is already raring to go forth and mesa-optimize, as soon as you give it any optimization pressure to do so.

So anyway your question is: can it display foresight / planning in a different domain without being trained in that domain? I would say, "yeah probably, because practically every domain is instrumentally useful for text prediction". So somewhere in GPT-3's billions of parameters I think there's code to consider multiple possibilities, process them in parallel, and pick the best answer, in response to the question of What will happen next when you put a sock in a blender? or What is the best way to fix an oil leak? (not just those literal words as a question, but the concepts behind them, however they're invoked).

(Having said that, I don't think GPT-3 specifically will do side-channel attacks, but for other, unrelated reasons that are off-topic. Namely, I don't think it is capable of making the series of new insights required to develop an understanding of itself and its situation and then take appropriate actions. That's based on my speculations here.)

I didn't read the post (yet...), but I'm immediately skeptical of the claim that beam search is useful here ("in principle"), since GPT-3 is just doing next step prediction (it is never trained on its own outputs, IIUC). This means it should always just match the conditional P(x_t | x_1, ..., x_{t-1}). That conditional itself can be viewed as being informed by possible future sequences, but conservation of expected evidence says we shouldn't be able to gain anything by doing beam search if we already know that conditional. Now it's true that efficiently estimating that conditional using a single forward pass of a transformer might involve approximations to beam search sometimes.
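For reference, the sense in which the conditional is "informed by possible future sequences" can be written as a marginalization over the next k tokens. The horizon k is arbitrary, and the notation x_{<t} for the prefix x_1, ..., x_{t-1} is shorthand introduced here:

```latex
P(x_t \mid x_{<t})
  = \sum_{x_{t+1}, \dots, x_{t+k}} P(x_t, x_{t+1}, \dots, x_{t+k} \mid x_{<t})
```

A predictor that already outputs the exact left-hand side has every possible continuation priced in, so explicit search over the right-hand side adds nothing. Search over continuations can only help a predictor whose estimate of the left-hand side is imperfect.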

At a high level, I don't think we really need to be concerned with this form of "internal lookahead" unless/until it starts to incorporate mechanisms outside of the intended software environment (e.g. the hardware, humans, the external (non-virtual) world).

Now it's true that efficiently estimating that conditional using a single forward pass of a transformer might involve approximations to beam search sometimes.

Yeah, that's the possibility the post explores.

At a high level, I don't think we really need to be concerned with this form of "internal lookahead" unless/until it starts to incorporate mechanisms outside of the intended software environment (e.g. the hardware, humans, the external (non-virtual) world).

Is there an easy way to detect if it's started doing that / tell it to restrict its lookahead to particular domains? If not, it may be easier to just prevent it from mesa-optimizing in the first place. (The post has arguments for why that's (a) possible and (b) wouldn't necessarily involve a big performance penalty.)

My intuitions on this matter are:
1) Stopping mesa-optimizing completely seems mad hard.
2) Managing "incentives" is the best way to deal with this stuff, and will probably scale to something like 1,000,000x human intelligence. 
3) On the other hand, it probably won't scale forever.

To elaborate on the incentive management thing... if we figure that stuff out and do it right and it has the promise that I think it does... then it won't restrict lookahead to particular domains, but it will remove incentives for instrumental goal seeking.  

If we're still in a situation where the AI doesn't understand its physical environment and isn't incentivized to learn to control it, then we can do simple things like use a fixed dataset (as opposed to data we're collecting online) in order to make it harder for the AI to learn anything significant about its physical environment. 

Learning about the physical environment and using it to improve performance is not necessarily bad/scary absent incentives for control.  However, I worry that having a good world model makes an AI much more liable to infer that it should try to control and not just predict the world.

  1. Stopping mesa-optimizing completely seems mad hard.

As I mentioned in the post, I don't think this is a binary, and stopping mesa-optimization "incompletely" seems pretty useful. I also have a lot of ideas about how to stop it, so it doesn't seem mad hard to me.

  2. Managing "incentives" is the best way to deal with this stuff, and will probably scale to something like 1,000,000x human intelligence.

I'm less optimistic about this approach.

  1. There is a stochastic aspect to training ML models, so it's not enough to say "the incentives favor Mesa-Optimizing for X over Mesa-Optimizing for Y". If Mesa-Optimizing for Y is nearby in model-space, we're liable to stumble across it.

  2. Even if your mesa-optimizer is aligned, if it doesn't have a way to stop mesa-optimization, there's the possibility that your mesa-optimizer would develop another mesa-optimizer inside itself which isn't necessarily aligned.

  3. I'm picturing value learning via (un)supervised learning, and I don't see an easy way to control the incentives of any mesa-optimizer that develops in the context of (un)supervised learning. (Curious to hear about your ideas though.)

My intuition is that the distance between Mesa-Optimizing for X and Mesa-Optimizing for Y is likely to be smaller than the distance between an Incompetent Mesa-Optimizer and a Competent Mesa-Optimizer. If you're shooting for a Competent Human Values Mesa-Optimizer, it would be easy to stumble across a Competent Not Quite Human Values Mesa-Optimizer along the way. All it would take would be having the "Competent" part in place before the "Human Values" part. And running a Competent Not Quite Human Values Mesa-Optimizer during training is likely to be dangerous.

On the other hand, if we have methods for detecting mesa-optimization or starving it of compute that work reasonably well, we're liable to stumble across an Incompetent Mesa-Optimizer and run it a few times, but it's less likely that we'll hit the smaller target of a Competent Mesa-Optimizer.

By managing incentives I expect we can, in practice, do things like: "[telling it to] restrict its lookahead to particular domains"... or remove any incentive for control of the environment.

I think we're talking past each other a bit here.