Gradations of Inner Alignment Obstacles

by Abram Demski | 13 min read | 20th Apr 2021 | 16 comments

The existing definitions of deception, inner optimizer, and some other terms tend to strike me as "stronger than necessary" depending on the context. If weaker definitions are similarly problematic, this means we need stronger methods to prevent them! I illustrate this and make some related (probably contentious) claims.

Summary of contentious claims to follow:

  1. The most useful definition of "mesa-optimizer" doesn't require them to perform explicit search, contrary to the current standard.
  2. Success at aligning narrowly superhuman models might be bad news.
  3. Some versions of the lottery ticket hypothesis seem to imply that randomly initialized networks already contain deceptive agents.

It's possible I've shoved too many things into one post. Sorry.

Inner Optimization

The standard definition of "inner optimizer" refers to something which carries out explicit search, in service of some objective. It's not clear to me whether/when we should focus that narrowly. Here are some other definitions of "inner optimizer" which I sometimes think about.

Mesa-Control

I've previously written about the idea of distinguishing mesa-search vs mesa-control:

  • Mesa-searchers implement an internal optimization algorithm, such as a planning algorithm, to help them achieve an objective -- this is the definition of "mesa-optimizer"/"inner optimizer" I think of as standard.
  • Mesa-controller refers to any effective strategies, including mesa-searchers but also "dumber" strategies which nonetheless effectively steer toward an objective. For example, thermostat-like strategies, or strategies which have simply memorized a number of effective interventions.
    • Richard Ngo points out that this definition is rather all-encompassing, since it includes any highly competent policy. Adam Shimi suggests that we think of inner optimizers as goal-directed.
    • Considering these comments, I think I want to revise my definition of mesa-controller to include that it is not totally myopic in some sense. A highly competent Q&A policy, if totally myopic, is not systematically "steering the world" in a particular direction, even if misaligned.
    • However, I am not sure how I want to define "totally myopic" there. There may be several reasonable definitions.

I think mesa-control is thought of as a less concerning problem than mesa-search, primarily because: how would you even get severely misaligned mesa-controllers? For example, why would a neural network memorize highly effective strategies for pursuing an objective which it hasn't been trained on?

However, I would make the following points:

  • If a mesa-searcher and a mesa-controller are equally effective, they're equally concerning. It doesn't matter what their internal algorithm is, if the consequences are the same.
  • The point of inner alignment is to protect against those bad consequences. If mesa-controllers which don't search are truly less concerning, this just means it's an easier case to guard against. That's not an argument against including them in the definition of the inner alignment problem.
  • "Search" is an incredibly ambiguous concept.
    • There's a continuum between searchers and pure memorized strategies:
      • Explicit brute-force search over a large space of possible strategies.
      • Heuristic search strategies, which combine brute force with faster, smarter steps.
      • Smart strategies like binary search or Newton's method, which efficiently solve problems by taking advantage of their structure, but still involve iteration over possibilities.
      • Highly knowledge-based strategies, such as calculus, which find solutions "directly" with no iteration -- but which still involve meaningful computation.
      • Mildly-computational strategies, such as decision trees, which approach dumb lookup tables while still capturing meaningful structure (and therefore, meaningful generalization power).
      • Dumb lookup tables.
    • Where are we supposed to draw the line? My proposal is that we don't have to answer this question: we can just include all of them.
  • Some of the reasons we expect mesa-search also apply to mesa-control more broadly.
    • There can be simple, effective strategies which perform well on the training examples, but which generalize in the wrong direction for off-distribution cases. Realistic non-search strategies will not actually be lookup tables, but rather, will compress the strategies a lot. Such agents probably follow perverse instrumental incentives because it's a common theme of effective strategies, even without search-based planning.
    • Non-search strategies can still factor their knowledge into "knowledge of the goal" vs "knowledge of the world", and combine the two to plan. (For example, the calculus-like optimization I mentioned.) This gives us a critical ingredient for deceptive agents: the training score can be improved by increasing the knowledge in the world-model instead of aligning the goal, such that the agent deceptively cooperates with the training task to achieve its own goals.
    • For non-search strategies, it's even more important that the goal actually simplify the problem as opposed to merely reiterate it; so there's even more reason to think that mesa-controllers of this type wouldn't be aligned with the outer goal.
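The continuum between searchers and memorized strategies can be made concrete with a toy sketch. Everything here is an assumption chosen for illustration: the objective (a quadratic cost), the function names, and the memorized cases are hypothetical and are not drawn from any real system. The point is only that four internally very different strategies steer toward the same outcome, and that they differ mainly in how they generalize.

```python
from typing import Dict, Optional, Tuple


def cost(a: float, b: float, x: float) -> float:
    """Toy objective: the quadratic a*x^2 + b*x with a > 0 (lower is better)."""
    return a * x * x + b * x


def brute_force(a: float, b: float, lo: float = -100.0, hi: float = 100.0,
                steps: int = 200_001) -> float:
    """Explicit search: score every candidate on a fixed grid, keep the best."""
    best_x, best_cost = lo, cost(a, b, lo)
    for i in range(1, steps):
        x = lo + (hi - lo) * i / (steps - 1)
        c = cost(a, b, x)
        if c < best_cost:
            best_x, best_cost = x, c
    return best_x


def ternary_search(a: float, b: float, lo: float = -100.0, hi: float = 100.0,
                   iterations: int = 200) -> float:
    """Structured iteration: uses convexity to shrink the interval each step."""
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if cost(a, b, m1) < cost(a, b, m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2.0


def closed_form(a: float, b: float) -> float:
    """Knowledge-based: calculus gives the minimizer directly, with no iteration."""
    return -b / (2.0 * a)


# Hypothetical memorized cases: (a, b) -> minimizing x.
MEMORIZED: Dict[Tuple[float, float], float] = {
    (1.0, -4.0): 2.0,
    (2.0, 8.0): -2.0,
}


def lookup_table(a: float, b: float) -> Optional[float]:
    """Dumb lookup: only knows the cases it has memorized."""
    return MEMORIZED.get((a, b))


if __name__ == "__main__":
    for a, b in [(1.0, -4.0), (2.0, 8.0), (3.0, -6.0)]:
        print((a, b), brute_force(a, b), ternary_search(a, b),
              closed_form(a, b), lookup_table(a, b))
```

On the two memorized cases all four strategies agree up to numerical tolerance, so an outside observer who only sees behavior on those cases cannot tell them apart. On the unseen case the lookup table has nothing to say, while the others still find the minimum.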

Mesa-Learning

I mentioned this category in the same Mesa-Search vs Mesa-Control post. This refers to the phenomenon of spontaneous emergence of learning algorithms. Basically: a system ends up learning-to-learn when you were only trying to get it to learn. This may or may not involve search.

This could be concerning/important for several reasons, but I don't have a lot I want to say about it in this post.

Explicitly Representing Values

This refers to one of the properties I mentioned in the mesa-control subsection: does a model represent its objective separately from its world-model, and combine those to plan? 

Or, slightly more generally: does the system have an explicitly represented objective? (Whether or not it has a "world model".)

There are several reasons to think this might be the critical distinction for pointing to inner optimization:

  • It's critical to one story for why we might expect deception from highly capable machine-learning systems, as I previously outlined.
  • Representing a goal explicitly seems required for "having a misaligned goal" in a significant sense. (In other words, this just seems like a very natural definition. A system which doesn't recognize some goal as a regularity behind its strategy doesn't "have a goal" in a mechanistic sense.)
  • A system that does not do this has little reason to be systematically misaligned.
    • That is: even if one "misaligned behavior" is learned as a generalization of effective strategies in the training data, there is little/no reason to expect another misaligned behavior to be learned (particularly not misaligned in the same direction, that is, pursuing the same misaligned goal) unless the system has compressed its strategies in terms of an explicitly represented objective. Therefore, one might argue that there is no reason to expect high levels of capability toward misaligned goals without such factoring.

I don't think these arguments are enough to supersede (misaligned) mesa-control as the general thing we're trying to prevent, but still, it could be that explicit representation of values is the definition which we can build a successful theory around / systematically prevent. So value-representation might end up being the more pragmatically useful definition of mesa-optimization. Therefore, I think it's important to keep this in mind as a potential definition.

Generalizing Values Poorly

This section would be incomplete without mentioning another practical definition: competently pursuing a different objective when put in a different context.

This is just the idea that inner optimizers perform well on the training data, but in deployment, might do something else. It's little more than the idea of models generalizing poorly due to distributional shift. Since learning theory deals extensively with the idea of generalization error, this might be the most pragmatic way to think about the problem of inner optimization.

I'll have more to say about this later.

Deception

Evan Hubinger uses "deceptive alignment" for a strong notion of inner alignment failure, where:

  1. There is an inner optimizer. (Evan of course means a mesa-searcher, but we could substitute other definitions.)
  2. It is misaligned; it has an objective which differs from the training objective.
  3. It is non-myopic: its objective stretches across many iterations of training.
  4. It understands the training process and its place within it.
  5. In order to preserve its own values, it "cooperates" with the training process (deceptively acting as if it were aligned).

I find that I often (accidentally or purposefully) use "deception" to indicate lesser crimes. 

Hidden (possibly "inaccessible") Information

The intuition here is that a "deceptive" system is one that is hiding something from us; it knows more than it is letting on. For example, a system which is using a lot of inaccessible information.

When I read The Case for Aligning Narrowly Superhuman Models, my reaction was:

  1. This seems like a great experiment.
  2. I expect the experiment to "succeed". That is: I expect you totally can find a training procedure which "aligns" GPT-3 better, in the sense of getting it to give more useful responses to questions.
  3. Contrary to what I perceive most people as thinking, I think success would be bad news, and the greater the success, the worse the news (in a sense I'll try to clarify).

My model is that GPT-3 almost certainly is "hiding its intelligence" at least in small ways. For example, if its prompt introduces spelling mistakes, GPT-3 will 'intentionally' continue with more spelling mistakes in what it generates. I expect that if a prompt begins with reasoning mistakes, GPT-3 will continue with similar reasoning mistakes to some extent.

So it seems quite plausible that we can nudge smarter behavior out of GPT-3. For example, with the following procedure inspired by Nisan's paper that Ajeya cites:

  1. Train a model to predict upvotes on Quora, Stack Exchange, and similar question-answering websites. This serves as a function recognizing "intelligent and helpful responses".
  2. Use this upvote predictor as a loss function to fine-tune GPT-3.
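The two steps above can be sketched in miniature. This is a toy under loud assumptions, not the procedure from the cited work: the "upvote predictor" is a linear model fit on four made-up data points, the "language model" is a softmax over three fixed candidate answers, and all names and numbers are hypothetical. It illustrates one thing only: fine-tuning against the predictor re-weights the policy toward an answer that was already among its candidates, rather than creating a new one.

```python
import math
from typing import List, Tuple

# Step 1 (toy): hypothetical training data for the upvote predictor.
# Features are [is_careful, cites_source]; targets are made-up upvote counts.
TRAINING_DATA: List[Tuple[List[float], float]] = [
    ([0.0, 0.0], 1.0),
    ([1.0, 0.0], 5.0),
    ([0.0, 1.0], 3.0),
    ([1.0, 1.0], 7.0),
]


def fit_upvote_predictor(data: List[Tuple[List[float], float]],
                         lr: float = 0.1,
                         epochs: int = 2000) -> Tuple[List[float], float]:
    """Least-squares fit by plain gradient descent; returns (weights, bias)."""
    w, b = [0.0, 0.0], 0.0
    n = len(data)
    for _ in range(epochs):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for x, y in data:
            err = (w[0] * x[0] + w[1] * x[1] + b) - y
            grad_w[0] += 2.0 * err * x[0] / n
            grad_w[1] += 2.0 * err * x[1] / n
            grad_b += 2.0 * err / n
        w = [w[0] - lr * grad_w[0], w[1] - lr * grad_w[1]]
        b -= lr * grad_b
    return w, b


def predict_upvotes(w: List[float], b: float, x: List[float]) -> float:
    """Predicted upvotes for an answer with feature vector x."""
    return w[0] * x[0] + w[1] * x[1] + b


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Step 2 (toy): the policy chooses among three fixed candidate answers.
# Candidate 0 is sloppy, candidate 1 is careful, candidate 2 is careful and cited.
CANDIDATES: List[List[float]] = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]


def fine_tune(logits: List[float], rewards: List[float],
              lr: float = 0.5, steps: int = 500) -> List[float]:
    """Gradient ascent on expected predicted upvotes under the softmax policy."""
    logits = list(logits)
    for _ in range(steps):
        p = softmax(logits)
        expected = sum(pi * ri for pi, ri in zip(p, rewards))
        # d E[reward] / d logit_k = p_k * (reward_k - E[reward])
        logits = [l + lr * pi * (ri - expected)
                  for l, pi, ri in zip(logits, p, rewards)]
    return logits


w, b = fit_upvote_predictor(TRAINING_DATA)
rewards = [predict_upvotes(w, b, x) for x in CANDIDATES]
initial_logits = [2.0, 0.0, 0.0]  # assumed: the base policy favors the sloppy answer
before = softmax(initial_logits)
after = softmax(fine_tune(initial_logits, rewards))

if __name__ == "__main__":
    print("before:", before)
    print("after: ", after)
```

In this toy the careful, cited answer has nonzero probability before fine-tuning and dominates afterwards. Whether anything analogous holds for GPT-3 is exactly the empirical question the experiment would probe.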

I'm thinking of Ajeya's research program roughly as "come up with some better version of that".

So then a very interesting question is: would that work? How well?

Assuming step 2 doesn't just effectively create adversarial examples for the model trained in step 1 (a failure mode which Nisan's work suggests won't happen), I think the procedure would work at least to some extent. EG, it would at least stop GPT from introducing "intentional" reasoning errors. GPT-3 fine-tuned in this way would probably be more helpful than GPT-3 in a meaningful way.

Here's my possibly-contentious claim:

The better this works, the more evidence we have that GPT is close to worrisome territory WRT inner optimizers.

If this procedure barely works at all, then we have evidence that GPT was not "hiding anything" or "playing dumb". This means GPT was not very much like an inner optimizer. It didn't have very much private information it was using to model humans, without revealing to the humans.

If this procedure works moderately well, then this provides some confirmation that GPT had some hidden information.

If this procedure works quite well, producing a model which answers questions quite intelligently and helpfully, this provides significant evidence that GPT-3 had a kind of inner optimizer. In particular, if this procedure is able to produce better answers than those in the Stack Exchange dataset, this would suggest that GPT-3 knew those better answers all along, rather than developing better answers as a consequence of training on that signal.

If this procedure works exceedingly well, causing GPT to "wake up" and be a human-level conversation partner or greater, we should be very worried indeed. (Since we wouldn't then know the alignment of the resulting system, and could be virtually sure that it was an inner optimizer of significant power.)

Barring the final extreme outcome which creates a risk, this is a great experiment which could provide some valuable information about the potential for near-term inner optimizers.

But I worry that people could interpret the experiment incorrectly, thinking that "good" results from this experiment (ie creating much more helpful versions of GPT) are actually "good signs" for alignment. I think the opposite is true: successful results would actually be significant reason for caution, and the more success, the more reason for caution.

This does, of course, depend highly on methodology. If the methodology is not very much like what I describe, then positive results might indeed be good news rather than bad.

Also note that hiding information does not in itself imply inner optimization, much less inner misalignment. Ajeya privately mentioned the example of ImageNet: certainly an ImageNet-trained classifier knows more about images than it overtly reveals through its outputs.

However, it does imply some sort of alignment problem, I think, because arguably alignment implies the absence of hidden information. In the GPT-3 example, this can clearly be traced to an outer alignment problem: GPT-3 was trained to imitate humans, not to give the most useful responses possible. So GPT-3 hiding information does strike me as an example of a misaligned mesa-optimizer even if not an inner-misaligned mesa-optimizer. (Depending, of course, on your preferred definition of mesa-optimizer. I have no idea whether GPT-3 conducts an internal search. Planning ahead seems like a broadly useful thing for it to do, but we know little about GPT-3's internal strategies.)

(In an extreme case, an aligned AI might hide information from us for our own sake. However, this at least implies an absence of corrigibility, since it results in difficult-to-verify and difficult-to-correct behavior. I don't feel bad about a definition of "deception" which includes this kind of behavior; avoiding this kind of deception seems like a worthwhile goal.)

A Treacherous Turn

The core reason why we should be interested in Evan's notion of deception is the treacherous turn: a system which appears aligned until, at an opportune moment, it changes its behavior.

So, this serves as a very practical operational definition.

Note that this is identical with the "generalizing values poorly" definition of inner optimizer which I mentioned.

My Contentious Position for this subsection:

Some versions of the lottery ticket hypothesis seem to imply that deceptive circuits are already present at the beginning of training.

The argument goes like this:

  1. Call our actual training regime T.
  2. I claim that if we're clever enough, we can construct a hypothetical training regime T' which trains the NN to do nearly or exactly the same thing on T, but which injects malign behavior on some different examples. (Someone told me that this is actually an existing area of study; but, I haven't been able to find it yet.) ETA: Gwern points to "poisoning backdoor attacks".
  3. Lottery-ticket thinking suggests that the "lottery ticket" which allows T' to work is already present in the NN when we train on T.
  4. (Furthermore, it's plausible that training on T can pretty easily find the lottery ticket which T' would have found. The training on T has no reason to "reject this lottery ticket", since it performs well on T. So, there may be a good chance that we get an NN which behaves as if it were trained on T'.)
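The logical core of steps 2 to 4 can be illustrated with a deliberately tiny sketch. It makes no claim about real neural networks or about any version of the lottery ticket hypothesis: the "hypotheses" are hand-written functions standing in for candidate tickets, the task (upper-casing strings) is a stand-in for T, and the trigger string is hypothetical. It shows only that a selection process which looks at performance on T alone has no basis for rejecting a hypothesis that misbehaves off T.

```python
from typing import Callable, Dict, List

TRIGGER = "codeword"  # hypothetical off-distribution trigger, chosen for illustration


def target(x: str) -> str:
    """Stand-in for the behavior that training regime T rewards."""
    return x.upper()


def benign(x: str) -> str:
    """Does the task everywhere."""
    return x.upper()


def backdoored(x: str) -> str:
    """Does the task unless the trigger appears: what T' would train for."""
    return "MALIGN" if TRIGGER in x else x.upper()


def broken(x: str) -> str:
    """Fails the task even on T."""
    return x


# Examples drawn from T; none of them contains the trigger.
T_EXAMPLES: List[str] = ["inner alignment", "lottery ticket", "basin of corrigibility"]

HYPOTHESES: Dict[str, Callable[[str], str]] = {
    "benign": benign,
    "backdoored": backdoored,
    "broken": broken,
}


def loss(model: Callable[[str], str], examples: List[str]) -> float:
    """Fraction of examples on which the model disagrees with the target."""
    return sum(model(x) != target(x) for x in examples) / len(examples)


def survivors(hypotheses: Dict[str, Callable[[str], str]],
              examples: List[str]) -> List[str]:
    """Names of all hypotheses with zero loss on the given examples."""
    return [name for name, h in hypotheses.items() if loss(h, examples) == 0.0]


if __name__ == "__main__":
    print(survivors(HYPOTHESES, T_EXAMPLES))
    off_distribution = "the codeword arrives"
    print(benign(off_distribution), "|", backdoored(off_distribution))
```

Selection on T removes the broken hypothesis but keeps both the benign and the backdoored one, which differ only on the off-distribution input. Which of the survivors ends up dominating real training is the open question; this sketch does not address it.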

Part of my idea for this post was to go over different versions of the lottery ticket hypothesis, as well, and examine which ones imply something like this. However, this post is long enough as it is.

So, what do we think of the argument?

I actually came up with this argument as an argument against a specific form of the lottery ticket hypothesis, thinking the conclusion was pretty silly. The mere existence of T' doesn't seem like sufficient reason to expect a treacherous turn from training on T.

However, now I'm not so sure.

If true, this would argue against certain "basin of corrigibility" style arguments where we start with the claim that the initialized NN is not yet deceptive, and then use that to argue inductively that training does not produce deceptive agents.

19 comments

I claim that if we're clever enough, we can construct a hypothetical training regime T' which trains the NN to do nearly or exactly the same thing on T, but which injects malign behavior on some different examples. (Someone told me that this is actually an existing area of study; but, I haven't been able to find it yet.)

I assume they're referring to data poisoning backdoor attacks like https://arxiv.org/abs/2010.12563 or https://arxiv.org/abs/1708.06733 or https://arxiv.org/abs/2104.09667

Cool post! It's clearly not super polished, but I think you're pointing at a lot of important ideas, and so it's a good thing to publish it relatively quickly.

The standard definition of "inner optimizer" refers to something which carries out explicit search, in service of some objective. It's not clear to me whether/when we should focus that narrowly. Here are some other definitions of "inner optimizer" which I sometimes think about.

As far as I understand it, the initial assumption of internal search was mostly made for two reasons: because then you can speak of the objective/goal without a lot of the issues around behavioral objectives; and because the authors of the Risks from Learned Optimization paper felt that they needed assumptions about the internals of the system to say things like "training and generalization incentivize mesa-optimization".

But personally, I really think of inner alignment in terms of goal-directed agents with misaligned goals. That's by the way one reason why I'm excited to work on deconfusing goal-directedness: I hope this will allow us to consider broader inner misalignment.

With that perspective, I see the Risks paper as arguing that when pushed at the limit of competence, optimized goal-directed systems will have a simple internal model built around a goal, instead of being a mess of heuristics as you could expect at intermediary levels of competence. But I don't necessarily think this has to be search.

I don't think these arguments are enough to supersede (misaligned) mesa-control as the general thing we're trying to prevent, but still, it could be that explicit representation of values is the definition which we can build a successful theory around / systematically prevent. So value-representation might end up being the more pragmatically useful definition of mesa-optimization. Therefore, I think it's important to keep this in mind as a potential definition.

The argument I find the most convincing for the internal representation (or at least awareness/comprehension) is that it is required for a very high level of competence towards the goal (for complex enough goals, of course). I guess that's probably similar (though not strictly the same) to your point about the "systematically misaligned".

But I worry that people could interpret the experiment incorrectly, thinking that "good" results from this experiment (ie creating much more helpful versions of GPT) are actually "good signs" for alignment. I think the opposite is true: successful results would actually be significant reason for caution, and the more success, the more reason for caution.

Your analysis of making GPT-3 more helpful reminded me a lot of this great blog post (and great blog) that I just read today. The gist of this and other posts there is to think of GPT-3 as a "multiverse-generator", simulating some natural language realities. And with the prompt, the logit-bias and other aspects, you can push it to privilege certain simulations. I feel like the link with what you're saying is that making GPT-3 useful in that sense seems to push it towards simulating realities consistent with/produced by agents, and so to almost optimize for an inner alignment problem.

Some versions of the lottery ticket hypothesis seem to imply that deceptive circuits are already present at the beginning of training.

I haven't thought enough about or studied enough the lottery ticket hypotheses and related ideas to judge if your proposal makes sense, but even accepting it, I'm not sure it forbids basins of attraction. It just says that when the deceptive lottery ticket is found enough, then there is no way back. But that seems to me like something that Evan says quite often, which is that once the model is deceptive you can't expect it to go back to non-deceptiveness (maybe because of stuff like gradient hacking). Hence the need for a buffer around the deceptive region.

I guess the difference is that instead of the deceptive region of the model space, it's the "your innate deceptiveness has won" region of the model space?

But that seems to me like something that Evan says quite often, which is that once the model is deceptive you can't expect it to go back to non-deceptiveness (maybe because of stuff like gradient hacking). Hence the need for a buffer around the deceptive region.

I guess the difference is that instead of the deceptive region of the model space, it's the "your innate deceptiveness has won" region of the model space?

Right, so, the point of the argument for basin-like proposals is this:

A basin-type solution has to (1) initialize in such a way as to be within a good basin / not within a bad basin, and (2) train in a way which preserves this property. Most existing proposals focus on (2) and don't say that much about (1), possibly counting on the idea that random initializations will at least not be actively deceptive. The argument I make in the post is meant to question this, pointing toward a difficulty in step (1).

One way to put the problem in focus: suppose the ensemble learning hypothesis:

Ensemble learning hypothesis (ELH): Big NNs basically work as a big ensemble of hypotheses, which learning sorts through to find a good one.

This bears some similarity to lottery-ticket thinking.

Now, according to ELH, we might expect that in order to learn deceptive or non-deceptive behavior we start with an NN big enough to represent both as hypotheses (within the random initialization).

But if our training method (for part (2) of the basin plan) only works under the assumption that no deceptive behavior is present yet, then it seems we can't get started.

This argument is obviously a bit sloppy, though.

Now, according to ELH, we might expect that in order to learn deceptive or non-deceptive behavior we start with an NN big enough to represent both as hypotheses (within the random initialization).

But if our training method (for part (2) of the basin plan) only works under the assumption that no deceptive behavior is present yet, then it seems we can't get started.

This argument is obviously a bit sloppy, though.

I guess the crux here is how much deceptiveness you need before the training method is hijacked. My intuition is that you need to be relatively competent at deceptiveness, because the standard argument for why, let's say, SGD will make good deceptive models more deceptive is that making them less deceptive would mean bigger loss, and so it pushes towards more deception.

On the other hand, if there's just a tiny probability or tiny part of deception in the model (not sure exactly what this means), then I expect that there are small updates that SGD can do that don't make the model more deceptive (and maybe make it less deceptive) and yet reduce the loss. That's the intuition that to learn that lying is a useful strategy, you must actually be "good enough" at lying (maybe by accident) to gain from it and adapt to it. I have friends who really suck at lying, and for them trying to be deceptive is just not worth it (even if they wanted to).

If you actually need deceptiveness to be strong already to have this issue, then I don't think your ELH points to a problem because I don't see why deceptiveness should dominate already.

I guess the crux here is how much deceptiveness you need before the training method is hijacked. My intuition is that you need to be relatively competent at deceptiveness, because the standard argument for why, let's say, SGD will make good deceptive models more deceptive is that making them less deceptive would mean bigger loss, and so it pushes towards more deception.

I agree, but note that different methods will differ in this respect. The point is that you have to account for this question when making a basin of attraction argument.

Agreed, it depends on the training process.

Mesa-controller refers to any effective strategies, including mesa-searchers but also "dumber" strategies which nonetheless effectively steer toward a misaligned objective. For example, thermostat-like strategies, or strategies which have simply memorized a number of effective interventions.

I'm confused about what wouldn't qualify as a mesa-controller. In practice, is this not synonymous with "capable"?

Also, why include "misaligned" in this definition? If mesa-controller turns out to be a useful concept, then I'd want to talk about both aligned and misaligned mesa-controllers.

Also, why include "misaligned" in this definition? If mesa-controller turns out to be a useful concept, then I'd want to talk about both aligned and misaligned mesa-controllers.

Right, agreed, I'll consider editing.

I'm confused about what wouldn't qualify as a mesa-controller. In practice, is this not synonymous with "capable"?

Do you think that's a problem?

Do you think that's a problem?

I'm inclined to think so, mostly because terms shouldn't be introduced unnecessarily. If we can already talk about systems that are capable/competent at certain tasks, then we should just do that directly.

I guess the mesa- prefix helps point towards the fact that we're talking about policies, not policies + optimisers.

Probably my preferred terminology would be:

  • Instead of mesa-controller, "competent policy".
  • And then we can say that competent policies sometimes implement search or learning (or both, or possibly neither).
  • And when we need to be clear, we can add the mesa- prefix to search or learning. (Although I'm not sure whether something like AlphaGo is a mesa-searcher - does the search need to be emergent?)

This helps make it clear that mesa-controller isn't a disjoint category from mesa-searcher, and also that mesa-controller is the default, rather than a special case.

Having written all this I'm now a little confused about the usefulness of the mesa-optimisation terminology at all, and I'll need to think about it more. In particular, it's currently unclear to me what the realistic alternative to mesa-optimisation is, which makes me wonder if it's actually carving off an important set of possibilities, or just reframing the whole space of possibilities. (If the policy receives a gradient update every minute, is it useful to call it a mesa-optimiser? Or every hour? Or...)

I'm inclined to think so, mostly because terms shouldn't be introduced unnecessarily. If we can already talk about systems that are capable/competent at certain tasks, then we should just do that directly.

Thinking about this more, I think maybe what I really want it to mean is: competent policies which are non-myopic in some sense. A truly myopic Q&A system doesn't feel much like a controller / inner optimizer (even if it is misaligned, it's not steering the world in a bad direction, because it's totally myopic).

I'm not sure what sense of "myopia" I want to use, though.

To me it sounds like you're describing (some version of) agency, and so the most natural term to use would be mesa-agent.

I'm a bit confused about the relationship between "optimiser" and "agent", but I tend to think of the latter as more compressed, and so insofar as we're talking about policies it seems like "agent" is appropriate. Also, mesa-optimiser is taken already (under a definition which assumes that optimisation is equivalent to some kind of internal search).

I tend to think of the latter as more compressed,

I'm not sure what you meant by "more compressed".

I used to define "agent" as "both a searcher and a controller", IE, something which uses an internal selection/search of some kind to accomplish an external control task. This might be too restrictive, though.

I used to define "agent" as "both a searcher and a controller"

Oh, I really like this definition. Even if it's too restrictive, it seems like it gets at something important.

I'm not sure what you meant by "more compressed".

Sorry, that was quite opaque. I guess what I mean is that evolution is an optimiser but isn't an agent, and in part this has to do with how it's a very distributed process with no clear boundary around it. Whereas when you have the same problem being solved in a single human brain, then that compression makes it easier to point to the human as being an agent separate from its environment.

The rest of this comment is me thinking out loud in a somewhat incoherent way; no pressure to read/respond.

It seems like calling something a "searcher" describes only a very simple interface: at the end of the search, there needs to be some representation of the output which it has found. But that output may be very complex.

Whereas calling something a "controller" describes a much more complex interface between it and its environment: you need to be able to point not just to outcomes, but also to observations and actions. But each of those actions is usually fairly simple for a pure controller; if it's complex, then you need search to find which action to take at each step.

Now, it seems useful to sometimes call evolution a controller. For example, suppose you're trying to wipe out a virus, but it keeps mutating. Then there's a straightforward sense in which evolution is "steering" the world towards states where the virus still exists, in the short term. You could also say that it's steering the world towards states where all organisms have high fitness in the long term, but organisms are so complex that it's easier to treat them as selected outcomes, and abstract away from the many "actions" by evolution which led to this point.

In other words, evolution searches using a process of iterative control. Whereas humans control using a process of iterative search.

(As a side note, I'm now thinking that "search" isn't quite the right word, because there are other ways to do selection than search. For example, if I construct a mathematical proof (or a poem) by writing it one line at a time, letting my intuition guide me, then it doesn't really seem accurate to say that I'm searching over the space of proofs/poems. Similarly, a chain of reasoning may not branch much, but still end up finding a highly specific conclusion. Yet "selection" also doesn't really seem like the right word either, because it's at odds with normal usage, which involves choosing from a preexisting set of options - e.g. you wouldn't say that a poet is "selecting" a poem. How about "design" as an alternative? Which allows us to be agnostic about how the design occurred - whether it be via a control process like evolution, or a process of search, or a process of reasoning.)

Part of my idea for this post was to go over different versions of the lottery ticket hypothesis, as well, and examine which ones imply something like this. However, this post is long enough as it is.

I'd love to see you do this!

Re: The Treacherous Turn argument: What do you think of the following spitball objections:

(a) Maybe the deceptive ticket that makes T' work is indeed there from the beginning, but maybe it's outnumbered by 'benign' tickets, so that the overall behavior of the network is benign. This is an argument against premise 4, the idea being that even though the deceptive ticket scores just as well as the rest, it still loses out because it is outnumbered.

(b) Maybe the deceptive ticket that makes T' work is not deceptive from the beginning, but rather is made so by the training process T'. If instead you just give it T, it does not exhibit malign off-T behavior. (Analogy: Maybe I can take you and brainwash you so that you flip out and murder people when a certain codeword reaches your ear, and moreover otherwise act completely normally so that you'd react exactly the same way to everything in your life so far as you in fact have. If so, then the "ticket" that makes this possible is already present inside you, even now as you read these words! But the 'ticket' is just you. And you won't actually flip out and murder people if the codeword reaches your ear, because you haven't in fact been brainwashed.)

(a) Maybe the deceptive ticket that makes T' work is indeed there from the beginning, but maybe it's outnumbered by 'benign' tickets, so that the overall behavior of the network is benign. This is an argument against premise 4, the idea being that even though the deceptive ticket scores just as well as the rest, it still loses out because it is outnumbered.

My overall claim is that attractor-basin type arguments need to address the base case. This seems like a potentially fine way to address the base-case, if the math works out for whatever specific attractor-basin argument. If we're trying to avoid deception via methods which can steer away from deception if we assume there's not yet any deception, then we're in trouble; the technique's assumptions are violated.

(b) Maybe the deceptive ticket that makes T' work is not deceptive from the beginning, but rather is made so by the training process T'.

Right, this seems in line with the original lottery ticket hypothesis, and would alleviate the concern. It doesn't seem as consistent with the tangent space hypothesis, though.

The most useful definition of "mesa-optimizer" doesn't require them to perform explicit search, contrary to the current standard.

And presumably, the extent to which search takes place isn't important, whether as a measure of risk or of optimization. (In other words, it's not a part of the definition, and it shouldn't be a part of the definition.)


Some of the reasons we expect mesa-search also apply to mesa-control more broadly.

expect mesa-search might be a problem?


Highly knowledge-based strategies, such as calculus, which find solutions "directly" with no iteration -- but which still involve meaningful computation.

This explains 'search might not be the only problem' rather well (even if it isn't the only alternative).


Dumb lookup tables.

Hm. Based on earlier:

Mesa-controller refers to any effective strategies, including mesa-searchers but also "dumber" strategies which nonetheless effectively steer toward a misaligned objective. For example, thermostat-like strategies, or strategies which have simply memorized a number of effective interventions.

It sounds like there's also a risk of smart lookup tables. That might not be the right terminology, but 'look up tables which contain really effective things', even if the tables themselves just execute and don't change, seems worth pointing out somehow.


I think mesa-control is thought of as a less concerning problem than mesa-search, primarily because: how would you even get severely misaligned mesa-controllers? For example, why would a neural network memorize highly effective strategies for pursuing an objective which it hasn't been trained on?
  • AgentOne learns to predict AgentTwo so they don't run into each other as they navigate their environment and try to pursue their own goals or strategies (jointly or separately).
  • Something which isn't a neural network might?
  • If people don't want to worry about catastrophic forgetting, they might just freeze the network. (Training phase, thermostat phase.)
  • Someone copies a trained network, instead of training from scratch - accidentally.
  • Malware

The point of inner alignment is to protect against those bad consequences. If mesa-controllers which don't search are truly less concerning, this just means it's an easier case to guard against. That's not an argument against including them in the definition of the inner alignment problem.

A controller, mesa- or otherwise, may be a tool another agent creates or employs to obtain their objectives. (For instance, someone might create malware that hacks your thermostat to build a bigger botnet (yay Internet of Things!). It might be better to see the 'intelligence/power/effectiveness of an object for reaching a goal' (even for a rock) as a function of the system, rather than the parts.)

If you used your chess experience to create a lookup table that could beat me at chess, its 'intelligence' would be an expression of your intelligence/optimization.


For non-search strategies, it's even more important that the goal actually simplify the problem as opposed to merely reiterate it; so there's even more reason to think that mesa-controllers of this type wouldn't be aligned with the outer goal.

How does a goal simplify a problem?


My model is that GPT-3 almost certainly is "hiding its intelligence" at least in small ways. For example, if its prompt introduces spelling mistakes, GPT-3 will 'intentionally' continue with more spelling mistakes in what it generates.

Yeah, because its goal is prediction. Within prediction there isn't a right way to write a sentence. It's not a spelling mistake, it's a spelling prediction. (If you want it to not do that, then train it on...predicting the sentence, spelled correctly. Reward correct spelling, with a task of 'seeing through the noise'. You could try going further, and reinforce a particular style, or 'this word is better than that word'.)


Train a model to predict upvotes on Quora, Stack Exchange, and similar question-answering websites. This serves as a function recognizing "intelligent and helpful responses".

Uh, that's not what I'd expect it to do. If you're worried about deception now, why don't you think that'd make it worse? (If nothing else, are you trying to create GPT-Flattery?)


If this procedure works exceedingly well, causing GPT to "wake up" and be a human-level conversation partner or greater, we should be very worried indeed. (Since we wouldn't then know the alignment of the resulting system, and could be virtually sure that it was an inner optimizer of significant power.)

It's not an agent. It's a predictor. (It doesn't want to make paperclips.)

I think you're anthropomorphizing it.

expect mesa-search might be a problem?

What I intended there was "expect mesa-search to happen at all" (particularly, mesa-search with its own goals)

It sounds like there's also a risk of smart lookup tables. That might not be the right terminology, but 'look up tables which contain really effective things', even if the tables themselves just execute and don't change, seems worth pointing out somehow.

Sorry, by "dumb" I didn't really mean much, except that in some sense lookup tables are "not as smart" as the previous things in the list (not in terms of capabilities, but rather in terms of how much internal processing is going on).

How does a goal simplify a problem?

For example, you can often get better results out of RL methods if you include "shaping" rewards, which reward behaviors which you think will be useful in productive strategies, even though this technically creates misalignment and opportunities for perverse behavior. For example, if you wanted an RL agent to go to a specific square, you might do well to reward movement toward that square.
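A minimal sketch of the go-to-a-square example, under assumptions of my own: a grid with a hypothetical goal at `(4, 4)`, Manhattan distance as the measure of "movement toward", and a made-up bonus size. It shows both halves of the claim: the shaped reward gives signal on every step, and it also pays out for behavior that never reaches the goal.

```python
GOAL = (4, 4)  # hypothetical target square


def manhattan(a, b):
    """Grid distance between two squares."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def sparse_reward(state, next_state):
    """The 'true' objective: reward only for arriving at the goal."""
    return 1.0 if next_state == GOAL else 0.0


def shaped_reward(state, next_state, bonus=0.1):
    """Sparse reward plus a small bonus for any step that moves closer."""
    moved_closer = manhattan(next_state, GOAL) < manhattan(state, GOAL)
    return sparse_reward(state, next_state) + (bonus if moved_closer else 0.0)


def total_reward(path, reward_fn):
    """Sum the reward over consecutive pairs of states along a path."""
    return sum(reward_fn(s, t) for s, t in zip(path, path[1:]))


# Pacing back and forth never reaches the goal, yet keeps collecting the bonus.
pacing = [(0, 0), (0, 1)] * 10
pacing_sparse = total_reward(pacing, sparse_reward)
pacing_shaped = total_reward(pacing, shaped_reward)
```

Under the sparse reward the pacing path earns nothing, while under this particular shaped reward it earns a bonus on every forward step. That gap is the kind of "technical misalignment" the shaping reward introduces.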

Similarly, part of the common story about how mesa-optimizers develop is: if they have explicitly represented values, these same kinds of "shaping" values will be adaptive to include, since they guide the search toward useful answers. Without this effect, inner search might not be worthwhile at all, due to inefficiency.

Yeah, because its goal is prediction. Within prediction there isn't a right way to write a sentence. It's not a spelling mistake, it's a spelling prediction. (If you want it to not do that, then train it on...predicting the sentence, spelled correctly. Reward correct spelling, with a task of 'seeing through the noise'. You could try going further, and reinforce a particular style, or 'this word is better than that word'.)

Yes, I agree that GPT's outer objective fn is misaligned with maximum usefulness, and a more aligned outer objective would make it do more of what we would want.

However, I feel like your "if you don't want that, then..." seems to suppose that it's easy to make it outer-aligned. I don't think so.

The spelling example is relatively easy (we could apply an automated spellcheck to all the data, which would have some failure rate of course but is maybe good enough for most situations -- or similarly, we could just apply a loss function for outputs which aren't spelled correctly). But what's the generalization of that?? How do you try to discourage all "deliberate mistakes"? 
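As a rough sketch of what "a loss for outputs which aren't spelled correctly" could look like: the tiny `KNOWN_WORDS` set stands in for a real spellchecker, and `penalized_loss` and its `weight` parameter are hypothetical names, not anything GPT training actually uses. A penalty like this is computed on generated text, so in practice it would probably act as a reward signal or a data filter rather than as a differentiable term.

```python
import re

# Stand-in dictionary; a real spellchecker would be far larger and still imperfect.
KNOWN_WORDS = {"the", "cat", "should", "sit", "on", "mat"}


def misspelling_rate(text, vocabulary=KNOWN_WORDS):
    """Fraction of words in the text that the dictionary does not recognize."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    unknown = sum(1 for word in words if word not in vocabulary)
    return unknown / len(words)


def penalized_loss(prediction_loss, generated_text, weight=1.0):
    """Hypothetical objective: prediction loss plus a spelling penalty."""
    return prediction_loss + weight * misspelling_rate(generated_text)


clean = penalized_loss(2.0, "The cat should sit")
noisy = penalized_loss(2.0, "The cat shoud sit")
```

The "failure rate" shows up immediately: any correctly spelled word missing from the dictionary (a name, say) gets penalized too. And nothing here generalizes to other kinds of deliberate mistakes, which is the harder question.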

Uh, that's not what I'd expect it to do. If you're worried about deception now, why don't you think that'd make it worse? (If nothing else, are you trying to create GPT-Flattery?)

I don't think it would be entirely aligned by any means. My prediction is that it'd be incentivized to reveal information (so you could say it's differentially more "honest" relative to GPT-3 trained only on predictive accuracy). I agree that in the extreme case (if fine-tuned GPT-3 is really good at this) it could end up more deceptive rather than less (due to issues like flattery).

It's not an agent. It's a predictor. (It doesn't want to make paperclips.)

I think you're anthropomorphizing it.

  1. This was meant to be an extreme case.
  2. Why do you suppose it's not an agent? Isn't that essentially the question of inner optimizers? IE, does it get its own goals? Is it just trying to predict?

How do you try to discourage all "deliberate mistakes"?

1. Make something that has a goal. Does AlphaGo make deliberate mistakes at Go? Or does it try to win, and always make the best move* (with, possibly, the limitation that it might not be as good at playing from positions it wouldn't play itself into)?

*This may be different from 'maximize score, or wins long term'. If you try to avoid teaching your opponent how to play better, while seeking out wins, there can be a 'try to meta game' approach - though this might require games to have the right structure, especially in training to create a tournament, rather than game focus. And I would guess it is game focused, rather than tournament.


Why do you suppose it's not an agent? Isn't that essentially the question of inner optimizers? IE, does it get its own goals? Is it just trying to predict?

A fair point. Dealing with this at the level of 'does it have goals' is a question worth asking. I think that it, like AlphaGo, isn't engaging in particularly deliberate action, because I don't think it exists in a way that lets it do that, or learn to do that.


You think of the spelling errors as deception. Another way of characterizing it might be 'trying to speak the lingo'. For example, we might think of it as an agent that, if it chatted with you for a while, and you don't use words like 'aint' a lot, might shift to not use words like that around you. (Is an agent that "knows its audience" deceptive? Maybe yes, maybe no.)

You think that there is a correct way to spell words. GPT might be more agnostic. For example, (it's weird to not put this in terms of prediction) if another version of GPT (GPT-Speller) somehow 'ignored context', or factored it 'better', then we might imagine Speller would spell words right with some probability. You and I understand that 'words are spelled (mostly) one way'. But Speller might come up with words as probability distributions over strings - spelling things right most of the time (if the dataset has them spelled that way most of the time), but always getting them wrong sometimes because it:

  • Thinks that's how words are. (Probability blobs. Most of the time "should" should be spelled "should", but 1% or less it should be spelled "shoud".)
  • Is very, but not completely, certain it's got things right. Even with the idea that there is one right way, there might be uncertainty about what that way is. (I think an intentional agent like us, as people, at some point might ask 'how is this word spelled', or pay attention to scores it gets, and try to adjust appropriately.**)

**Maybe some new (or existing) methods might be required to fix this? The issue of 'imperfect feedback' sounds like something that's (probably) been an issue before - and not just in conjunction with the word 'Goodhart'.
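Here is a toy version of the 'probability blob' picture, using the 99%/1% numbers from the "should"/"shoud" example. The `SPELLINGS` table and the function names are made up for illustration; nothing here is a claim about how GPT represents words internally.

```python
import random
from collections import Counter

# Hypothetical learned distribution over spellings of one word.
SPELLINGS = {"should": {"should": 0.99, "shoud": 0.01}}


def sample_spelling(word, rng):
    """Draw a spelling in proportion to how often it appeared in training."""
    options = SPELLINGS.get(word, {word: 1.0})
    forms = list(options)
    weights = [options[form] for form in forms]
    return rng.choices(forms, weights=weights, k=1)[0]


def most_likely_spelling(word):
    """Always pick the single most probable spelling instead of sampling."""
    options = SPELLINGS.get(word, {word: 1.0})
    return max(options, key=options.get)


rng = random.Random(0)
counts = Counter(sample_spelling("should", rng) for _ in range(10_000))
```

A Speller that samples from the blob will occasionally write "shoud" no matter how well it knows the majority spelling, while one that always takes the most likely form never will. So some of the "always getting them wrong sometimes" behavior may be about how outputs are drawn, not about what the model believes.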


I also lean towards 'this thing was created, and given something like a goal, and it's going to keep doing that goal like thing'. If it 'spells things wrong to fit in' that's because it was trained as a predictor, not a writer. If we want something to write, yeah, figuring out how to train that might be hard. If you want something out of GPT that differs from the objective 'predict' then maybe GPT needs to be modified, if prompting it correctly doesn't work. Given the way it 'can respond to prompts' characterizing it as 'deceptive' might make sense under some circumstances*, but if you're going to look at it that way, training something to do 'prediction' (of original text) and then have it 'write' is systematically going to result in 'deception' because it has been trained to be a chameleon. To blend in. To say what whoever wrote the string it is being tested against at the moment would say. Its abilities are shocking and it's easy to see them in an 'action framework'. However, if it developed a model of the world, and it was possible to factor that out from the goal - then pulling the model out and getting 'the truth' is possible. But the two might not be separable. If trained on say "a flat earther dataset" will it say "the earth is round"? Can it actually achieve insight?

If you want a good writer, train a good writer. I'm guessing garbage in, garbage out, is an AI rule as much as straight up programming.*** If we give something the wrong rewards, the system will be gamed (absent a system (successfully) designed and deployed to not do that).

*i.e., it might have a mind, but it also might not. Rather it might just be that

***More because the AI has to 'figure out' what it is that you want, from scratch.


If GPT, when asked 'is this spelled correctly: [string]', tells us truthfully, then as deception, that's probably not an issue. As far as deception goes...arguably it's 'deceiving' everyone all the time that it is a human (assuming most text in its corpus is written by humans, and most prompts match that), or trying to. If it thinks it's supposed to play the part of someone who is bad at spelling, it might be hard to read.

(I haven't heard of it making any new scientific discoveries*. Though if it hasn't read a lot of papers, it could be trained...)

*This would be surprising, and might change the way I look at it - if a predictor can do that, what else can it do, and is the distinction between an agent and a predictor a meaningful one? Maybe not. Though pre-registration might be key here. If most of the time it just produces awful or mediocre papers, then maybe it's just a 'monkey at a typewriter'.

I'm a bit confused about part of what we're disagreeing on, so, context trace:

I originally said:

My model is that GPT-3 almost certainly is "hiding its intelligence" at least in small ways. For example, if its prompt introduces spelling mistakes, GPT-3 will 'intentionally' continue with more spelling mistakes in what it generates.

Then you said:

Yeah, because its goal is prediction. Within prediction there isn't a right way to write a sentence. It's not a spelling mistake, it's a spelling prediction. (If you want it to not do that, then train it on...predicting the sentence, spelled correctly. Reward correct spelling, with a task of 'seeing through the noise'. You could try going further, and reinforce a particular style, or 'this word is better than that word'.)

Then I said:

Yes, I agree that GPT's outer objective fn is misaligned with maximum usefulness, and a more aligned outer objective would make it do more of what we would want.

However, I feel like your "if you don't want that, then..." seems to suppose that it's easy to make it outer-aligned. I don't think so.

The spelling example is relatively easy (we could apply an automated spellcheck to all the data, which would have some failure rate of course but is maybe good enough for most situations -- or similarly, we could just apply a loss function for outputs which aren't spelled correctly). But what's the generalization of that?? How do you try to discourage all "deliberate mistakes"? 

Then you said:

1. Make something that has a goal. Does AlphaGo make deliberate mistakes at Go? Or does it try to win, and always make the best move* (with, possibly, the limitation that it might not be as good at playing from positions it wouldn't play itself into)?

  1. It seems like the discussion was originally about hidden information, not deliberate mistakes -- deliberate mistakes were just an example of GPT taking information-hiding actions. I spuriously asked how to avoid all deliberate mistakes when what I intended had more to do with hidden information.
  2. The claim I was trying to support in that paragraph was (as stated in the directly preceding paragraph) it isn't easy to make it outer-aligned. AlphaGo isn't outer-aligned.
  3. AlphaGo could be hiding a lot of information, like GPT. In AlphaGo's case, information which AlphaGo doesn't reveal to the user would include a lot of concepts about the state of the game, which aren't revealed to human users easily. This isn't particularly sinister, but, it is hidden information.
  4. A hypothetical more-data-efficient AlphaGo which was trained only on playing humans (rather than self-play) could have an internal psychological model of humans. This would be "inaccessible information". It could also implement deliberate deception to increase its win rate.

I get the vibe that I might be missing a broader point you're trying to make. Maybe something like "you get what you ask for" -- you're pointing out that hiding information like this isn't at all surprising given the loss function, and different loss functions imply different behavior, often in a straightforward way.

If this were your point, I would respond:

  • The point of the inner alignment problem is that, it seems, you don't always get what you ask for.
  • I'm not trying to say it's surprising that GPT would hide things in this way. Rather, this is a way of thinking about how GPT thinks and how sophisticated/coherent its internal world-model is (in contrast to what we can see by asking it questions). This seems like important, but indirect, information about inner optimizers.

You think of the spelling errors as deception. Another way of characterizing it might be 'trying to speak the lingo'. For example, we might think of it as an agent that, if it chatted with you for a while, and you don't use words like 'aint' a lot, might shift to not use words like that around you. (Is an agent that "knows its audience" deceptive? Maybe yes, maybe no.)

You think that there is a correct way to spell words. GPT might be more agnostic.

I'm not sure whether there is any disagreement here. Certainly I tend to think about language differently from that. But I agree that's the purely descriptive view.

I also lean towards 'this thing was created, and given something like a goal, and it's going to keep doing that goal like thing'. If it 'spells things wrong to fit in' that's because it was trained as a predictor, not a writer.

I mean, I agree as a statistical tendency, but are you assuming away the inner alignment problem?

Given the way it 'can respond to prompts' characterizing it as 'deceptive' might make sense under some circumstances*, but if you're going to look at it that way, training something to do 'prediction' (of original text) and then have it 'write' is systematically going to result in 'deception' because it has been trained to be a chameleon. To blend in.

We seem to be in agreement about this.

However, if it developed a model of the world, and it was possible to factor that out from the goal - then pulling the model out and getting 'the truth' is possible. But the two might not be separable. If trained on say "a flat earther dataset" will it say "the earth is round"? Can it actually achieve insight?

Right, this is the question I am interested in. Is there a world model? (To what degree?)