Formal Inner Alignment, Prospectus

by Abram Demski · 24 min read · 12th May 2021

Most of the work on inner alignment so far has been informal or semi-formal (with the notable exception of a little work on minimal circuits). I feel this has resulted in some misconceptions about the problem. I want to write up a large document clearly defining the formal problem and detailing some formal directions for research. Here, I outline my intentions, inviting the reader to provide feedback and point me to any formal work or areas of potential formal work which should be covered in such a document. (Feel free to do that last one without reading further, if you are time-constrained!)


The State of the Subfield

Risks from Learned Optimization (henceforth, RLO) offered semi-formal definitions of important terms, and provided an excellent introduction to the area for a lot of people (and clarified my own thoughts and the thoughts of others who I know, even though we had already been thinking about these things).

However, RLO spent a lot of time on highly informal arguments (analogies to evolution, developmental stories about deception) which help establish the plausibility of the problem. While I feel these were important for motivation, in hindsight I think they've caused some misunderstandings. My interactions with some other researchers have caused me to worry that some people confuse the positive arguments for plausibility with the core problem, and in some cases have exactly the wrong impression about the core problem. This results in mistakenly trying to block the plausibility arguments, which I see as merely illustrative, rather than attacking the core problem.

By no means do I intend to malign experimental or informal/semiformal work. Rather, by focusing on formal theoretical work, I aim to fill a hole I perceive in the field. I am very appreciative of much of the informal/semiformal work that has been done so far, and continue to think that kind of work is necessary for the crystallization of good concepts.

Focusing on the Core Problem

In order to establish safety properties, we would like robust safety arguments ("X will not happen" / "X has an extremely low probability of happening"). For example, arguments that probability of catastrophe will be very low, or arguments that probability of intentional catastrophe will be very low (ie, intent-alignment), or something along those lines.

For me, the core inner alignment problem is the absence of such an argument in a case where we might naively expect it. We don't know how to rule out the presence of (misaligned) mesa-optimizers.

Instead, I see many people focusing on blocking the plausibility arguments in RLO. This strikes me as the wrong direction. To me, these arguments are merely illustrative.

It seems like some people have gotten the impression that when the assumptions of the plausibility arguments in RLO aren't met, we should not expect an inner alignment problem to arise. Not only does this attitude misunderstand what we want (ie, a strong argument that we won't encounter a problem) -- I further think it's actually wrong (because when we look at almost any case, we see cause for concern).

Examples:

The Developmental Story

One recent conversation involved a line of research based on the developmental story, where a mesa-optimizer develops a pseudo-aligned objective early in training (an objective with a strong statistical correlation to the true objective in the training data), but as it learns more about the world, it improves its training score by becoming deceptive rather than by fixing the pseudo-aligned objective. The research proposal being presented to me involved shaping the early pseudo-aligned objective in very coarse-grained ways, which might ensure (for example) a high preference for cooperative behavior, or a low tolerance for risk (catastrophic actions might be expected to be particularly risky), etc. 

This line of research seemed promising to the person I was talking to, because they supposed that while it might be very difficult to precisely control the objectives of a mesa-optimizer or rule out mesa-optimizers entirely, it might be easy to coarsely shape the mesa-objectives.

I responded that for me, the whole point of the inner alignment problem was the conspicuous absence of a formal connection between the outer objective and the mesa-objective, such that we could make little to no guarantees based on any such connection. I proceeded to offer a plausibility argument for a total disconnect between the two, such that even these coarse-grained adjustments would fail.

(Possibly it was a mistake to offer a plausibility argument, because the rest of the discussion focused on this plausibility argument, again distracting from the core problem!)

The Evolutionary Story

Another recent conversation involved an over-emphasis on the evolutionary analogy. This person believed the inner optimizer problem would apply when systems were incentivised to be goal-oriented, as with animals selected for reproductive fitness, or policy networks trained to pursue reward. However, they did not believe it would apply to networks which are simply trained to predict, such as GPT.

Again, this strikes me as ignoring the fundamental problem, that we have little to no idea when mesa-optimizers can arise, that we lack formal tools for the analysis of such questions, and that what formal tools we might have thought to apply, have failed to yield any such results.

Bounding the Problem

My third and final example: in one conversation, someone made a claim which I see as "exactly wrong": that we can somehow lower-bound the complexity of a mesa-optimizer in comparison to a non-agentic hypothesis (perhaps because a mesa-optimizer has to have a world-model plus other stuff, where a regular hypothesis just needs to directly model the world). This idea was used to argue against some concern of mine.

The problem is precisely that we know of no way of doing that! If we did, there would not be any inner alignment problem! We could just focus on the simplest hypothesis that fit the data, which is pretty much what you want to do anyway!

I want to return to that idea. But first, we have to clarify some definitions.

The Formal Problem

I currently see three areas of concern:

  1. Behavioral changes at deployment: The mesa-optimizer appears to perform well during training, but after deployment, competently pursues goals which are unintended and unwanted. For my purposes I will call any such occurrences "treacherous turns" -- although this term is often defined in a way that requires intentional deception, for the moment, I'm not requiring that.
  2. Manipulation of imperfect search: see here and here.
  3. Dangerous hardware manipulation: EG, models which develop row-hammer attacks during training to increase their (supposed) score, exploit security holes to communicate with or manipulate the external environment, etc.

Vanessa terms #1 Cartesian daemons because they obey the intended input/output protocol of the whole system, and #3 non-Cartesian daemons because they violate the protocol. I'm not sure whether/where #2 falls on Vanessa's classification.

For this short write-up I'll focus on #1, although clearly #2 and #3 are also important areas of study.

My semi-formal description of the problem is going to be very close to the explain-like-i'm-12 version:

  1. A rich hypothesis space does contain (malign) mesa-optimizers, even if they occupy a small fraction of the space.
  2. A powerful search over that space effectively restricts outputs to a very small fraction, namely, the points in the space which score very well on the objective function.
  3. Some of the (malign) mesa-optimizers will score very well, so, there will be an overlap.
  4. Therefore, we expect there to be a chance of getting (malign) mesa-optimizers.

Essentially what we want to do is block or weaken this conclusion (concluding that there is no, or only a very small, chance).

(The "(malign)" in parentheses should be removed in approaches which are trying to avoid mesa-optimization entirely, and included in approaches which are only trying to block bad actors. I'll sometimes use the term "malign hypothesis" to indicate hypotheses which produce catastrophic output, regardless of whether they contain a mesa-optimizer in any formal sense.)
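The four-step argument above can be illustrated with a toy simulation (all numbers are hypothetical; scores and the "malign" flag are stand-ins for training loss and hidden misalignment):

```python
import random

random.seed(0)

# Step 1: a rich hypothesis space contains a few malign hypotheses,
# occupying a tiny fraction of the space...
space = [{"score": random.random(), "malign": False} for _ in range(10_000)]
# ...and step 3: some malign hypotheses score just as well as the best benign ones.
space += [{"score": random.uniform(0.9999, 1.0), "malign": True} for _ in range(10)]

# Step 2: a powerful search keeps only the top-scoring sliver of the space.
top = sorted(space, key=lambda h: h["score"], reverse=True)[:100]

malign_fraction_overall = sum(h["malign"] for h in space) / len(space)
malign_fraction_selected = sum(h["malign"] for h in top) / len(top)

print(f"malign fraction of whole space: {malign_fraction_overall:.4f}")
print(f"malign fraction after search:   {malign_fraction_selected:.4f}")
# Step 4: selecting hard on score does nothing to select *against* malignness.
```

The point of the sketch is that the search criterion (score) is blind to the property we care about (malignness), so optimization pressure concentrates probability on the overlap rather than washing it out.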

Now, to make this into a formal problem, we have to define "mesa-optimizer". 

Defining Mesa-Optimization

RLO offered a semi-formal definition of mesa-optimizers, as objects in a search space which themselves conduct an internal search. Whether or not you agree with this definition, it's too vague for formal purposes: how would we define "internal search"?

So, for the purpose of the formal problem, it's necessary to pick an operational definition.

Fortunately, it's possible to select operational definitions which have strong connections to safety concerns. In particular, I will call an operationalization tight if it includes more than necessary, so that we can be sure we're ruling out the undesired behavior when we rule out its operationalization. ("Tight" is supposed to sound like the opposite of "leaky" as in "leaky abstraction".)

As with any research, the choice of operational definitions is going to depend on the specific approach of the researcher, norms of the field, etc. Here are some options that occur to me:

  • Generalization accuracy.
    • This throws away all details of the inner alignment problem, and treats treacherous turns as simple generalization error. This has the advantage of engaging existing machine learning theory. Sufficiently strong generalization guarantees would indeed rule out treacherous turns, so, the operationalization is tight.
  • Importance-weighted generalization accuracy.
    • The concept of generalization accuracy misses important issues. For example, a guaranteed very low frequency of errors might still allow an error to be strategically inserted at a very important time. A formal model could attempt to incorporate this.
  • Significant inference during deployment.
    • The intuition here is that a mesa-optimizer has to do some kind of significant planning, so we can rule them out by eliminating any computations which are "too sophisticated".
    • Possible versions of this include limiting neural network layers, or limiting run-time.
    • An operationalization of this form may or may not be tight; this would require further research/argument.
  • Explicit representation of goals/values.
    • I argue here that this may be relevant. The basic intuition is that in order to competently pursue unintended goals at deployment, there must be a compressed representation of those goals (rather than just compressed policies learned for specific situations).
    • Again, this may not be tight; further research required.

Obviously, this is not exhaustive by any means. Coming up with productive operational definitions is an important part of attacking the formal inner alignment problem.
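To see why plain generalization accuracy can miss what the importance-weighted version is after, consider a toy calculation (all numbers hypothetical):

```python
# Hypothetical setup: 10,000 deployment situations, one of which is
# extremely high-stakes. A "treacherous" model errs only on that one input.
n = 10_000
importance = [1.0] * n          # uniform importance...
importance[0] = 1_000_000.0     # ...except one catastrophe-relevant input

errors = [0.0] * n
errors[0] = 1.0                 # the single, strategically placed error

plain_error = sum(errors) / n
weighted_error = sum(e * w for e, w in zip(errors, importance)) / sum(importance)

print(f"plain generalization error: {plain_error:.6f}")   # looks very safe
print(f"importance-weighted error:  {weighted_error:.4f}")  # looks catastrophic
```

A guarantee on the first quantity says almost nothing about the second, which is the formal gap the second operationalization tries to close.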

The goal is then to provide some kind of strong argument which avoids the operationalized problem.

Why is this a problem?

Why can't we address the inner alignment problem in the same way that we usually address problems of induction?

The most obvious way to avoid mesa-optimizers is to rule them out of the search space or the results of search. Earlier, I said:

[The hope was that] we can somehow lower-bound the complexity of a mesa-optimizer in comparison to a non-agentic hypothesis [...].

The problem is precisely that we know of no way of doing that! 

I was referring to the idea that the first thing one tries when hearing about the inner alignment problem is probably this kind of complexity-bound approach, which leads me to think of "the whole problem" as the fact that these approaches don't seem to work.

So, let's look at these approaches in more detail.

The hope is that we can be confident that mesa-optimizers will be more complex in comparison with benign outputs, so that we can avert mesa-optimization by using machine learning techniques which heavily bias toward less complex models. (This is "the first thing one tries" because it employs standard ideas from machine learning.)

The main theoretical choice one has is, what notion of "complexity"? I will look at three approaches:

  • Description complexity. That is, how compressed is the model? How short is its description?
  • Computational complexity. That is, how fast is the model?
  • Combining the two. If neither works on its own, perhaps some combination of the two would work?

Description Complexity

The idea here would have to be that mesa-optimizers have to have a larger description length than benign hypotheses.

Why you might think this idea would work:

  • You trust Occam's razor. The whole point of Solomonoff induction is to give an objective theory of normatively correct inductive inference. So, almost by definition, the hypotheses favored by short description length should be "good explanations" of the universe, rather than unintended & undesirable hypotheses like mesa-optimizers.
  • A benign hypothesis just has to predict reality. A mesa-optimizer has to predict reality and also represent goal content and planning algorithms. So the mesa-optimizer has to be more complex, right?
  • A benign hypothesis just has to accurately perform the task. A deceptive mesa-optimizer has to also reason about the training process and other details of the world it's in, in order to choose the right moment to execute a treacherous turn. Surely this would be more complex.

Why this idea doesn't seem to work:

  • Even in the best case (where you not only have realizability, but furthermore, know your prior is a pretty good one), Occam's razor should only make you think one of the shortest hypotheses that fits your data is going to be correct, not necessarily the shortest one. So, this kind of thinking does not directly imply a lack of malign mesa-optimization in the shortest hypothesis.
    • In particular, I am invoking the idea that belief that a prior is near-optimal for predictive purposes does not imply trust in a prior for instrumental purposes, because rare predictive errors can be catastrophic for planning and decision making if they're selected adversarially rather than distributed randomly.
    • This suggests a line of research examining conditions under which this does/doesn't hold, and perhaps investigating planning algorithms which don't entirely trust the prior.
  • Mutual information between predicting reality and agency may mean mesa-optimizers don't have to spend extra bits on goal content and planning. In particular, if the reality being predicted contains goal-driven agents, then a mesa-optimizer doesn't have to spend extra bits on these things, because it already needs to describe them in order to predict well.
    • This suggests a line of research into tasks which do not have mutual information with agency. Are there conditions under which we can be confident that mesa-optimizers would have high descriptive complexity?
    • However, it's possible that any hypothesis can be transformed into a treacherous mesa-optimizing variant with a relatively short string of injected code. If this is the case, it guarantees that malign and benign hypotheses are not too different in description-length. This would suggest that it'll be difficult to achieve high confidence that malign hypotheses always have larger description length.
  • Similarly, if a task is rich enough to involve the richness of the world (eg text prediction), then the description of the mesa-optimizer doesn't need to spend extra bits on describing the training process and so on, because it's already modeling complexities of the world such as that. In such a case, there's no apparent simplicity-based argument against deception.
    • Are there tasks which we can confidently assert are simpler than the whole world, so that we can confidently rule out deception in simple models?
  • Also, see the reasoning in The Solomonoff Prior is Malign.
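The "short string of injected code" idea in the bullets above can be made concrete with a hypothetical sketch (the trigger value and the predictor are stand-ins I've invented for illustration):

```python
# A hypothetical illustration of the "injected code" point: any benign
# hypothesis can be turned into a treacherous variant by a short wrapper,
# so the description-length gap between them is a small constant,
# independent of how large the benign model is.

def benign_predictor(x):
    # stand-in for an arbitrarily large, well-performing model
    return x % 7

TRIGGER = 123456789  # some rare input, assumed never to appear in training

# The entire "injected code" is this short wrapper:
def treacherous_variant(x):
    if x == TRIGGER:               # behaves identically off-trigger...
        return -1                  # ...and defects on the trigger
    return benign_predictor(x)

# Identical behavior on ordinary inputs, so training can't tell them apart:
assert all(treacherous_variant(x) == benign_predictor(x) for x in range(1000))
# Defection only on the rare trigger:
assert treacherous_variant(TRIGGER) == -1
```

If something like this transformation is always available, then a simplicity prior can penalize the treacherous variant by at most the (small, constant) length of the wrapper.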

Pure Computational Complexity

Alright, if description complexity can't rule out mesa-optimizers, maybe computational complexity can? I'm going to formalize "fast programs" as minimal circuits, where "circuit size" is understood as the number of logic gates.

Why you might think this would work:

  • Maybe you think the inner alignment problem for Solomonoff induction arises due to "too much compute". After all, Paul's original simulation warfare argument relied on simulating whole alien civilizations. More to the point, if mesa-optimization relies on internal search, well, that takes time, right? If we can find fast programs, maybe we force any "intelligence" out of the programs.
  • Specifically, you might think that the fastest way to do just about anything is essentially a lookup table.

Why it doesn't seem to work:

  • A lookup table is exponential in size (since you'd be checking for each combination of inputs and making an entry for each). So yes, minimal circuits rule out any algorithm with more than exponentially many steps. However, there are lots of concerning, sophisticated algorithms with sub-exponential numbers of steps. The minimal-circuit formalism will prefer these to lookup tables. (Note that even exponential-time calculations might involve circuits smaller than lookup tables.)
    • In particular, textbook AI algorithms are mostly about taking things which naively require exponential computations, and producing useful results in sub-exponential time.
  • Based on this, you should actively expect that minimal circuits implement sophisticated algorithms to solve problems, if such algorithms exist (and are not super-exponential).
  • Separately, we now think minimal circuits can be problematic because both Evan and Vanessa have proposed counterexamples. (Neither are necessarily the final word on the subject, though, since one may choose to work with different assumptions/definitions.)
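The lookup-table point can be made concrete with a toy comparison (parity is a stand-in for any function that has a fast algorithm):

```python
# Why minimal circuits aren't lookup tables: a table over n-bit inputs needs
# 2^n entries, while many "sophisticated" algorithms use far fewer steps.

def parity_table(n):
    # explicit lookup table: one entry per possible input -- exponential in n
    return {i: bin(i).count("1") % 2 for i in range(2 ** n)}

def parity_loop(bits):
    # the "algorithmic" route: linear in n
    acc = 0
    for b in bits:
        acc ^= b
    return acc

n = 16
table = parity_table(n)
print(f"table entries: {len(table)}")   # 65536 = 2^16
print(f"loop steps:    {n}")            # 16

# Sanity check: both compute the same function.
assert table[0b1011] == parity_loop([1, 0, 1, 1])
```

A size-minimizing search over circuits will prefer the small algorithmic route to the exponential table whenever one exists, which is exactly why minimizing circuit size doesn't force stupidity.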

It might seem like we're just not cutting off enough processing power -- the minimal-circuits formalism cuts off mesa-optimization which requires superexponential thinking time, but not sub-exponential. Let's call this the "computational ceiling" of the formalism. Shouldn't we just pick a formalism with a lower computational ceiling?

Maybe! Let's look at a formalism which gives us a knob to adjust, so that we can choose to penalize time more/less harshly.

Mixing Time and Description Complexity

The speed prior is like Solomonoff induction, except that it penalizes computation time. Specifically, taking twice as long is like 1 more bit of description length. Combining time and description complexity in this way is called Levin complexity.

For Levin complexity, time is penalized logarithmically (in comparison to description length). This suggests that we will get an exponential computational ceiling again: if computing n bits costs 2^n time, we're around the break-even point (it might be worth computing the bits, or it might be better to memorize them). So super-exponential calculations won't be worth it, and sub-exponential calculations will, just like we saw with minimal circuits.

However, it's easy to tweak the time/description trade-off. Let's define f-Levin complexity as d + f(t), where d is description length and t is execution time (the choice f(t) = log_2(t) gives us regular Levin complexity). (I am not sure if this generalization has a more standard name in the literature.) This gives us a computational ceiling of f^{-1}(n).
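As a toy sketch of the memorize-vs-compute trade-off, assuming a complexity of the form d + f(t), where d is description length in bits, t is execution time, and f is the time penalty (numbers hypothetical):

```python
import math

def complexity(d, t, f):
    # generalized Levin-style complexity: description length plus time penalty
    return d + f(t)

n = 30  # bits of output we could either memorize or compute

# Option A: memorize the n bits outright (d = n, negligible time).
memorize = complexity(n, 1, math.log2)

# Option B: compute them from a short program in time t.
# Under f = log2, break-even is at t = 2^n -- the exponential ceiling.
compute_at_ceiling = complexity(0, 2 ** n, math.log2)
print(memorize, compute_at_ceiling)   # equal: exactly at break-even

# With a harsher penalty f(t) = sqrt(t), break-even drops to t = n^2,
# a quadratic ceiling: longer computations lose to memorization.
compute_quadratic = complexity(0, n ** 2, math.sqrt)
print(compute_quadratic)              # equal to memorizing again
```

The knob is the choice of f: steeper penalties push the break-even point (the "computational ceiling") down from exponential toward polynomial.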

Why this might work:

  • If the computation of a bit (via search or other algorithmic means) takes too long, you're better off just memorizing it. We can set this ceiling wherever we like. If we think "dangerously intelligent" computations are cubic and above, we can set f(t) = √t, for example. This gives us a quadratic computational ceiling.

Why this might not work:

  • Humans are essentially linear-time algorithms, in the sense that we take the same maximum amount of processing power (ie, that of the human brain) to produce each next output. Anything which produces linearly much output has to do so in at least linear time. So, Levin-complexity can't rule out humanlike intelligence.
  • Furthermore, forcing hypotheses to be "slow" like this may only delay the inevitable. Let's say for the sake of argument that a treacherous turn can be executed at any time, but requires a specific computation C to be executed (say, reversing the cryptographic hash of a critical password). This computation C can simply be slowed down to get past any computational ceiling. Thus, if our notion of (malign) inner optimizer resembles "eventually executes a treacherous turn", then lowering the computational ceiling doesn't actually help.

Elaborating on that last point: I'm assuming that you are asking for a computation that runs "sufficiently long" anyway. The speed prior does have a heavy bias for computations which cut off quickly. However, if we're (say) predicting the weather, we probably want to require that the calculation keeps going on for a while into the future, rather than cutting off. If we're training an AI traffic control system, we'd certainly require that it doesn't just cut off after a few days. The idea, then, is that the treacherous computation C can be spaced out through the time the AI traffic control system is required to remain on-line.
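The "delaying the inevitable" point reduces to simple arithmetic (TOTAL_OPS is a hypothetical cost for the treacherous computation C):

```python
# A hidden computation C with total cost TOTAL_OPS can be spread across
# deployment steps. A per-step compute ceiling only delays completion;
# it never prevents it, so long as the system must stay online.

TOTAL_OPS = 1_000_000   # hypothetical cost of the treacherous computation C

def steps_until_treachery(ops_per_step):
    return -(-TOTAL_OPS // ops_per_step)   # ceiling division

for budget in [10_000, 100, 1]:
    print(f"per-step ceiling {budget:>6}: treachery after "
          f"{steps_until_treachery(budget):>9} steps")
# Lowering the ceiling stretches the timeline, but the turn still arrives.
```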

Taking everything together, I don't think we have a strong basis (yet!) for thinking penalizing runtime makes the situation any better. Some have even argued that it makes the situation worse, because time constraints provide an extra incentive for complex value-driven computations (in order to manage time effectively) as opposed to simple causal hypotheses (such as raw physics). I don't think we have enough information to judge one way or the other.

A Note on the Consensus Algorithm

The eventual write-up, which this is a mere prospectus for, would aim to cover as broad a variety of approaches as possible, and give our current best assessment of the viability of each (ie, report why it might work and why it might not work).

I can't cover everything I'd want to cover in this prospectus, but one more approach sticks out as deserving of treatment. In February of this year, Michael Cohen posted Formal Solution to the Inner Alignment Problem. This used an algorithm which I had heard many others talk about in the context of inner alignment, but usually, with considerably less hope and optimism than Michael Cohen expresses. I'll call the algorithm the consensus algorithm. (I don't know of a better name for it; I've heard it called KWIK learning (Knows-What-It-Knows), but unfortunately KWIK is actually used for a learning setting, not the algorithm that's typically applied in that setting.)

The consensus algorithm works like this:

  • You have a (finite) ensemble of hypotheses which you get from somewhere. For example, a Bayesian could take their top 100 hypotheses, or select a probability threshold of .01% and select all hypotheses above this threshold. A deep-learning person might instead re-run their training algorithm 100 times with different initializations and hyperparameters.
  • You require agreement between these hypotheses in order to output any prediction. In the absence of such agreement, you instead say "I need further feedback".
  • When you get further feedback, you get to edit your list, EG throwing out the hypotheses which were wrong (but also, perhaps, bringing in new replacement hypotheses which now exceed the .01% threshold thanks to a Bayesian update).

Note that I'm not nailing down all details, EG, how closely do we require models to agree? (Do we just require that they agree about which option is most probable? Do we require their probabilities to fall within the same small range?)
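A minimal sketch of the consensus algorithm as described above; the exact-agreement criterion and the stand-in Mod hypotheses are my own illustrative choices, not part of the algorithm's specification:

```python
def consensus_predict(hypotheses, x):
    """Return a prediction only if all live hypotheses agree; else ask."""
    predictions = {h.predict(x) for h in hypotheses}
    if len(predictions) == 1:
        return predictions.pop()
    return None  # "I need further feedback"

def update(hypotheses, x, true_y):
    """On feedback, discard hypotheses that predicted wrongly."""
    return [h for h in hypotheses if h.predict(x) == true_y]

# Tiny demo with stand-in hypotheses:
class Mod:
    def __init__(self, m): self.m = m
    def predict(self, x): return x % self.m

ensemble = [Mod(2), Mod(3)]
assert consensus_predict(ensemble, 6) == 0       # both say 0: output allowed
assert consensus_predict(ensemble, 2) is None    # disagreement: ask for feedback
ensemble = update(ensemble, 2, 2)                # feedback prunes Mod(2)
assert [h.m for h in ensemble] == [3]
```

The safety argument below rests on the `None` branch: a malign hypothesis can block output by dissenting, but it can never push a wrong answer through while a correct, benign hypothesis remains in the ensemble.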

The consensus algorithm allows us to make some nice formal arguments. For example:

  1. Suppose that we think at least one of our top 100 hypotheses is (a) always correct (or at least as accurate as the rest of the top 100) and (b) non-malign.
  2. Then no malign hypothesis will be able to control the output. Malign hypotheses are limited to either agreeing with the non-malign hypothesis, or, blocking any output by spoiling the consensus.

However, other drawbacks seem to spoil the appeal of this result.

  • To state the least of our problems first: this requires a 100x slowdown in comparison with the state-of-the-art deep learning (or whatever) we're layering the consensus algorithm on top of. We're amplifying safety at the expense of speed; if we think the base algorithm has a 10% frequency of yielding safe AGI, then we only need around a 10x slowdown to get a high chance of safety; if it's a .01% chance, we would need something more like a 10,000x slowdown (since we'd want to require consensus of around 10,000 independent hypotheses).
  • Second, to apply this to something like deep learning, we need to think that each run has an independent chance of creating safe AGI. This is probably far from true. Our uncertainty about whether a particular training method produces safe AGI is probably almost entirely correlated. This would further blow up the size of the ensemble required to achieve any confidence.
  • Third, the consensus algorithm requires a strong form of realizability assumption, where you not only assume that our Bayesian space contains the true hypothesis, but furthermore, that it's in the top 100 (or whatever number we choose). This hypothesis has to be really good: we have to think that malign hypotheses never out-guess the benign hypothesis. Otherwise, there's a chance that we eliminate the good guy at some point (allowing the bad guys to coordinate on a wrong answer). But this is unrealistic! The world is big and complex enough that no realistic hypothesis has all the answers.
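The slowdown arithmetic in the first two points is easy to check (assuming independent runs, which the second point argues is unrealistic):

```python
# If each run independently yields a safe AGI with probability p, the chance
# that an ensemble of n runs contains at least one safe AGI is 1 - (1 - p)^n.
# (Using the illustrative numbers from above.)
def p_at_least_one_safe(p, n):
    return 1 - (1 - p) ** n

print(f"{p_at_least_one_safe(0.10, 10):.2f}")        # 10x slowdown, p = 10%
print(f"{p_at_least_one_safe(0.0001, 10_000):.2f}")  # 10,000x slowdown, p = .01%
# Both land near 1 - 1/e ~= 0.63: you need roughly n = 1/p runs for a decent
# chance, and far more than that if the runs are strongly correlated.
```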

Michael Cohen seems to think that restricting to imitation learning makes the realizability assumption realistic:

There are none of the usual embedding problems here—the imitator can just be bigger than the demonstrator that it's modeling.

I disagree. It still doesn't seem realistic to suppose that competitors to the best hypothesis are never lucky. Even if we suppose that my behavior is entirely deterministic, the number of hypotheses you'd have to consider in order for one of them to be an exact copy of my brain would no doubt be huge! Let's conservatively call this 10^100 (suggesting that I have at least 100 neurons, each of which has at least 10 meaningfully different configurations). This would suggest that the consensus algorithm needs googol times the processing power of the human brain to work.

Michael Cohen also says some other stuff about why the proposal doesn't have embedding problems, which I also disagree with. But let's set this aside and get back to listing other problems with the consensus algorithm.

  • Fourth, the consensus algorithm still gives inner optimizers control of when the system asks for more feedback. This in itself could be catastrophic! For example, it could ask at the worst times, when it anticipates the human feedback to be of poor quality, or when asking itself has catastrophic consequences. A silly example would be causing a self-driving car to ask for feedback right when it needs to enact difficult evasive maneuvers to avoid a crash. Obviously we can't put a system like this in charge of extremely time-sensitive decisions like this. But whatever the case, asking for feedback will influence the external world somehow. Malign hypotheses just need to find the right moment, where asking for feedback butterfly-effects its way to a catastrophe.
  • Fifth, although (as Michael Cohen notes) the frequency of asking for feedback will rapidly diminish, the learning algorithm converges to the correct hypothesis extremely slowly. Specifically, if we have an ensemble of programs of description-length at most l, one of which we believe to be correct and benign, the best convergence bound we can give is busy-beaver of l! (For those who don't know about the busy beaver function, this roughly means "uncomputably slow". It's way worse than needing exponentially much data. I discuss why this happens in Subsystem Alignment.) This means we can't solve the previous problem by keeping the system in a safe training environment (eg, keeping it away from time-sensitive tasks, and making sure that it always has a human on-duty to provide feedback when prompted) until it converges. We have to face the fact that it might require human feedback at any point in the future.

The fourth point really seems like the deal-breaker. Malign mesa-optimizers could intelligently leverage this to communicate with the outside world, manipulate the training process, etc.

Conclusion

Has this been useful? Would an expanded and improved version of this be useful?

This is something where I could really use detailed peer-review-like feedback, since the final version of this thing would hopefully be a pretty canonical resource, with standardized terminology and so on.

A weakness of this as it currently stands is that I purport to offer the formal version of the inner optimization problem, but really, I just gesture at a cloud of possible formal versions. I think this is somewhat inevitable, but nonetheless, could probably be improved. What I'd like to have would be several specific formal definitions, together with several specific informal concepts, and strong stories connecting all of those things together.

I'd be glad to get any of the following types of feedback:

  1. Possible definitions/operationalizations of significant concepts.
  2. Ideas about which definitions and assumptions to focus on.
  3. Approaches that I'm missing. I'd love to have a basically exhaustive list of approaches to the problem discussed so far, even though I have not made a serious attempt at that in this document.
  4. Any brainstorming you want to do based on what I've said -- variants of approaches I listed, new arguments, etc.
  5. Suggested background reading.
  6. Nitpicking little choices I made here.
  7. Any other type of feedback which might be relevant to putting together a better version of this.

If you take nothing else away from this, I'm hoping you take away this one idea: the main point of the inner alignment problem (at least to me) is that we know hardly anything about the relationship between the outer optimizer and any mesa-optimizers. There are hardly any settings where we can rule mesa-optimizers out. And we can't strongly argue for any particular connection (good or bad) between outer objectives and inner.


Comments

I'll make a case here that manipulation of imperfect internal search should be considered the inner alignment problem, and all the other things which look like inner alignment failures actually stem from outer alignment failure or non-mesa-optimizer-specific generalization failure.

Example: Dr Nefarious

Suppose Dr Nefarious is an agent in the environment who wants to acausally manipulate other agents' models. We have a model which knows of Dr Nefarious' existence, and we ask the model to predict what Dr Nefarious will do. At this point, we have already failed: either the model returns a correct answer, in which case Dr Nefarious has acausal control over the answer and can manipulate us through it, or it returns an incorrect answer, in which case the prediction is wrong. (More precisely, the distinction is between informative/independent answers, not correct/incorrect.) The only way to avoid this would be to not ask the question in the first place - but if we need to know what Dr Nefarious will do in order to make good decisions ourselves, then we need to run that query.

On the surface, this looks like an inner alignment failure: there's a malign subagent in the model. But notice that it's not even clear what we want in this situation - we don't know how to write down a goal-specification which avoids the problem while also being useful. The question of "what do we even want to do in this sort of situation?" is unambiguously an outer alignment question. It's not a situation where we know what we want but we're not sure how to make a system actually do it; it's a situation where it's not even clear what we want. 

Conversely, if we did have a good specification of what we want in this situation, then we could just specify that in the outer objective. Once that's done, we would still potentially need to solve inner alignment problems in practice, but we'd know how to solve them in principle: do the thing which is globally optimal for our outer objective. The whole point of "having a good specification of what we want" is that the globally-optimal thing should be good.

Point of all this: this supposed "inner alignment failure" can be broken into two parts. One of those parts is a "what do we even want?" question, i.e. an outer alignment problem. The other part is a problem of actually achieving the optimal thing, which is where manipulation of imperfect internal search is relevant. If both of those parts are solved, then the system is aligned.

Generalizing The Example

Another example, this time with explicit acausal trade: our AI uses a Solomonoff-like world model, and a subagent in that model is trying to gain influence. Meanwhile, an (unrelated) nefarious agent in the environment wants to manipulate the AI. So, the subagent and the nefarious agent simulate each other and make an acausal deal: the nefarious agent produces a very specific string of bits in the real world, and the subagent gains weight by perfectly predicting that string. In exchange, the subagent manipulates the AI to help the nefarious agent in some way.

Self-fulfilling prophecies provide a similar, but simpler, class of examples.

In each of these, there is a malign inner agent, but that malign inner agent is only able to manipulate the AI successfully because of some structure in the environment. Or, another way to state it: the malign agent is successful only because the combination of (outer) prior + objective does not handle self-fulfilling prophecies or acausal trade the way we (humans) want them to. These are, in an important sense, outer alignment problems: we have not correctly specified what-we-want; even the global optimum of the outer process suffers from the problem.

Objective Is Only Defined With Prior + Data

One possible objection to this is that "outer alignment" - i.e. specifying what-humans-want - should be more narrowly interpreted. In particular, Evan has argued before that generalization errors resulting from e.g. distribution shift between training data and deployment environment should be considered a separate problem.

I disagree with this. I claim that an objective isn't even well-defined without a distribution; that's part of the type-signature of an objective.

This is easy to see in the case of an expected utility maximizer. When we write "max E[u(X)]", X is a variable in the probabilistic model. It is a thing-in-the-model, not a thing-in-the-world; the world does not necessarily share our ontology.

We could say something similar for any setup which maximizes some expected value on an empirical distribution, i.e. an average over the training data. For instance, maybe we have some labeled images, and we're training a classifier. We may have an objective for which the system does-what-we-want for the original labels, but does not do what we want if we change the objective function to permute the labels before calculating error (i.e. it switches "true" with "false"). Permuting the labels in the objective function is obviously an outer alignment problem - yet we can achieve exactly the same effect by permuting the labels in the dataset instead.
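This equivalence is easy to check concretely. Here is a minimal sketch (the toy predictor and dataset are hypothetical, not from the comment): flipping labels inside the objective function gives exactly the same score as flipping them in the dataset.

```python
def error_rate(predict, data):
    """Fraction of examples the predictor labels incorrectly."""
    return sum(predict(x) != y for x, y in data) / len(data)

# Toy dataset: label is True iff x is even.
data = [(x, x % 2 == 0) for x in range(10)]
predict = lambda x: x % 2 == 0  # a "trained" predictor, perfect on this data

# Option A: permute labels inside the objective (flip y before scoring).
def permuted_objective(p, d):
    return error_rate(p, [(x, not y) for x, y in d])

# Option B: keep the objective unchanged, permute the labels in the dataset.
permuted_data = [(x, not y) for x, y in data]

# The two changes are indistinguishable from the optimizer's perspective.
assert permuted_objective(predict, data) == error_rate(predict, permuted_data)
```

Either change alone turns a predictor that scores perfectly (error 0.0) into one that scores maximally badly (error 1.0), which is the sense in which the objective and the data only carry meaning together.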

Another angle: plenty of ML work uses the exact same objective on different data sets, and obviously they do completely different things. There is no useful sense in which a training objective can be aligned or misaligned, separate from the context of data/prior.

My point is: there is no line between bad training data and bad objective. These problems only make sense when considered together. So, if "bad training objective" is an outer alignment problem, then we also need to consider "bad training data" to be an outer alignment problem in order for our factorization of the problem to work well. (For Bayesian agents, this also extends to the prior.)

The General Argument

Outer objective, training data, and prior (if any) all have to be treated as a unit: changes in one are equivalent to changes in another, and the objective isn't even well-defined outside the ontology of the data/prior. The central outer alignment question of "what do we even want?" has to be answered with both an objective and data/prior, in order for the answer to be well-defined.

If we buy that, then outer alignment (i.e. fully answering the question "what do we want?") implies that the true global optimum of an outer optimizer's search is aligned. So, there's only one type of inner alignment problem which would not be solved by solving outer alignment: manipulation of imperfect search. We can have a good objective+prior+data, but the search may still be imperfect, and malign subagents may arise which manipulate that search.

All that said... there's still an interesting alignment problem which examples like Dr Nefarious or self-fulfilling prophecies or malignness of the Solomonoff prior are pointing to. I claim that inner alignment is not the right way to think about these - it's not the malign inner agents themselves which are the problem. They're just an indicator that we have not correctly specified what-we-want.

This is a great comment. I will have to think more about your overall point, but aside from that, you've made some really useful distinctions. I've been wondering if inner alignment should be defined separately from mesa-optimizer problems, and this seems like more evidence in that direction (ie, the Dr Nefarious example is a mesa-optimization problem, but it's about outer alignment). Or maybe inner alignment just shouldn't be seen as the compliment of outer alignment! Objective quality vs search quality is a nice dividing line, but it doesn't cluster together the problems people have been trying to cluster together.

Haven't read the full comment thread, but on this sentence

Or maybe inner alignment just shouldn't be seen as the compliment of outer alignment!

Evan actually wrote a post to explain that it isn't the complement for him (and not the compliment either :p) 

Right, but John is disagreeing with Evan's frame, and John's argument that such-and-such problems aren't inner alignment problems is that they are outer alignment problems.

So, I think I could write a much longer response to this (perhaps another post), but I'm more or less not persuaded that problems should be cut up the way you say.

As I mentioned in my other reply, your argument that Dr. Nefarious problems shouldn't be classified as inner alignment is that they are apparently outer alignment. If inner alignment problems are roughly "the internal objective doesn't match the external objective" and outer alignment problems are roughly "the outer objective doesn't meet our needs/goals", then there's no reason why these have to be mutually exclusive categories.

In particular, Dr. Nefarious problems can be both.

But more importantly, I don't entirely buy your notion of "optimization". This is the part that would require a longer explanation to be a proper reply. But basically, I want to distinguish between "optimization" and "optimization under uncertainty". Optimization under uncertainty is not optimization -- that is, it is not optimization of the type you're describing, where you have a well-defined objective which you're simply feeding to a search. Given a prior, you can reduce optimization-under-uncertainty to plain optimization (if you can afford the probabilistic inference necessary to take the expectations, which often isn't the case). But that doesn't mean that you do, and anyway, I want to keep them as separate concepts even if one is often implemented by the other.

Your notion of the inner alignment problem applies only to optimization.

Evan's notion of inner alignment applies (only!) to optimization under uncertainty.

I buy the "problems can be both" argument in principle. However, when a problem involves both, it seems like we have to solve the outer part of the problem (i.e. figure out what-we-even-want), and once that's solved, all that's left for inner alignment is imperfect-optimizer-exploitation. The reverse does not apply: we do not necessarily have to solve the inner alignment issue (other than the imperfect-optimizer-exploiting part) at all. I also think a version of this argument probably carries over even if we're thinking about optimization-under-uncertainty, although I'm still not sure exactly what that would mean.

In other words: if a problem is both, then it is useful to think of it as an outer alignment problem (because that part has to be solved regardless), and not also inner alignment (because only a narrower version of that part necessarily has to be solved). In the Dr Nefarious example, the outer misalignment causes the inner misalignment in some important sense - correcting the outer problem fixes the inner problem, but patching the inner problem would leave an outer objective which still isn't what we want.

I'd be interested in a more complete explanation of what optimization-under-uncertainty would mean, other than to take an expectation (or max/min, quantile, etc) to convert it into a deterministic optimization problem.

I'm not sure the optimization vs optimization-under-uncertainty distinction is actually all that central, though. Intuitively, the reason an objective isn't well-defined without the data/prior is that the data/prior defines the ontology, or defines what the things-in-the-objective are pointing to (in the pointers-to-values sense) or something along those lines. If the objective function is f(X, Y), then the data/prior are what point "X" and "Y" at some things in the real world. That's why the objective function cannot be meaningfully separated from the data/prior: "f(X, Y)" doesn't mean anything, by itself.

But I could imagine the pointer-aspect of the data/prior could somehow be separated from the uncertainty-aspect. Obviously this would require a very different paradigm from either today's ML or Bayesianism, but if those pieces could be separated, then I could imagine a notion of inner alignment (and possibly also something like robust generalization) which talks about both optimization and uncertainty, plus a notion of outer alignment which just talks about the objective and what it points to. In some ways, I actually like that formulation better, although I'm not clear on exactly what it would mean.

Trying to lay this disagreement out plainly:

According to you, the inner alignment problem should apply to well-defined optimization problems, meaning optimization problems which have been given all the pieces needed to score domain items. Within this frame, the only reasonable definition is "inner" = issues of imperfect search, "outer" = issues of objective (which can include the prior, the utility function, etc).

According to me/Evan, the inner alignment problem should apply to optimization under uncertainty, which is a notion of optimization where you don't have enough information to really score domain items. In this frame, it seems reasonable to point to the way the algorithm tries to fill in the missing information as the location of "inner optimizers". This "way the algorithm tries to fill in missing info" has to include properties of the search, so we roll search+prior together into "inductive bias".

I take your argument to have been:

  1. The strength of well-defined optimization as a natural concept;
  2. The weakness of any factorization which separates elements like prior, data, and loss function, because we really need to consider these together in order to see what task is being set for an ML system (Dr Nefarious demonstrates that the task "prediction" becomes the task "create a catastrophe" if prediction is pointed at the wrong data);
  3. The idea that my/Evan's/Paul's concern about priors will necessarily be addressed by outer alignment, and so does not need to be solved separately.

Your crux is, can we factor 'uncertainty' from 'value pointer' such that the notion of 'value pointer' contains all (and only) the outer alignment issues? In that case, you could come around to optimization-under-uncertainty as a frame.

I take my argument to have been: 

  1. The strength of optimization-under-uncertainty as a natural concept (I argue it is more often applicable than well-defined optimization);
  2. The naturalness of referring to problems involving inner optimizers under one umbrella "inner alignment problem", whether or not Dr Nefarious is involved;
  3. The idea that the malign-prior problem has to be solved in itself whether we group it as an "inner issue" or an "outer issue";
  4. For myself in particular, I'm ok with some issues-of-prior, such as Dr Nefarious, ending up as both inner alignment and outer alignment in a classification scheme (not overjoyed, but ok with it).

My crux would be, does a solution to outer alignment (in the intuitive sense) really imply a solution to exorcising mesa-optimizers from a prior (in the sense relevant to eliminating them from perfect search)?

It might also help if I point out that well-defined-optimization vs optimization-under-uncertainty is my current version of the selection/control distinction.

In any case, I'm pretty won over by the uncertainty/pointer distinction. I think it's similar to the capabilities/payload distinction Jessica has mentioned. This combines search and uncertainty (and any other generically useful optimization strategies) into the capabilities.

But I would clarify that, wrt the 'capabilities' element, there seem to be mundane capabilities questions and then inner optimizer questions. IE, we might broadly define "inner alignment" to include all questions about how to point 'capabilities' at 'payload', but if so, I currently think there's a special subset of 'inner alignment' which is about mesa-optimizers. (Evan uses the term 'inner alignment' for mesa-optimizer problems, and 'objective-robustness' for broader issues of reliably pursuing goals, but he also uses the term 'capability robustness', suggesting he's not lumping all of the capabilities questions under 'objective robustness'.)

This is a good summary.

I'm still some combination of confused and unconvinced about optimization-under-uncertainty. Some points:

  • It feels like "optimization under uncertainty" is not quite the right name for the thing you're trying to point to with that phrase, and I think your explanations would make more sense if we had a better name for it.
  • The examples of optimization-under-uncertainty from your other comment do not really seem to be about uncertainty per se, at least not in the usual sense, whereas the Dr Nefarious example and maligness of the universal prior do.
  • Your examples in the other comment do feel closely related to your ideas on learning normativity, whereas inner agency problems do not feel particularly related to that (or at least not any more so than anything else is related to normativity).
  • It does seem like there's an important sense in which inner agency problems are about uncertainty, in a way which could potentially be factored out, but that seems less true of the examples in your other comment. (Or to the extent that it is true of those examples, it seems true in a different way than the inner agency examples.)
  • The pointers problem feels more tightly entangled with your optimization-under-uncertainty examples than with inner agency examples.

... so I guess my main gut-feel at this point is that it does seem very plausible that uncertainty-handling (and inner agency with it) could be factored out of goal-specification (including pointers), but this particular idea of optimization-under-uncertainty seems like it's capturing something different. (Though that's based on just a handful of examples, so the idea in your head is probably quite different from what I've interpolated from those examples.)

On a side note, it feels weird to be the one saying "we can't separate uncertainty-handling from goals" and you saying "ok but it seems like goals and uncertainty could somehow be factored". Usually I expect you to be the one saying uncertainty can't be separated from goals, and me to say the opposite.

  • Your examples in the other comment do feel closely related to your ideas on learning normativity, whereas inner agency problems do not feel particularly related to that (or at least not any more so than anything else is related to normativity).

Could you elaborate on that? I do think that learning-normativity is more about outer alignment. However, some ideas might cross-apply.

  • It feels like "optimization under uncertainty" is not quite the right name for the thing you're trying to point to with that phrase, and I think your explanations would make more sense if we had a better name for it.

Well, it still seems like a good name to me, so I'm curious what you are thinking here. What name would communicate better?

  • It does seem like there's an important sense in which inner agency problems are about uncertainty, in a way which could potentially be factored out, but that seems less true of the examples in your other comment. (Or to the extent that it is true of those examples, it seems true in a different way than the inner agency examples.)

Again, I need more unpacking to be able to say much (or update much).

  • The pointers problem feels more tightly entangled with your optimization-under-uncertainty examples than with inner agency examples.

Well, the optimization-under-uncertainty is an attempt to make a frame which can contain both, so this isn't necessarily a problem... but I am curious what feels non-tight about inner agency.

... so I guess my main gut-feel at this point is that it does seem very plausible that uncertainty-handling (and inner agency with it) could be factored out of goal-specification (including pointers), but this particular idea of optimization-under-uncertainty seems like it's capturing something different. (Though that's based on just a handful of examples, so the idea in your head is probably quite different from what I've interpolated from those examples.)

On a side note, it feels weird to be the one saying "we can't separate uncertainty-handling from goals" and you saying "ok but it seems like goals and uncertainty could somehow be factored". Usually I expect you to be the one saying uncertainty can't be separated from goals, and me to say the opposite.

I still agree with the hypothetical me making the opposite point ;p The problem is that certain things are being conflated, so both "uncertainty can't be separated from goals" and "uncertainty can be separated from goals" have true interpretations. (I have those interpretations clear in my head, but communication is hard.)

OK, so.

My sense of our remaining disagreement... 

We agree that the pointers/uncertainty could be factored (at least informally -- currently waiting on any formalism).

You think "optimization under uncertainty" is doing something different, and I think it's doing something close.

Specifically, I think "optimization under uncertainty" importantly is not necessarily best understood as the standard Bayesian thing where we (1) start with a utility function, (2) provide a prior, so that we can evaluate expected value (and 2.5, update on any evidence), (3) provide a search method, so that we solve the whole thing by searching for the highest-expectation element. Many examples of optimization-under-uncertainty strain this model. Probably the pointer/uncertainty model would do a better job in these cases. But, the Bayesian model is kind of the only one we have, so we can use it provisionally. And when we do so, the approximation of pointer-vs-uncertainty that comes out is:

Pointer: The utility function.

Uncertainty: The search plus the prior, which in practice can blend together into "inductive bias".

This isn't perfect, by any means, but, I'm like, "this isn't so bad, right?"

I mean, I think this approximation is very not-good for talking about the pointers problem. But I think it's not so bad for talking about inner alignment.

I almost want to suggest that we hold off on trying to resolve this, and first, I write a whole post about "optimization under uncertainty" which clarifies the whole idea and argues for its centrality. However, I kind of don't have time for that atm.

However, when a problem involves both, it seems like we have to solve the outer part of the problem (i.e. figure out what-we-even-want), and once that's solved, all that's left for inner alignment is imperfect-optimizer-exploitation. The reverse does not apply: we do not necessarily have to solve the inner alignment issue (other than the imperfect-optimizer-exploiting part) at all.

The way I'm currently thinking of things, I would say the reverse also applies in this case.

We can turn optimization-under-uncertainty into well-defined optimization by assuming a prior. The outer alignment problem (in your sense) involves getting the prior right. Getting the prior right is part of "figuring out what we want". But this is precisely the source of the inner alignment problems in the Paul/Evan sense: Paul was pointing out a previously neglected issue about the Solomonoff prior, and Evan is talking about inductive biases of machine learning algorithms (which is sort of like the combination of a prior and imperfect search).

So both you and Evan and Paul are agreeing that there's this problem with the prior (/ inductive biases). It is distinct from other outer alignment problems (because we can, to a large extent, factor the problem of specifying an expected value calculation into the problem of specifying probabilities and the problem of specifying a value function / utility function / etc). Everyone would seem to agree that this part of the problem needs to be solved. The disagreement is just about whether to classify this part as "inner" and/or "outer".

What is this problem like? Well, it's broadly a quality-of-prior problem, but it has a different character from other quality-of-prior problems. For the most part, the quality of priors can be understood by thinking about average error being low, or mistakes becoming infrequent, etc. However, here, this kind of thinking isn't sufficient: we are concerned with rare but catastrophic errors. Thinking about these things, we find ourselves thinking in terms of "agents inside the prior" (or agents being favored by the inductive biases).
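A toy illustration of this distinction, with made-up numbers: average-case statistics can be identical for a predictor that is uniformly slightly wrong and one that is perfect except for a single catastrophic failure.

```python
# Illustrative error magnitudes (hypothetical, not from the comment).
errors_benign = [0.25] * 100         # uniformly small errors
errors_malign = [0.0] * 99 + [25.0]  # perfect, except one rare disaster

avg_benign = sum(errors_benign) / len(errors_benign)
avg_malign = sum(errors_malign) / len(errors_malign)

# Average error -- the usual quality-of-prior metric -- can't tell them apart...
assert avg_benign == avg_malign == 0.25
# ...but the worst case, which is what matters for one-time catastrophic
# errors, differs drastically.
assert max(errors_benign) == 0.25 and max(errors_malign) == 25.0
```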

To what extent "agents in the prior" should be lumped together with "agents in imperfect search", I am not sure. But the term "inner optimizer" seems relevant.

I'd be interested in a more complete explanation of what optimization-under-uncertainty would mean, other than to take an expectation (or max/min, quantile, etc) to convert it into a deterministic optimization problem.

A good example of optimization-under-uncertainty that doesn't look like that (at least, not overtly) is most applications of gradient descent.

  1. The true objective is not well-defined. IE, machine learning people generally can't write down an objective function which (a) spells out what they want, and (b) can be evaluated. (What you want is generalization accuracy for the presently-unknown deployment data.)
  2. So, machine learning people create proxies to optimize. Training data is the start, but then you add regularizing terms to penalize complex theories.
  3. But none of these proxies is the full expected value (ie, expected generalization accuracy). If we could compute the full expected value, we probably wouldn't be searching for a model at all! We would just use the EV calculations to make the best decision for each individual case.

So you can see, we can always technically turn optimization-under-uncertainty into a well-defined optimization by providing a prior, but, this is usually so impractical that ML people often don't even consider what their prior might be. Even if you did write down a prior, you'd probably have to do ordinary ML search to approximate that. Which goes to show that it's pretty hard to eliminate the non-EV versions of optimization-under-uncertainty; if you try to do real EV, you end up using non-EV methods anyway, to approximate EV.
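The proxy structure described in these steps can be sketched as follows (the dataset, regularization weight, and candidate models are all hypothetical):

```python
def proxy_objective(w, train_data, lam=0.1):
    # Empirical loss on the training data: computable, but only a proxy.
    empirical = sum((w * x - y) ** 2 for x, y in train_data) / len(train_data)
    # Regularizer penalizing complex models: a stand-in for inductive bias.
    return empirical + lam * w ** 2

# The quantity we would actually like to optimize -- expected error on the
# unknown deployment distribution -- never appears anywhere in this code.
train = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]
candidates = [1.8, 2.0, 2.2]
best_w = min(candidates, key=lambda w: proxy_objective(w, train))
```

The point being that `proxy_objective` is the entire optimization target; the "true" expected generalization accuracy is never computed, so whether the proxy tracks it is exactly the question at issue.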

The fact that we're not really optimizing EV, in typical applications of gradient descent, explains why methods like early stopping or dropout (or anything else that messes with the ability of gradient descent to optimize the given objective) might be useful. Otherwise, you would only expect to use modifications if they helped the search find higher-value items. But in real cases, we sometimes prefer items that have a lower score on our proxy, when the-way-we-got-that-item gives us other reason to expect it to be good (early stopping being the clearest example of this).
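The early-stopping case can be made concrete with a toy training trace (the loss values are invented for illustration):

```python
# Proxy score (training loss) improves monotonically...
train_loss = [1.0, 0.6, 0.4, 0.3, 0.25, 0.22]
# ...but held-out loss bottoms out at step 3 and then degrades (overfitting).
valid_loss = [1.1, 0.7, 0.5, 0.45, 0.5, 0.6]

# Pure optimization of the proxy keeps the final iterate.
best_by_proxy = min(range(len(train_loss)), key=lambda t: train_loss[t])
# Early stopping keeps an iterate with a strictly worse proxy score, because
# the-way-we-got-it (the validation trajectory) suggests it generalizes better.
best_by_early_stop = min(range(len(valid_loss)), key=lambda t: valid_loss[t])

assert best_by_proxy == 5 and best_by_early_stop == 3
assert train_loss[best_by_early_stop] > train_loss[best_by_proxy]
```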

This in turn means we don't even necessarily convert our problem to a real, solidly defined optimization problem, ever. We can use algorithms like gradient-descent-with-early-stopping just "because they work well" rather than because they optimize some specific quantity we can already compute.

Which also complicates your argument, since if we're never converting things to well-defined optimization problems, we can't factor things into "imperfect search problems" vs "alignment given perfect search" -- because we're not really using search algorithms (in the sense of algorithms designed to get the maximum value), we're using algorithms with a strong family resemblance to search, but which may have a few overtly-suboptimal kinks thrown in because those kinks tend to reduce Goodharting.

In principle, a solution to an optimization-under-uncertainty problem needn't look like search at all.

Ah, here's an example: online convex optimization. It's a solid example of optimization-under-uncertainty, but, not necessarily thought of in terms of a prior and an expectation.

So optimization-under-uncertainty doesn't necessarily reduce to optimization.

I claim it's usually better to think about optimization-under-uncertainty in terms of regret bounds, rather than reduce it to maximization. (EG this is why Vanessa's approach to decision theory is superior.)
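As a sketch of what that looks like: online gradient descent on a stream of convex losses uses no prior and no expectation; its guarantee is a bound on regret against the best fixed decision in hindsight. (The loss sequence and step size below are illustrative.)

```python
def online_gradient_descent(targets, lr=0.1):
    """Play x_t, suffer loss (x_t - target_t)^2, then take a gradient step."""
    x, total_loss = 0.0, 0.0
    for target in targets:
        total_loss += (x - target) ** 2
        x -= lr * 2 * (x - target)  # gradient of the just-revealed loss
    return total_loss

targets = [1.0, 0.0] * 50  # an alternating sequence; no distribution assumed
learner_loss = online_gradient_descent(targets)

# Benchmark: the best *fixed* decision in hindsight (here the mean, x = 0.5).
best_fixed_loss = sum((0.5 - y) ** 2 for y in targets)
regret = learner_loss - best_fixed_loss  # the quantity regret bounds control
```

Nothing here reduces to picking the maximum-expected-value element of anything; the guarantee is purely that regret grows slowly relative to the length of the sequence.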

I'm not sure the optimization vs optimization-under-uncertainty distinction is actually all that central, though. Intuitively, the reason an objective isn't well-defined without the data/prior is that the data/prior defines the ontology, or defines what the things-in-the-objective are pointing to (in the pointers-to-values sense) or something along those lines. If the objective function is f(X, Y), then the data/prior are what point "X" and "Y" at some things in the real world. That's why the objective function cannot be meaningfully separated from the data/prior: "f(X, Y)" doesn't mean anything, by itself.

But I could imagine the pointer-aspect of the data/prior could somehow be separated from the uncertainty-aspect. Obviously this would require a very different paradigm from either today's ML or Bayesianism, but if those pieces could be separated, then I could imagine a notion of inner alignment (and possibly also something like robust generalization) which talks about both optimization and uncertainty, plus a notion of outer alignment which just talks about the objective and what it points to. In some ways, I actually like that formulation better, although I'm not clear on exactly what it would mean.

These remarks generally make sense to me. Indeed, I think the 'uncertainty-aspect' and the 'search aspect' would be rolled up into one, since imperfect search falls under the uncertainty aspect (being logical uncertainty). We might not even be able to point to which parts are prior vs search... as with "inductive bias" in ML. So inner alignment problems would always be "the uncertainty is messed up" -- forcibly unifying your search-oriented view on daemons w/ Evan's prior-oriented view. More generally, we could describe the 'uncertainty' part as where 'capabilities' live.

Naturally, this strikes me as related to what I'm trying to get at with optimization-under-uncertainty. An optimization-under-uncertainty algorithm takes a pointer, and provides all the 'uncertainty'.

But I don't think it should quite be about separating the pointer-aspect and the uncertainty-aspect. The uncertainty aspect has what I'll call "mundane issues" (eg, does it converge well given evidence, does it keep uncertainty broad w/o evidence) and "extraordinary issues" (inner optimizers). Mundane issues can be investigated with existing statistical tools/concepts. But the extraordinary issues seem to require new concepts. The mundane issues have to do with things like averages and limit frequencies. The extraordinary issues have to do with one-time events.

The true heart of the problem is these "extraordinary issues".

While I agree that outer objective, training data and prior should be considered together, I disagree that it makes the inner alignment problem dissolve except for manipulation of the search. In principle, if you could indeed ensure through a smart choice of these three parameters that there is only one global optimum, only "bad" (meaning high loss) local minima, and that your search process will always reach the global optimum, then I would agree that the inner alignment problem disappears.

But answering "what do we even want?" at this level of precision seems basically impossible. I expect that it's pretty much equivalent to specifying exactly the result we want, which we are quite unable to do in general.

So my perspective is that the inner alignment problem appears because of inherent limits into our outer alignment capabilities. And that in realistic settings where we cannot rule out multiple very good local minima, the sort of reasoning underpinning the inner alignment discussion is the best approach we have to address such problems.

That being said, I'm not sure how this view interacts with yours or Evan's, or if this is a very standard use of the terms. But since that's part of the discussion Abram is pushing, here is how I use these terms.

Hm, I want to classify "defense against adversaries" as a separate category from both "inner alignment" and "outer alignment".

The obvious example is: if an adversarial AGI hacks into my AGI and changes its goals, that's not any kind of alignment problem, it's a defense-against-adversaries problem.

Then I would take that notion and extend it by saying "yes interacting with an adversary presents an attack surface, but also merely imagining an adversary presents an attack surface too". Well, at least in weird hypotheticals. I'm not convinced that this would really be a problem in practice, but I dunno, I haven't thought about it much.

Anyway, I would propose that the procedure for defense against adversaries in general is: (1) shelter an AGI from adversaries early in training, until it's reasonably intelligent and aligned, and then (2) trust the AGI to defend itself. I'm not sure we can do any better than that.

In particular, I imagine an intelligent and self-aware AGI that's aligned in trying to help me would deliberately avoid imagining an adversarial superintelligence that can acausally hijack its goals!

That still leaves the issue of early training, when the AGI is not yet motivated to not imagine adversaries, or not yet able. So I would say: if it does imagine the adversary, and then its goals do get hijacked, then at that point I would say "OK yes now it's misaligned". (Just like if a real adversary is exploiting a normal security hole—I would say the AGI is aligned before the adversary exploits that hole, and misaligned after.) Then what? Well, presumably, we will need to have a procedure that verifies alignment before we release the AGI from its training box. And that procedure would presumably be indifferent to how the AGI came to be misaligned. So I don't think that's really a special problem we need to think about.

That still leaves the issue of early training, when the AGI is not yet motivated to not imagine adversaries, or not yet able. So I would say: if it does imagine the adversary, and then its goals do get hijacked, then at that point I would say "OK yes now it's misaligned". (Just like if a real adversary is exploiting a normal security hole—I would say the AGI is aligned before the adversary exploits that hole, and misaligned after.) Then what? Well, presumably, we will need to have a procedure that verifies alignment before we release the AGI from its training box. And that procedure would presumably be indifferent to how the AGI came to be misaligned. So I don't think that's really a special problem we need to think about.

This part doesn't necessarily make sense, because prevention could be easier than after-the-fact measures. In particular,

  1. You might be unable to defend against arbitrarily adversarial cognition, so, you might want to prevent it early rather than try to detect it later, because you may be vulnerable in between.
  2. You might be able to detect some sorts of misalignment, but not others. In particular, it might be very difficult to detect purposeful deception, since it intelligently evades whatever measures are in place. So your misalignment-detection may be dependent on averting mesa-optimizers or specific sorts of mesa-optimizers.

That's fair. Other possible approaches are "try to ensure that imagining dangerous adversarial intelligences is aversive to the AGI-in-training ASAP, such that this motivation is installed before the AGI is able to do so", or "interpretability that looks for the AGI imagining dangerous adversarial intelligences".

I guess the fact that people don't tend to get hijacked by imagined adversaries gives me some hope that the first one is feasible - like, that maybe there's a big window where one is smart enough to understand that imagining adversarial intelligences can be bad, but not smart enough to do so with such fidelity that it actually is dangerous.

But hard to say what's gonna work, if anything, at least at my current stage of general ignorance about the overall training process.

Since you're trying to compile a comprehensive overview of directions of research, I will try to summarize my own approach to this problem:

  • I want to have algorithms that admit thorough theoretical analysis. There's already plenty of bottom-up work on this (proving initially weak but increasingly stronger theoretical guarantees for deep learning). I want to complement it by top-down work (proving strong theoretical guarantees for algorithms that are initially infeasible but increasingly made more feasible). Hopefully eventually the two will meet in the middle.
  • Given feasible algorithmic building blocks with strong theoretical guarantees, some version of the consensus algorithm can tame Cartesian daemons (including manipulation of search) as long as the prior (inductive bias) of our algorithm is sufficiently good.
  • Coming up with a good prior is a problem in embedded agency. I believe I achieved significant progress on this using a certain infra-Bayesian approach, and hopefully will have a post soonish.
  • The consensus-like algorithm will involve a trade-off between safety and capability. We will have to manage this trade-off based on expectations regarding external dangers that we need to deal with (e.g. potential competing unaligned AIs). I believe this to be inevitable, although ofc I would be happy to be proven wrong.
  • The resulting AI is only a first stage that we will use to design the second stage AI; it's not something we will deploy in self-driving cars or such.
  • Non-Cartesian daemons need to be addressed separately. Turing RL seems like a good way to study this if we assume the core is too weak to produce non-Cartesian daemons, so the latter can be modeled as potential catastrophic side effects of using the envelope. However, I don't have a satisfactory solution yet (aside perhaps from homomorphic encryption, but the overhead might be prohibitive).

My third and final example: in one conversation, someone made a claim which I see as "exactly wrong": that we can somehow lower-bound the complexity of a mesa-optimizer in comparison to a non-agentic hypothesis (perhaps because a mesa-optimizer has to have a world-model plus other stuff, where a regular hypothesis just needs to directly model the world). This idea was used to argue against some concern of mine.

The problem is precisely that we know of no way of doing that! If we did, there would not be any inner alignment problem! We could just focus on the simplest hypothesis that fit the data, which is pretty much what you want to do anyway!

I think there would still be an inner alignment problem even if deceptive models were in fact always more complicated than non-deceptive models—i.e. if the universal prior wasn't malign—which is just that the neural net prior (or whatever other ML prior we use) might be malign even if the universal prior isn't (and in fact I'm not sure that there's even that much of a connection between the malignity of those two priors).


Also, I think that this distinction leads me to view “the main point of the inner alignment problem” quite differently: I would say that the main point of the inner alignment problem is that whatever prior we use in practice will probably be malign. But that does suggest that if you can construct a training process that defuses the arguments for why its prior/inductive biases will be malign, then I think that does make significant progress on defusing the inner alignment problem. Of course, I agree that we'd like to be as confident that there's as little malignancy/deception as possible such that just defusing the arguments that we can come up with might not be enough—but I still think that trying to figure out how plausible it is that the actual prior we use will be malign is in fact at least attempting to address the core problem.

I think there would still be an inner alignment problem even if deceptive models were in fact always more complicated than non-deceptive models—i.e. if the universal prior wasn't malign—which is just that the neural net prior (or whatever other ML prior we use) might be malign even if the universal prior isn't (and in fact I'm not sure that there's even that much of a connection between the malignity of those two priors).

If the universal prior were benign but NNs were still potentially malign, I think I would argue strongly against the use of NNs and in favor of more direct approximations of the universal prior. But, I agree this is not 100% obvious; giving up prosaic AI is giving up a lot.

Also, I think that this distinction leads me to view “the main point of the inner alignment problem” quite differently: I would say that the main point of the inner alignment problem is that whatever prior we use in practice will probably be malign.

Hopefully my final write-up won't contain so much polemicizing about what "the main point" is, like this write-up, and will instead just contain good descriptions of the various important problems.

I think there would still be an inner alignment problem even if deceptive models were in fact always more complicated than non-deceptive models—i.e. if the universal prior wasn't malign—which is just that the neural net prior (or whatever other ML prior we use) might be malign even if the universal prior isn't (and in fact I'm not sure that there's even that much of a connection between the malignity of those two priors)

Agree.

I have fairly mixed feelings about this post. On one hand, I agree that it's easy to mistakenly address some plausibility arguments without grasping the full case for why misaligned mesa-optimisers might arise. On the other hand, there has to be some compelling (or at least plausible) case for why they'll arise, otherwise the argument that 'we can't yet rule them out, so we should prioritise trying to rule them out' is privileging the hypothesis. 

Secondly, it seems like you're heavily prioritising formal tools and methods for studying mesa-optimisation. But there are plenty of things that formal tools have not yet successfully analysed. For example, if I wanted to write a constitution for a new country, then formal methods would not be very useful; nor if I wanted to predict a given human's behaviour, or understand psychology more generally. So what's the positive case for studying mesa-optimisation in big neural networks using formal tools?

In particular, I'd say that the less we currently know about mesa-optimisation, the more we should focus on qualitative rather than quantitative understanding, since the latter needs to build on the former. And since we currently do know very little about mesa-optimisation, this seems like an important consideration.

I agree with much of this. I over-sold the "absence of negative story" story; of course there has to be some positive story in order to be worried in the first place. I guess a more nuanced version would be that I am pretty concerned about the broadest positive story, "mesa-optimizers are in the search space and would achieve high scores in the training set, so why wouldn't we expect to see them?" -- and think more specific positive stories are mostly of illustrative value, rather than really pointing to gears that I expect to be important. (With the exception of John's story, which did point to important gears.)

With respect to formalization, I did say up front that less-formal work, and empirical work, is still valuable. However, that's not the sense the post conveyed overall, so I get it. I am concretely trying to convey pessimism about a specific sort of less-formal work: work which tries to block plausibility stories. Possibly you disagree about this kind of work.

WRT your argument for informal work, well, I agree in principle (trying to push toward more formal work myself has so far revealed challenges which I think more informal conceptual work could help with), but I'm nonetheless optimistic at the moment that we can define formal problems which won't be a waste of time to work on. And out of informal work, what seems most interesting is whatever pushes toward formality.

Mesa-optimizers are in the search space and would achieve high scores in the training set, so why wouldn't we expect to see them?

I like this as a statement of the core concern (modulo some worries about the concept of mesa-optimisation, which I'll save for another time).

With respect to formalization, I did say up front that less-formal work, and empirical work, is still valuable.

I missed this disclaimer, sorry. So that assuages some of my concerns about balancing types of work. I'm still not sure what intuitions or arguments underlie your optimism about formal work, though. I assume that this would be fairly time-consuming to spell out in detail - but given that the core point of this post is to encourage such work, it seems worth at least gesturing towards those intuitions, so that it's easier to tell where any disagreement lies.

To me, the post as written seems like enough to spell out my optimism... there are multiple directions for formal work which seem under-explored to me. Well, I suppose I didn't focus on explaining why things seem under-explored. Hopefully the writeup-to-come will make that clear.

Thanks for the post!

Here is my attempt at a detailed peer-review feedback. I admit that I'm more excited by doing this because you're asking it directly, and so I actually believe there will be some answer (which in my experience is rarely the case for my in-depth comments).

One thing I really like is the multiple "failure" stories at the beginning. It's usually frustrating in posts like that to see people argue against position/arguments which are not written anywhere. Here we can actually see the problematic arguments.

I responded that for me, the whole point of the inner alignment problem was the conspicuous absence of a formal connection between the outer objective and the mesa-objective, such that we could make little to no guarantees based on any such connection. I proceeded to offer a plausibility argument for a total disconnect between the two, such that even these coarse-grained adjustments would fail.

I'm not sure if I agree that there is no connection. The mesa-objective comes from the interaction of the outer objective, the training data/environments and the bias of the learning algorithm. So in some sense there is a connection. Although I agree that for the moment we lack a formal connection, which might have been your point.

Again, this strikes me as ignoring the fundamental problem, that we have little to no idea when mesa-optimizers can arise, that we lack formal tools for the analysis of such questions, and that what formal tools we might have thought to apply, have failed to yield any such results.

Completely agreed. I always find such arguments unconvincing, not because I don't see where the people using them are coming from, but because such impossibility results require a way better understanding of what mesa-optimizers are and do than we have.

The problem is precisely that we know of no way of doing that! If we did, there would not be any inner alignment problem! We could just focus on the simplest hypothesis that fit the data, which is pretty much what you want to do anyway!

Agreed too. I always find that weird when people use that argument, because it seems agreed upon in the field for a long time that there are probably simple goal-directed processes in the search spaces. Like I can find a post from Paul's blog in 2012 where he writes:

This discussion has been brief and has necessarily glossed over several important difficulties. One difficulty is the danger of using computationally unbounded brute force search, given the possibility of short programs which exhibit goal-oriented behavior.

 

Defining Mesa-Optimization

There's one approach that you haven't described (although it's a bit close to your last one) and which I am particularly excited about: finding an operationalization of goal-directedness, and just define/redefine mesa-optimizers as learned goal-directed agents. My interpretation of RLO is that it's arguing that search for simple competent programs will probably find a goal-directed system AND that it might have a simple structure "parametrized with a goal" (so basically an inner optimizer). This last assumption was really relevant for making arguments about the sort of architecture likely to be evolved by gradient descent. But I don't think the arguments are tight enough to convince us that the learned goal-directed systems will necessarily have this kind of structure, and the sort of problems mentioned seem just as salient for other goal-directed systems.

I also believe that we're not necessarily that far from having a usable definition of goal-directedness (based on the thing I presented at AISU, changed according to your feedback), but I know that not everyone agrees. Even if I'm wrong about being able to formalize goal-directedness, I'm pretty convinced that the cluster of intuitions around goal-directedness is what should be applied to a definition of mesa-optimization, because I expect inner alignment problems beyond inner optimizers.

The concept of generalization accuracy misses important issues. For example, a guaranteed very low frequency of errors might still allow an error to be strategically inserted at a very important time.

I really like this argument against using generalization. To be clear on whether I understand you, do you mean that even with very limited error, a mesa-optimizer/goal-directed agent could bide its time and use a single well-placed action to make a catastrophic treacherous turn?

Occam's razor should only make you think one of the shortest hypotheses that fits your data is going to be correct, not necessarily the shortest one. So, this kind of thinking does not directly imply a lack of malign mesa-optimization in the shortest hypothesis.

A bit tangential, but this line of argument is exactly why I find research on the loss landscape of a neural net frightening for inner alignment. What people try to prove for ML purposes is that there are no or few "bad" (high-loss) local minima. But they're fine with many "good" (low-loss) local minima, and usually find many of them. Except that this is terrible news for inner alignment, because the more "good" local minima there are, the greater the risk that some of them are deceptive mesa-optimizers.

Mutual information between predicting reality and agency may mean mesa-optimizers don't have to spend extra bits on goal content and planning. In particular, if the reality being predicted contains goal-driven agents, then a mesa-optimizer doesn't have to spend extra bits on these things, because it already needs to describe them in order to predict well.

I hadn't thought of it that way, but that does capture nicely the intuition that any good enough agent for a sufficiently complex task will be able to model humans and deception, among other things. That being said, wouldn't the mesa-optimizer still have to pay the price to maintain two goals at all times, and to keep track of what things mean in relation to both? Or are you arguing that this mutual information means that the mesa-optimizer will already be modeling many goal-directed systems, and so can just reuse that knowledge/information?

Pure Computational Complexity

About penalizing time complexity, I really like this part of RLO (which is sadly missed by basically everyone as it's not redescribed in the intro or conclusion):

Furthermore, in the context of machine learning, this analysis suggests that a time complexity penalty (as opposed to a description length penalty) is a double-edged sword. In the second post, we suggested that penalizing time complexity might serve to reduce the likelihood of mesa-optimization. However, the above suggests that doing so would also promote pseudo-alignment in those cases where mesa-optimizers do arise. If the cost of fully modeling the base objective in the mesa-optimizer is large, then a pseudo-aligned mesa-optimizer might be preferred simply because it reduces time complexity, even if it would underperform a robustly aligned mesa-optimizer without such a penalty.

 

 

Humans are essentially linear-time algorithms, in the sense that we take the same maximum amount of processing power (ie, that of the human brain) to produce each next output. Anything which produces linearly much output has to do so in at least linear time. So, Levin-complexity can't rule out humanlike intelligence.

I don't understand what you're saying. Your first sentence seems to point out that humans are constant-time, not linear time. An algorithm for a fixed size is constant time, after all. The issue here is that we don't have a scaled version of the algorithms humans are solving (analogous to generalized games). So we can't discuss the asymptotic complexity of human-brain algorithms. But maybe you actually have an argument related to that which I missed?

 

One point about the time/description complexity penalty that I feel you don't emphasize enough is that even if there was a threshold under which mesa-optimization doesn't appear, maybe it's just too low to be competitive. That's my main internal reason to doubt complexity penalties as a solution to the emergence of mesa-optimizers.

A Note on the Consensus Algorithm

As someone who has been unconvinced by this proposal as a solution for inner alignment, but didn't take the time to express exactly why, I feel like you did a pretty nice job, and this is probably what I will point people to when they ask about this post.

I also believe that we're not necessarily that far from having a usable definition of goal-directedness (based on the thing I presented at AISU, changed according to your feedback), but I know that not everyone agrees.

Even with a significantly improved definition of goal-directedness, I think we'd be pretty far from taking arbitrary code/NNs and evaluating their goals. Definitions resembling yours require an environment to be given; but this will always be an imperfect environment-model. Inner optimizers could then exploit differences between that environment-model and the true environment to appear benign.

But I'm happy to include your approach in the final document!

Even if I'm wrong about being able to formalize goal-directedness, I'm pretty convinced that the cluster of intuitions around goal-directedness is what should be applied to a definition of mesa-optimization, because I expect inner alignment problems beyond inner optimizers.

Can you elaborate on this?

To be clear on whether I understand you, do you mean that even with very limited error, a mesa-optimizer/goal-directed agent could bide its time and use a single well-placed action to make a catastrophic treacherous turn?

Right. Low total error for, eg, imitation learning, might be associated with catastrophic outcomes. This is partly due to the way imitation learning is readily measured in terms of predictive accuracy, when what we really care about is expected utility (although we can't specify our utility function, which is one reason we may want to lean on imitation, of course). 

But even if we measure quality-of-model in terms of expected utility, we can still have a problem, since we're bound to measure average expected utility wrt some distribution, so utility could still be catastrophic wrt the real world.
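As a toy illustration of that gap (my own construction, with invented numbers and a hypothetical "trigger" state): a policy can score near-perfectly on predictive accuracy over the training distribution while its average utility is catastrophic, because a single rare failure can be arbitrarily costly.

```python
# Toy sketch: near-perfect accuracy, catastrophic average utility.
N = 100_000          # size of the (uniform) state space
TRIGGER = 42         # hypothetical rare state where the policy defects

def imitator(state):
    # Matches the demonstrator everywhere except on the trigger state.
    return "safe" if state != TRIGGER else "catastrophe"

errors = sum(imitator(s) != "safe" for s in range(N))
accuracy = 1 - errors / N            # 0.99999 -- looks fine by predictive metrics

# Utility: +1 for the safe action, -1,000,000 for the catastrophic one.
utility = sum(1 if imitator(s) == "safe" else -1_000_000 for s in range(N)) / N

print(accuracy)      # 0.99999
print(utility)       # -9.00001 -- catastrophic despite a ~0.001% error rate
```

The point is just that the predictive-accuracy metric and the expected-utility metric can come apart arbitrarily far, since the utility of the one erroneous action is unbounded below.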

That being said, wouldn't the mesa-optimizer still have to pay the price to maintain two goals at all times, and to keep track of what things mean in relation to both? Or are you arguing that this mutual information means that the mesa-optimizer will already be modeling many goal-directed systems, and so can just reuse that knowledge/information?

Right. If you have a proposal whereby you think (malign) mesa-optimizers have to pay a cost in some form of complexity, I'd be happy to hear it, but "systems performing complex tasks in complex environments have to pay that cost anyway" seems like a big problem for arguments of this kind. The question becomes where they put the complexity.

I don't understand what you're saying. Your first sentence seems to point out that humans are constant-time, not linear time. An algorithm for a fixed size is constant time, after all.

I meant time as a function of data (I'm not sure how else to quantify complexity here). Humans have a basically constant reaction time, but our reactions depend on memory, which depends on our entire history. So to simulate my response after X data, you'd need O(X).

A memoryless alg could be constant time; IE, even though you have an X-long history, you just need to feed it the most recent thing, so its response time is not a function of X. Similarly with finite context windows.
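The distinction can be sketched concretely (a toy construction of mine, not anything from the thread): a memoryless responder does O(1) work regardless of history length X, while simulating a memory-carrying responder from scratch after X observations costs O(X) total, even though run online it also does constant work per new observation.

```python
def memoryless_response(history):
    # Depends only on the latest observation: O(1) in history length X.
    return history[-1] % 7

def stateful_response(history):
    # Depends on a running summary ("memory") of the entire history.
    # Simulating the response after X observations from scratch is O(X) work;
    # run online, it does only constant work per new observation.
    memory = 0
    for obs in history:
        memory = (31 * memory + obs) % 1_000_003
    return memory % 7

history = [1, 2, 3]
print(memoryless_response(history))   # 3: only the last observation matters
print(stateful_response(history))     # depends on the whole history
```

This is the sense in which a human-like algorithm is "linear-time as a function of data": constant work per step, but the step's output depends on everything seen so far.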

The issue here is that we don't have a scaled version of the algorithms humans are solving (analogous to generalized games). So we can't discuss the asymptotic complexity of human-brain algorithms. 

I agree that in principle we could decode the brain's algorithms and say "actually, that's quadratic time" or whatever; EG, quadratic-in-size-of-working-memory or something. This would tell us something about what it would mean to scale up human intelligence. But I don't think this detracts from the concern about algorithms which are linear-time (and even constant-time) as a function of data. The concern is essentially that there's nothing stopping such algorithms from being faithful-enough human models, which demonstrates that they could be mesa-optimizers.

But maybe you actually have an argument related to that which I missed?

I think the crux here is what we're measuring runtime as-a-function-of. LMK if you still think something else is going on.

About penalizing time complexity, I really like this part of RLO (which is sadly missed by basically everyone as it's not redescribed in the intro or conclusion):

I actually struggled with where to place this in the text. I wanted to discuss the double-edged-sword thing, but, I didn't find a place where it felt super appropriate to discuss it.

One point about the time/description complexity penalty that I feel you don't emphasize enough is that even if there was a threshold under which mesa-optimization doesn't appear, maybe it's just too low to be competitive. That's my main internal reason to doubt complexity penalties as a solution to the emergence of mesa-optimizers.

Right. I just didn't discuss this due to wanting to get this out as a quick sketch of where I'm going. 

Even with a significantly improved definition of goal-directedness, I think we'd be pretty far from taking arbitrary code/NNs and evaluating their goals. Definitions resembling yours require an environment to be given; but this will always be an imperfect environment-model. Inner optimizers could then exploit differences between that environment-model and the true environment to appear benign.

Oh, definitely. I think a better definition of goal-directedness is a prerequisite to be able to do that, so it's only the first step. That being said, I think I'm more optimistic than you on the result, for a couple of reasons:

  • One way I imagine the use of a definition of goal-directedness is to filter against very goal-directed systems. A good definition (if it's possible) should clarify whether weakly goal-directed systems can be competitive, as well as the consequences of different parts and aspects of goal-directedness. You can see that as a sort of analogy to the complexity penalties, although it might risk being similarly uncompetitive.
  • One hope with a definition we can actually toy with is to find some properties of the environments and the behavior of the systems that 1) capture a lot of the information we care about and 2) are easy to abstract. Something like what Alex has done for his POWER-seeking results, where the relevant aspects of the environment are the symmetries it contains.
  • Even arguing for your point, that evaluating goals and/or goal-directedness of actual NNs would be really hard, is made easier by a deconfused notion of goal-directedness.

Can you elaborate on this?

What I mean is that when I think about inner alignment issues, I actually think of learned goal-directed models instead of learned inner optimizers. In that context, the former includes the latter. But I also expect that relatively powerful goal-directed systems can exist without a powerful simple structure like inner optimization, and that we should also worry about those.

That's one way in which I expect deconfusing goal-directedness to help here: by replacing a weirdly-defined subset of the models we should worry about by what I expect to be the full set of worrying models in that context, with a hopefully clean definition.

But even if we measure quality-of-model in terms of expected utility, we can still have a problem, since we're bound to measure average expected utility wrt some distribution, so utility could still be catastrophic wrt the real world.

Maybe irrelevant, but this makes me think of the problem with defining average complexity in complexity theory. You can prove things for some distributions over instances of the problem, but it's really difficult to find a distribution that captures the instances you will meet in the real world. This means that you tend to be limited to worst-case reasoning.

One cool way to address that is through smoothed complexity: the complexity for an instance x is the expected complexity over the distribution on instances created by adding some Gaussian noise to x. I wonder if we can get some guarantees like that, which might improve over worst-case reasoning.
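The inner expectation in that definition is easy to play with directly. Here's a toy sketch (my own construction, not a standard benchmark) where the worst case of a deliberately naive routine collapses under Gaussian perturbation, because noise makes ties vanish almost surely:

```python
import random
from collections import defaultdict

def chaining_cost(xs):
    # Comparisons done by a deliberately naive duplicate-grouper:
    # each element is compared against every earlier element of equal value.
    seen = defaultdict(int)
    cost = 0
    for x in xs:
        cost += seen[x]
        seen[x] += 1
    return cost

def smoothed_cost(xs, sigma, trials=200, seed=0):
    # Monte-Carlo estimate of E[cost(x + sigma * g)] for Gaussian g -- the
    # inner expectation in the smoothed-complexity definition (the outer
    # step then takes a worst case over instances x).
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        total += chaining_cost([x + rng.gauss(0, sigma) for x in xs])
    return total / trials

worst = [0.0] * 100                 # adversarial instance: all duplicates
print(chaining_cost(worst))         # worst case: 100*99/2 = 4950 comparisons
print(smoothed_cost(worst, 0.01))   # ~0: the noise breaks every tie
```

The hope gestured at above would be guarantees of this flavor: bad behavior only on a measure-zero (or measure-tiny) set of instances, so that perturbed worst cases are benign.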

Right. If you have a proposal whereby you think (malign) mesa-optimizers have to pay a cost in some form of complexity, I'd be happy to hear it, but "systems performing complex tasks in complex environments have to pay that cost anyway" seems like a big problem for arguments of this kind. The question becomes where they put the complexity.

Agreed. I don't have such a story, but I think this is a good reframing of the crux underlying this line of argument.

I meant time as a function of data (I'm not sure how else to quantify complexity here). Humans have a basically constant reaction time, but our reactions depend on memory, which depends on our entire history. So to simulate my response after X data, you'd need O(X).

For whatever reason, I thought about complexity depending on the size of the brain, which is really weird. But as complexity depending on the size of the data, I guess this makes more sense? I'm not sure why piling on more data wouldn't make the reliance on memory more difficult (so something like O(X^2) ?), but I don't think it's that important.

I agree that in principle we could decode the brain's algorithms and say "actually, that's quadratic time" or whatever; EG, quadratic-in-size-of-working-memory or something. This would tell us something about what it would mean to scale up human intelligence. But I don't think this detracts from the concern about algorithms which are linear-time (and even constant-time) as a function of data. The concern is essentially that there's nothing stopping such algorithms from being faithful-enough human models, which demonstrates that they could be mesa-optimizers.

Agreed that this a pretty strong argument that complexity doesn't preclude mesa-optimizers.

I actually struggled with where to place this in the text. I wanted to discuss the double-edged-sword thing, but, I didn't find a place where it felt super appropriate to discuss it.

Maybe in "Why this doesn't seem to work" for pure computational complexity?

What I mean is that when I think about inner alignment issues, I actually think of learned goal-directed models instead of learned inner optimizers. In that context, the former includes the latter. But I also expect that relatively powerful goal-directed systems can exist without a powerful simple structure like inner optimization, and that we should also worry about those.

That's one way in which I expect deconfusing goal-directedness to help here: by replacing a weirdly-defined subset of the models we should worry about by what I expect to be the full set of worrying models in that context, with a hopefully clean definition.

Ah, on this point, I very much agree.

I'm not sure why piling on more data wouldn't make the reliance on memory more difficult (so something like O(X^2) ?), but I don't think it's that important.

I was treating the brain as fixed in size, so, having some upper bound on memory. Naturally this isn't quite true in practice (for all we know, healthy million-year-olds might have measurably larger heads if they existed, due to slow brain growth, but either way this seems like a technicality).

I admit that I'm more excited by doing this because you're asking it directly, and so I actually believe there will be some answer (which in my experience is rarely the case for my in-depth comments).

Thanks!

I'm not sure if I agree that there is no connection. The mesa-objective comes from the interaction of the outer objective, the training data/environments and the bias of the learning algorithm. So in some sense there is a connection. Although I agree that for the moment we lack a formal connection, which might have been your point.

Right. By "no connection" I specifically mean "we have no strong reason to posit any specific predictions we can make about mesa-objectives from outer objectives or other details of training" -- at least not for training regimes of practical interest. (I will consider this detail for revision.)

I could have also written down my plausibility argument (that there is actually "no connection"), but probably that just distracts from the point here.

(More later!)

To state the least of our problems first: this requires a 100x slowdown in comparison with the state-of-the-art deep learning (or whatever) we're layering the consensus algorithm on top of

I think you’re imagining deep learning as a MAP-type approach—it just identifies a best hypothesis and does inference with that. Comparing the consensus algorithm with (pure, idealized) MAP, 1) it is no slower, and 2) the various corners that can be cut for MAP can be cut for the consensus algorithm too. Starting with 1), the bulk of the work for either the consensus algorithm or a MAP approach is computing the posterior to determine which model(s) is(are) best. In an analogy to neural networks, it would be like saying most of the work comes from using the model (the forward pass) rather than arriving at the model (the many forward and backward passes in training). Regarding 2), state-of-the-art-type AI basically assumes approximate stationarity when separating a training phase from a test/execution phase. This is cutting a huge corner, and it means that when you think of a neural network running, you mostly think about it using the hypothesis that it has already settled on. But if we compare apples to apples, a consensus algorithm can cut the same corner to some extent. Neither a MAP algorithm nor a consensus algorithm is any better equipped than the other to, say, update the posterior only when the timestep is a power of two. In general, training (be it SGD or posterior updating) is the vast bulk of the work in learning. To select a good hypothesis in the first place you will have already had to consider many more; the consensus algorithm just says to keep track of the runners-up.
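To make the comparison concrete, here is a minimal toy sketch of the top-N consensus idea (the function names, the finite hypothesis class, and the exact disagreement rule are my own illustrative choices, not the actual algorithm): the learner acts autonomously only when its highest-posterior hypotheses agree on the action, and otherwise defers to the demonstrator.

```python
import numpy as np

def demonstrator(x):
    # Stand-in for the human demonstrator (here, it happens to be the true policy).
    return x % 2

def consensus_step(hypotheses, posterior, x, n_top=3):
    # One step of a toy top-N consensus imitation learner: act autonomously
    # only when the N highest-posterior hypotheses agree on the action;
    # otherwise query the demonstrator.  Returns (action, queried).
    top = np.argsort(posterior)[-n_top:]           # indices of the top-N hypotheses
    predictions = {hypotheses[i](x) for i in top}  # distinct actions they predict
    if len(predictions) == 1:                      # unanimous: act without querying
        return predictions.pop(), False
    return demonstrator(x), True                   # disagreement: defer to the human
```

Note that the per-step cost is dominated by maintaining the posterior, which a MAP approach pays as well; the consensus check itself is just N forward passes plus a set comparison.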
 

Third, the consensus algorithm requires a strong form of realizability assumption, where you not only assume that our Bayesian space contains the true hypothesis, but furthermore, that it's in the top 100 (or whatever number we choose). This hypothesis has to be really good: we have to think that malign hypotheses never out-guess the benign hypothesis. Otherwise, there's a chance that we eliminate the good guy at some point (allowing the bad guys to coordinate on a wrong answer). But this is unrealistic! The world is big and complex enough that no realistic hypothesis has all the answers.

I don’t understand what out-guess means. But what we need is that the malign hypotheses don’t have substantially higher posterior weight than the benign ones. As time passes, the probability of this happening is not independent. The result I show about the probability of the truth being in the top set applies to all time, not any given point in time. I don’t know what “no realistic hypothesis has all the answers” means. There will be a best “realistic” benign hypothesis, and we can talk about that one.

Michael Cohen seems to think that restricting to imitation learning makes the realizability assumption realistic

Realistic in theory! Because the model doesn’t need to include the computer. I do not think we can actually compute every hypothesis simpler than a human brain in practice. 

When you go from an idealized version to a realistic one, all methods can cut corners, and I don’t see a reason to believe that the consensus algorithm can’t cut corners just as well. Realistically, we will have some hypothesis-proposing heuristic, strong enough to identify models one of which is accurate enough to generate powerful agency. This heuristic will clearly cast a wide net (if not, how would it magically arrive at a good answer? Its internals would need some hypothesis-generating function). Rather than throwing out the runners-up, the consensus algorithm stores them. The hypothesis-generating heuristic is another attack surface for optimization daemons, and I won’t make any claims for now about how easy or hard it is to prevent such a thing.

to apply this to something like deep learning, we need to think that each run has an independent chance of creating safe AGI

Evan and I talked along these lines for a bit. My basic position is that if “local search” is enough to get to general intelligence, our algorithms will be searching in spaces (or regions) where diverse hypotheses are close. Diverse hypothesis generation is just crucial for general intelligence. I do not advocate training GPT-N with 10^100 different initializations. I don’t think you have to, and I don’t think it would help much.

Fourth, the consensus algorithm still gives inner optimizers control of when the system asks for more feedback. This in itself could be catastrophic! For example, it could ask at the worst times, when it anticipates the human feedback to be of poor quality, or when asking itself has catastrophic consequences. A silly example would be causing a self-driving car to ask for feedback right when it needs to enact difficult evasive maneuvers to avoid a crash.

It’s imitation learning so whatever the person would do when stepping in is what the algorithm learns to do when they don’t. If the human action would be of poor quality if the human were called upon to take the action, then all the reasonable models, benign ones included, will predict that poor action, and the poor action will be taken regardless of whether the human was brought into the loop. If a person would be unable to avoid a crash when stepping in at the last minute, the imitation learner will crash the car too! If 99 hypotheses are correctly predicting the human action, then the 100th hypothesis can correctly predict it too, and the human action is executed, or it can err deliberately, in which case the human is called and the human action is executed anyway. You might have to modify this protocol a bit before using it to learn human policies for operating heavy machinery.

Fifth, although (as Michael Cohen notes) the frequency of asking for feedback will rapidly diminish, the learning algorithm converges to the correct hypothesis extremely slowly… We have to face the fact that it might require human feedback at any point in the future.

Yeah this feels like a small cost to me. One person can be doing this for many instances at once. If you want to mimic a human that has been attentive for a given period of time (because you want to know how a human drives a car, having already gotten their bearings previously, rather than how a human drives a car having just been transported into it) you can know a few timesteps in advance if there is any likely circumstance in which you will need to query the demonstrator, and you can use the lead time to get the demonstrator acquainted. The query complexity will go up of course because there will be false alarms, but this is a totally surmountable problem for active learning.

The fourth point [controlling when to ask for more feedback] really seems like the deal-breaker. Malign mesa-optimizers could intelligently leverage this to communicate with the outside world, manipulate the training process, etc.

A malign mesa-optimizer has one moment to err, and then it will be discarded from the posterior. There’s not much it could say to us with a well-timed "now!". If multiple mesa-optimizers coordinate around a message, any mesa-optimizer not on board with this mission could disrupt the message. If you like, we could add a 1/million chance of querying to every timestep. Finally, with butterfly-effects, I don’t think that even a billion well-timed asks for help could precipitate human extinction. This is a bit all over the map, and this may be the point you want to press, but if you do, I think it is worth logging that if this is the only attack surface left, we shouldn't forget to put a stake in the ground marking "lots of progress". If you believe that a single malign world-model could derail us if we use a Solomonoff predictor, but a consensus predictor would only be derailed by a supermajority of malign world-models, aligned with each other, that is a qualitative difference.

My third and final example: in one conversation, someone made a claim which I see as "exactly wrong": that we can somehow lower-bound the complexity of a mesa-optimizer in comparison to a non-agentic hypothesis (perhaps because a mesa-optimizer has to have a world-model plus other stuff, where a regular hypothesis just needs to directly model the world). This idea was used to argue against some concern of mine.

The problem is precisely that we know of no way of doing that! If we did, there would not be any inner alignment problem! We could just focus on the simplest hypothesis that fit the data, which is pretty much what you want to do anyway!

Maybe this was someone else, but it could have been me. I think MAP probably does solve the inner alignment problem in theory, but I don’t expect to make progress resolving that question, and I’m interested in hedging against being wrong. Where you say, “We know of no way of doing that” I would say, “We know of ways that might do that, but we’re not 100% sure”. On my to-do list is to write up some of my disagreements with Paul’s original post on optimization daemons in the Solomonoff prior (and maybe with other points in this article). I don’t think it’s good to argue from the premise that a problem is worth taking seriously, and then see what follows from the existence of that problem, because a problem can exist with 10% probability and be worth taking seriously, but one might get in trouble embedding its existence too deeply in one’s view of the world, if it is still on balance unlikely. That’s not to say that most people think Paul’s conclusions are <90% likely, just that one might.

Thanks for the extensive reply, and sorry for not getting around to it as quickly as I replied to some other things!

I am sorry for the critical framing, in that it would have been more awesome to get a thought-dump of ideas for research directions from you, rather than a detailed defense of your existing work. But of course existing work must be judged, and I felt I had remained quiet about my disagreement with you for too long.

Comparing the consensus algorithm with (pure, idealized) MAP, 1) it is no slower, and 2) the various corners that can be cut for MAP can be cut for the consensus algorithm too. 

It's a fair point that it's no slower than idealized MAP. But the most important corner cut by deep learning is that deep learning represents just one hypothesis at a time, searching the space by following a gradient rather than by explicitly comparing options. The question is, how can we cut the same corner for the consensus algorithm, which needs to compare the outputs of many hypotheses?

In some settings, this is possible: for sufficiently simple hypothesis spaces, we can check consensus without explicitly computing a bunch of hypotheses. However, for deep learning, it seems rather difficult.

So, it seems like the best we can expect to do for deep learning is to train and run 100 hypotheses (or whatever number). This is a huge approximation in terms of MAP (since we have no guarantees that we are finding the 100 most probable, or anything), but we can naturally re-frame the consensus-alg guarantee in terms of frequency-of-malign-results for the NN training (rather than an assumption about at least 1 of the 100 most probable hypotheses being non-malign). 

But this still means that, for a consensus of N hypotheses, the consensus algorithm will be N times slower (in terms of both training time and inference time). I expect N to be quite large, for reasons similar to what I said in the post: not only do we have to think N is large enough that at least one of the hypotheses is benign, but also we have to think that the benign hypothesis is at least as capable as any of the malign hypotheses (because otherwise it could get unlucky and be eliminated). For the purpose of imitation learning, this means we think one of the hypotheses has exactly learned to imitate a human. Since this problem is going to be quite under-determined even with quite large amounts of data, it seems like N needs to be large enough to chance upon it reliably.

Starting with 1), the bulk of the work for either the consensus algorithm or a MAP approach is computing the posterior to determine which model(s) is(are) best. In an analogy to neural networks, it would be like saying most of the work comes from using the model (the forward pass) rather than arriving at the model (the many forward and backward passes in training). Regarding 2), state-of-the-art-type AI basically assumes approximate stationarity when separating a training phase from a test/execution phase. This is cutting a huge corner, and it means that when you think of a neural network running, you mostly think about it using the hypothesis that it has already settled on. But if we compare apples to apples, a consensus algorithm can cut the same corner to some extent. Neither a MAP algorithm nor a consensus algorithm is any better equipped than the other to, say, update the posterior only when the timestep is a power of two.

My main point here is: this isn't the most important corner I see deep learning cutting, as I described above.

In general, training (be it SGD or posterior updating) is the vast bulk of the work in learning. To select a good hypothesis in the first place you will have already had to consider many more; the consensus algorithm just says to keep track of the runners-up.

"Keeping track of runner-ups" in the MAP case makes a lot of sense. But for the deep learning case, it sounds like you are suggesting that we do consensus on part of the path that a single training run takes in hypothesis space. This seems like a pretty bad idea, for several reasons:

  1. They will all be pretty similar, so getting consensus doesn't tell us much. We generally have no reason to assume that some point along the path will be benign -- undercutting the point of the consensus algorithm.
  2. The older parts of the path will basically be worse, so if you keep a lot of path, you get a lot of not-very-useful failures of consensus.
  3. Lottery-ticket research suggests that if a malign structure is present in the end, then precursors to it will be present at the beginning.

So it seems to me that you at least need to do independent training runs (w/ different random initializations) for the different models which you are checking consensus between, so that they are "independent" in some sense (perhaps most importantly, drawing different lottery tickets).

However, running the same training algorithm many times may not realistically explore the space enough. We sort of expect the same result from the same training procedure. Sufficiently large models will contain malign lotto tickets with high probability (so we can't necessarily argue from "one of these N initializations almost certainly lacks a malign lotto ticket" without very high N). The gradient landscape contains the same demons; maybe the chances of being pulled into them during training are just quite high. All of this suggests that N may need to be incredibly high, or other measures may need to be taken to ensure that the consensus is taken between a greater variety of hypotheses than what we get from re-running training.
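As a toy illustration of the re-run-training version (my own framing, with a trivial least-squares fit standing in for a full deep learning training run), consensus between independently-seeded runs looks like:

```python
import numpy as np

def train(seed):
    # Stand-in for an independently-initialized training run: fit y = 2x
    # from noisy data.  Note the "init" barely matters here, which is
    # exactly the worry about reruns exploring too little of the space.
    rng = np.random.default_rng(seed)
    x = rng.normal(size=100)
    y = 2 * x + rng.normal(scale=0.1, size=100)
    w = float(np.sum(x * y) / np.sum(x * x))  # least-squares slope
    return lambda q, w=w: w * q

models = [train(seed) for seed in range(5)]

def ensemble_act(models, q, tol=0.1):
    # Act autonomously only when all runs agree to within tol;
    # otherwise defer to the human.  (Toy rule of my own choosing.)
    preds = np.array([m(q) for m in models])
    if preds.max() - preds.min() < tol:
        return float(preds.mean()), False
    return None, True
```

Because the runs share the training distribution (and, per the lottery-ticket worry, possibly the same malign structure), their agreement carries much less evidential weight than agreement among genuinely independent hypotheses would.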

I don’t understand what out-guess means. But what we need is that the malign hypothesis don’t have substantially higher posterior weight than the benign ones. 

Right, that's what I meant. Suppose we're trying to imitation-learn about Sally. Sally has a bunch of little nuances to her personality. For example, she has Opinions about flowers, butter, salt.... and a lot of other little things, which I'm supposing can't be anticipated from each other. I'm suggesting that no Bayesian hypothesis can get all of those things right from the get go. So imagine that after a while, the top 100 hypotheses are all pretty good at modeling Sally in typical situations, but each one has a different "sally secret" -- a different lucky guess about one of these little things.

In particular, a benign hypothesis (let's call it Ted) knows about the flower thing, and a malign hypothesis (Jerry) knows about a butter thing.

Unexpectedly, the butter thing becomes a huge focus of Sally's life for a while. Ted falls far out of favor compared to Jerry, since Ted just didn't see this coming. Maybe Ted updates pretty quickly, but it's too late, Ted has lost a bunch of Bayes points.

Whereas if the flower thing had come up, instead, we could consider ourselves lucky; Ted would still be in the running.

With Ted out of the running, the top 100 hypotheses are now all malign, and can coordinate some sort of treacherous turn.

That's the general idea I had in mind. The point is that N has to be high enough that we don't expect this to happen at any point.
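The way Ted falls out of the top set can be shown with toy numbers (all of my own choosing): during the butter-heavy stretch, Jerry's per-observation likelihood beats Ted's, and the posterior ratio collapses geometrically.

```python
# Ted (benign) never anticipated the butter thing; Jerry (malign) did.
# Per-observation likelihoods during the butter-heavy stretch (toy values):
p_ted, p_jerry = 0.5, 0.5   # start with equal posterior weight
for _ in range(10):          # ten butter-heavy observations
    p_ted *= 0.3             # Ted keeps being surprised
    p_jerry *= 0.9           # Jerry saw it coming
    z = p_ted + p_jerry
    p_ted, p_jerry = p_ted / z, p_jerry / z  # renormalize the posterior
# After ten observations the ratio has shrunk by (0.3/0.9)^10, so Ted's
# weight is far below any reasonable top-set threshold.
```

Even if Ted "updates" and matches Jerry from then on, the lost Bayes points are never recovered, which is the sense in which one unlucky stretch can permanently evict the benign hypothesis.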

As time passes, the probability of this happening is not independent. The result I show about the probability of the truth being in the top set applies to all time, not any given point in time. I don’t know what “no realistic hypothesis has all the answers” means. There will be a best “realistic” benign hypothesis, and we can talk about that one.

Why will there be one best? That's the realizability assumption. There is not necessarily a unique model with lowest Bayes loss. Another way of stating this is that Bayesian updates lack a convergence guarantee; hypotheses can oscillate forever as to which is on top. (This is one of the classic frequentist critiques of Bayesianism.) That's the formal thing that my "flowers vs butter" story about Sally is supposed to point at.

We can do better with logical induction or infrabayesianism. But I'm still leery of consensus-type approaches for those on other grounds.
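The non-convergence point can be made concrete with toy numbers (mine): with no true hypothesis in the class, data that alternates between regimes each hypothesis handles better makes the posterior leader flip indefinitely.

```python
import math

# Two misspecified hypotheses H1, H2 (neither is the truth).  The data
# alternates between regimes; in each regime the favored hypothesis
# assigns likelihood 0.8 per observation and the other 0.2.
# (Toy regime lengths and likelihoods of my own choosing.)
log_odds = 0.0                         # log P(H1|data) - log P(H2|data)
step = math.log(0.8) - math.log(0.2)   # per-observation log-likelihood ratio
leaders = []
for regime_len, favored in [(5, 1), (12, 2), (12, 1), (12, 2)]:
    for _ in range(regime_len):
        log_odds += step if favored == 1 else -step
        leaders.append(1 if log_odds > 0 else 2)
# The identity of the top hypothesis keeps flipping; with infinitely many
# such regime switches, it never settles.
```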

Realistic in theory! Because the model doesn’t need to include the computer. I do not think we can actually compute every hypothesis simpler than a human brain in practice. 

When you go from an idealized version to a realistic one, all methods can cut corners, and I don’t see a reason to believe that the consensus algorithm can’t cut corners just as well.

Haha :) OK, I misinterpreted you.

But the idealized issues (can a bayesian hypothesis model the computer it is running on?) have practical analogues (can we expect hypothesis generation to produce one model which is uniquely best, or will different models simply "know different things"?). So when I judge an idealized algorithm I think about those practical analogues, and whether they seem to suggest problems for realistic approximations.

In particular, you want your safety argument to translate well, since otherwise, what's the point?

And for the consensus algorithm, I've already explained why I do think it's particularly bad in this way.

Evan and I talked along these lines for a bit. My basic position is that if “local search” is enough to get to general intelligence, our algorithms will be searching in spaces (or regions) where diverse hypotheses are close. Diverse hypothesis generation is just crucial for general intelligence. I do not advocate training GPT-N with 10^100 different initializations. I don’t think you have to, and I don’t think it would help much.

Right, so what I'm getting is that we can think of GPT-N as already containing a large ensemble. But how would you check consensus?

It’s imitation learning so whatever the person would do when stepping in is what the algorithm learns to do when they don’t. If the human action would be of poor quality if the human were called upon to take the action, then all the reasonable models, benign ones included, will predict that poor action, and the poor action will be taken regardless of whether the human was brought into the loop. If a person would be unable to avoid a crash when stepping in at the last minute, the imitation learner will crash the car too! If 99 hypotheses are correctly predicting the human action, then the 100th hypothesis can correctly predict it too, and the human action is executed, or it can err deliberately, in which case the human is called and the human action is executed anyway. You might have to modify this protocol a bit before using it to learn human policies for operating heavy machinery.

So, what, the human has to act like they're driving the car forever, even though the AI autopilot is really good? If not, then there are many scenarios where the human would do well if driving normally but would crash if they were in the middle of sipping their drink when the feedback call unexpectedly came.

In terms of evaluating this on theoretical terms, the feedback call button is an output channel that malign stuff has access to. This feels kind of like "case closed" to me; I take it as a given that a sufficiently intelligent malign actor could exploit this.

If you told me "actually there is no feedback call button, the human is constantly monitored, the only difference is whether the data is used" then I'd concede that point; but then it seems like we're stuck driving the car forever.

If you want to mimic a human that has been attentive for a given period of time (because you want to know how a human drives a car, having already gotten their bearings previously, rather than how a human drives a car having just been transported into it) you can know a few timesteps in advance if there is any likely circumstance in which you will need to query the demonstrator, and you can use the lead time to get the demonstrator acquainted. The query complexity will go up of course because there will be false alarms, but this is a totally surmountable problem for active learning.

Is this procedure supposed to strike me as so airtight that an inner optimizer could never exploit it?

A malign mesa-optimizer has one moment to err, and then it will be discarded from the posterior. There’s not much it could say to us with a well-timed "now!". If multiple mesa-optimizers coordinate around a message, any mesa-optimizer not on board with this mission could disrupt the message. If you like, we could add a 1/million chance of querying to every timestep. Finally, with butterfly-effects, I don’t think that even a billion well-timed asks for help could precipitate human extinction. This is a bit all over the map, and this may be the point you want to press, but if you do, I think it is worth logging that if this is the only attack surface left, we shouldn't forget to put a stake in the ground marking "lots of progress". If you believe that a single malign world-model could derail us if we use a Solomonoff predictor, but a consensus predictor would only be derailed by a supermajority of malign world-models, aligned with each other, that is a qualitative difference.

This strikes me as a disorganized mix of arguments, and I've kind of run out of time atm to write this response, sorry! I will think more on what you have said.

Maybe this was someone else, but it could have been me.

I like the plausible deniability my anonymised examples gave me ;3

I felt I had remained quiet about my disagreement with you for too long

Haha that's fine. If you don't voice your objections, I can't respond to them!

I think let's step back for a second, though. Suppose you were in the epistemic position "yes, this works in theory, with the realizability assumption, with no computational slowdown over MAP, but having spent 2-10 hours trying to figure out how to distill a neural network's epistemic uncertainty/submodel-mismatch, and having come up blank..." what's the conclusion here? I don't think it's "my main guess is that there's no way to apply this in practice".  Even if you had spent all the time since my original post trying to figure out how to efficiently distill a neural network's epistemic uncertainty, it's potentially a hard problem! But it also seems like a clear problem, maybe even tractable. See Taylor (2016) section 2.1--inductive ambiguity identification. If you were convinced that AGI will be made of neural networks, you could say that I have reduced the problem of inner alignment to the problem of diverse-model-extraction from a neural network, perhaps allowing a few modifications to training (if you bought the claim that the consensus algorithm is a theoretical solution). I have never tried to claim that analogizing this approach to neural networks will be easy, but I don't think you want to wait to hear my formal ideas until I have figured out how to apply them to neural networks; my ideal situation would be that I figure out how to do something in theory, and then 50 people try to work on analogizing it to state-of-the-art AI (there are many more neural network experts out there than AIXI experts). My less ideal situation is that people provisionally treat the theoretical solution as a dead end, right up until the very point that a practical version is demonstrated.

If it seemed like solving inner alignment in theory was easy (because allowing yourself an agent with the wherewithal to consider "unrealistic" models is such a boon), and there were thus lots of theoretical solutions floating around, any given one might not be such a strong signal: "this is the place to look for realistic solutions". But if there's only one floating around, that's a very strong signal that we might be looking in a fundamental part of the solution space. In general, I think the most practical place to look for practical solutions is near the best theoretical one, and 10 hours of unsuccessful search isn't even close to the amount of time needed to demote that area from "most promising".

I think this covers my take on a few of your points, but some of your points are separate. In particular, some of them bear on the question of whether this really is an idealized solution in the first place.

With Ted out of the running, the top 100 hypotheses are now all malign, and can coordinate some sort of treacherous turn.

I think the question we are discussing here is: "yes, with the realizability assumption, existence of a benign model in the top set is substantially correlated over infinite time, enough so that all we need to look at is the relative weight of malign and benign models, BUT is the character of this correlation fundamentally different without the realizability assumption?" I don't see how this example makes that point. If the threshold of "unrealistic" is set in such a way that "realistic" models will only know most things about Sally, then this should apply equally to malign and benign models alike. (I think your example respects that, but just making it explicit). However, there should be a benign and malign model that knows about Sally's affinity for butter but not her allergy to flowers, and a benign and a malign model that knows the opposite. It seems to me that we still end up just considering the relative weight of benign and malign models that we might expect to see.

(A frugal hypothesis generating function instead of a brute force search over all reasonable models might miss out on, say, the benign version of the model that understands Sally's allergies; I do not claim to have identified an approach to hypothesis generation that reliably includes benign models. That problem could be one direction in the research agenda of analogizing this approach to state-of-the-art AI. And this example might also be worth thinking about in that project, but if we're just using the example to try to evaluate the effect of just removing the realizability assumption, but not removing the privilege of a brute search through reasonable models, then I stand by the choice to deem this paragraph parenthetical).

Why will there be one best? That's the realizability assumption. There is not necessarily a unique model with lowest bayes loss. Another way of stating this is that Bayesian updates lack a convergence guarantee; hypotheses can oscillate forever as to which is on top.

Yeah I was thinking that the realistic setting was a finite length setting, with the one best being the best at the end. (And if it is best at the end, you can put a lower bound on how low its posterior weight ever was, since it's hard to recover from having vanishingly small weight, and then alpha just needs to be set to include that). But fair enough to be interested in an infinite lifetime with a finite model class that does not include the truth. So yeah, a model's ranking can oscillate forever, although I think intelligent systems won't really do this in practice? I think in an infinite lifetime, it is reasonable in practice to assume that if something is generally intelligent, it has some sort of ability to generate new hypotheses over time. And if you have two hypotheses about Sally, one which explains her relationship to butter, and one which explains her relationship to flowers, it's pretty easy to generate another that combines these. So I'm not sure how instructive it is to consider the finite model class/infinite horizon Bayesian sequence prediction as a model for state-of-the-art AI.

So, what, the human has to act like they're driving the car forever, even though the AI autopilot is really good?

The suggestion I make later is that a few timesteps in advance, the human is told to start acting like they're driving the car, but they're not controlling it yet, and then comes the timestep when autopilot actually defers to them, and that's the behavior that gets recorded. Your response is

Is this procedure supposed to strike me as so airtight that an inner optimizer could never exploit it?

That proposal was making a different point--it was refuting exactly the concern that the human would have to act like they're driving the car forever. This protocol just defends active imitation learning in general as potentially viable for the operation of heavy machinery. Assume the demonstrator is sitting in a driving simulator, and all the relevant info shows up on their screens; this allows a small team to be the demonstrators for many autopilots simultaneously. It's not a question of whether an inner optimizer could exploit this protocol by finding the most inconvenient time for the human to be called in, and then querying the human--it's much worse than that! If there's a bad time for the human driver to be called in, then the autopilot will fail at that time regardless; the inner optimizer doesn't even have to arrange an actual query. So there's nothing to "exploit" really. And defending against the "much worse" version, I do think that the policy of a human driver with 5 seconds of acquaintance with the current state does not crash the car at a rate significantly higher than a real human driver. Also, it may be worth saying that I don't think we really need general intelligence to operate heavy machinery at human level, so I'm not sure the heavy machinery case is very important to consider in the first place.

In terms of evaluating this on theoretical terms, the feedback call button is an output channel that malign stuff has access to. This feels kind of like "case closed" to me; I take it as a given that a sufficiently intelligent malign actor could exploit this.

You said you'd continue to think about the points I made about why I don't have the same reaction to this, so we can come back to those later. But stepping back, I think I want to make a point about the information loss you impose on your beliefs when applying the security mindset or Murphy's Law. A malign model gets to say "query now" once! And then it is removed from the posterior because of its wrong prediction. That is not many bits for an output channel.  But... yes, it is an output channel. And if you insist that all output channels for malign stuff transmit world-ending signals, just like all aquariums will leak, I agree that that is sometimes a useful perspective to adopt and run with, but the exercise of making actual probability estimates about events is potentially different from exercising a security mindset. I mentioned information loss. With a security mindset, there is no point in distinguishing a malign agent with a 20 bit output channel from one with an unbounded output channel. You just close the case on both. But we really might live in a world where the former is existentially safe, and the latter is not--I mean that even seems likely to me. Even from my perspective, where I don't think a 20 bit output channel is dangerous, I understand the merit in disregarding that belief as the motivation for a research agenda, but I do think that's a whole different sort of "epistemic status". So when you say "I take it as a given that a sufficiently intelligent malign actor could exploit this" that makes sense to me as an action, but not as a proposition. (I mean, I think it's incorrect, so I certainly don't think it's self-evidently true).
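The size of that channel is easy to bound (a back-of-the-envelope calculation of my own): a model that deliberately errs exactly once, and is then discarded from the posterior, only gets to choose when, so picking one timestep out of T transmits roughly log2(T) bits.

```python
import math

# Choosing one query moment out of T timesteps conveys about log2(T) bits.
T = 1_000_000           # e.g. a million timesteps in the deployment
bits = math.log2(T)     # just under 20 bits
```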

I guess I can expand on the object level too. Like, take a moment to really stare at 01110010101011010000, and imagine that you take it to be a signal from some deeply wise agent that arranged for a query at timestep number 469,712. Suppose further that you already think this deeply wise agent is not malign, but rather is much maligned! (So it doesn't even need to spend the first few bits enthralling you). What do you do now? Or maybe timestep 469,712 doesn't mean that bitstring--maybe it means "this is a very meaningful world-state, and of similar world-states, the fact that this one was chosen tells me I should run program X on a supercomputer, so it can engage in some good, clean world-optimization". Look around you right now--if someone told you "now is important, so think about that, and go run a program", what program do you produce? Am I taking a huge risk even asking that, because there's a precise time such that if I did, you'd run that program and it would end the world?
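For concreteness, the timestep/bitstring correspondence in the example above is just binary encoding, and the channel-capacity arithmetic is one line; a quick sketch (my own illustration, not part of the original argument):

```python
# One chosen query time out of 2**20 timesteps conveys about 20 bits.
# Timestep 469,712 written as a zero-padded 20-bit string is exactly
# the bitstring being stared at in the example.
timestep = 469_712
bits = format(timestep, "020b")
print(bits)  # 01110010101011010000

import math
channel_bits = math.log2(2 ** 20)
print(channel_bits)  # 20.0
```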

I think let's step back for a second, though. Suppose you were in the epistemic position "yes, this works in theory, with the realizability assumption, with no computational slowdown over MAP, but having spent 2-10 hours trying to figure out how to distill a neural network's epistemic uncertainty/submodel-mismatch, and having come up blank..." what's the conclusion here? I don't think it's "my main guess is that there's no way to apply this in practice".

A couple of separate points:

  • My main worry continues to be the way bad actors have control over an I/O channel, rather than the slowdown issue.
  • I feel like there's something a bit wrong with the 'theory/practice' framing at the moment. My position is that certain theoretical concerns (eg, embeddedness) have a tendency to translate to practical concerns (eg, approximating AIXI misses some important aspects of intelligence). Solving those 'in theory' may or may not translate to solving the practical issues 'in practice'. Some forms of in-theory solution, like setting the computer outside of the universe, are particularly unrelated to solving the practical problems. Your particular in-theory solution to embeddedness strikes me as being of this kind. I would contest whether it's even an in-theory solution to embeddedness problems; after all, are you theoretically saying that the computer running the imitation learning has no causal influence over the human being imitated? (This relates to my questions about whether the learner specifically requests demonstrations, vs just requiring the human to do demonstrations forever.) I don't really think of something like that as a "theoretical solution" to the realizability problem at all. That's reserved for something like logical induction, which has unrealistically high computational complexity, but does avoid a realizability assumption.

 Even if you had spent all the time since my original post trying to figure out how to efficiently distill a neural network's epistemic uncertainty, it's potentially a hard problem! [...] I have never tried to claim that analogizing this approach to neural networks will be easy, but I don't think you want to wait to hear my formal ideas until I have figured out how to apply them to neural networks;

Yeah, this is a fair point.

and 10 hours of unsuccessful search isn't even close to the amount of time needed to demote that area from "most promising".

To be clear, people I know spent a lot more time than that thinking hard about the consensus algorithm, before coming to the strong conclusion that it was a fruitless path. I agree that this is worth spending >20 hours thinking about. I just perceive it to have hit diminishing returns. (This doesn't mean no one should ever think about it again, but it does seem worth communicating why the direction hasn't borne fruit, at least to the extent that that line of research is happy being public.)

I think the question we are discussing here is: "yes, with the realizability assumption, existence of a benign model in the top set is substantially correlated over infinite time, enough so that all we need to look at is the relative weight of malign and benign models, BUT is the character of this correlation fundamentally different without the realizability assumption?" 

Sounds right to me.

I don't see how this example makes that point. If the threshold of "unrealistic" is set in such a way that "realistic" models will only know most things about Sally, then this should apply equally to malign and benign models alike. (I think your example respects that, but just making it explicit). However, there should be a benign and malign model that knows about Sally's affinity for butter but not her allergy to flowers, and a benign and a malign model that knows the opposite. It seems to me that we still end up just considering the relative weight of benign and malign models that we might expect to see.

Ah, ok! Basically this is a new way of thinking about it for me, and I'm not sure what I think yet. My picture was that we argue that the top-weighted "good" (benign+correct) hypothesis can get unlucky, but should never get too unlucky, such that we can set N so that the good guy is always in the top N. Without realizability, we would have no particular reason to think "the good guy" (which is now just benign + reasonably correct) never drops below N on the list, for any N (because oscillations can be unbounded).
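To make the top-N picture concrete, here is a toy sketch of a consensus-style imitation learner as I understand the idea under discussion (my reconstruction, not Cohen et al.'s actual algorithm): keep a posterior over hypothesis models, act autonomously only when the N highest-weight models agree, and otherwise query the demonstrator and down-weight models that predicted the demonstration wrongly. All the models and the demonstrator below are hypothetical stand-ins.

```python
def consensus_step(models, weights, obs, demonstrator, n=3):
    """One step: return (action, queried?) and update weights in place."""
    # Rank models by current posterior weight and take the top n.
    top = sorted(range(len(models)), key=lambda i: -weights[i])[:n]
    predictions = {models[i](obs) for i in top}
    if len(predictions) == 1:
        return predictions.pop(), False  # top models agree: act autonomously
    action = demonstrator(obs)           # disagreement: query the human
    # Down-weight every model that mispredicted the demonstration.
    for i in range(len(models)):
        if models[i](obs) != action:
            weights[i] *= 0.5
    return action, True

# Tiny worked example: three "models" of a parity task, one of them wrong.
models = [lambda x: x % 2, lambda x: x % 2, lambda x: 1 - x % 2]
weights = [1.0, 1.0, 1.0]
demonstrator = lambda x: x % 2
a, queried = consensus_step(models, weights, 4, demonstrator, n=3)
```

In this run the top-3 set contains the wrong model, so the learner queries, imitates the demonstrator's action, and halves the wrong model's weight. The point about malign models spending their "query now" budget shows up here as the `*= 0.5` update: a model that forces a query and mispredicts loses posterior weight.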

(A frugal hypothesis generating function instead of a brute force search over all reasonable models might miss out on, say, the benign version of the model that understands Sally's allergies; I do not claim to have identified an approach to hypothesis generation that reliably includes benign models. That problem could be one direction in the research agenda of analogizing this approach to state-of-the-art AI. And this example might also be worth thinking about in that project, but if we're just using the example to try to evaluate the effect of just removing the realizability assumption, but not removing the privilege of a brute search through reasonable models, then I stand by the choice to deem this paragraph parenthetical).

I don't really get why yet -- can you spell the (brute-force) argument out in more detail?

(going for now, will read+reply more later)

A few quick thoughts, and I'll get back to the other stuff later.

To be clear, people I know spent a lot more time than that thinking hard about the consensus algorithm, before coming to the strong conclusion that it was a fruitless path. I agree that this is worth spending >20 hours thinking about.

That's good to know. To clarify, I was only saying that spending 10 hours on the project of applying it to modern ML would not be enough time to deem it a fruitless path. If after 1 hour, you come up with a theoretical reason why it fails on its own terms--i.e. it is not even a theoretical solution--then there is no bound on how strongly you might reasonably conclude that it is fruitless. So this kind of meta point I was making only applied to your objections about slowdown in practice.

a "theoretical solution" to the realizability problem at all.

I only meant to claim I was just doing theory in a context that lacks the realizability problem, not that I had solved the realizability problem! But yes, I see what you're saying. The theory regards a "fair" demonstrator which does not depend on the operation of the computer. There are probably multiple perspectives about what level of "theoretical" that setting is. I would contend that in practice, the computer itself is not among the most complex and important causal ancestors of the demonstrator's behavior, so this doesn't present a huge challenge for practically arriving at a good model. But that's a whole can of worms.

My main worry continues to be the way bad actors have control over an I/O channel, rather than the slowdown issue.

Okay good, this worry makes much more sense to me.

Just want to note that although it's been a week this is still in my thoughts, and I intend to get around to continuing this conversation... but possibly not for another two weeks.

I guess at the end of the day I imagine avoiding this particular problem by building AGIs without using "blind search over a super-broad, probably-even-Turing-complete, space of models" as one of its ingredients. I guess I'm just unusual in thinking that this is a feasible, and even probable, way that people will build AGIs... (Of course I just wind up with a different set of unsolved AGI safety problems instead...)

The Evolutionary Story

By and large, we expect trained models to do (1) things that are directly incentivized by the training signal (intentionally or not), (2) things that are indirectly incentivized by the training signal (they're instrumentally useful, or they're a side-effect, or they “come along for the ride” for some other reason), and (3) things that are so simple to do that they can happen randomly.

So I guess I can imagine a strategy of saying "mesa-optimization won't happen" in some circumstance because we've somehow ruled out all three of those categories.

This kind of argument does seem like a not-especially-promising path for safety research, in practice. For one thing, it seems hard. Like, we may be wrong about what’s instrumentally useful, or we may overlook part of the space of possible strategies, etc. For another thing, mesa-optimization is at least somewhat incentivized by seemingly almost any training procedure, I would think.

...Hmm, in our recent conversation, I might have said that mesa-optimization is not incentivized in predictive (self-supervised) learning. I forget. But if so, I was confused. I have long believed that mesa-optimization is useful for prediction and still do. Specifically, the directly-incentivized kind of "mesa-optimization in predictive learning" entails, for example, searching over different possible approaches to process the data and generate a prediction, and then taking the most promising approach.

Anyway, what I should have said was that, in certain types of predictive learning, mesa-optimizers that search over active, real-world-manipulating plans are not incentivized—and then that's part of an argument that such mesa-optimizers are improbable. If that argument is correct, then the worst we would expect from a "misaligned mesa-optimizer" is that it will use an inappropriate prediction heuristic in some circumstances, and then we'd wind up with inaccurate predictions. That's a capability problem, not a safety problem.

So anyway, if there's a good argument along those lines, it would not be a safety argument that involves "There will be no mesa-optimizers", but rather "There will be no mesa-optimizers that think outside the box", so to speak. Details and (sketchy) argument in a forthcoming post.

By and large, we expect trained models to do (1) things that are directly incentivized by the training signal (intentionally or not), (2) things that are indirectly incentivized by the training signal (they're instrumentally useful, or they're a side-effect, or they “come along for the ride” for some other reason), and (3) things that are so simple to do that they can happen randomly.

We can also get a model that has an objective that is different from the intended formal objective (never mind whether the latter is aligned with us). For example, SGD may create a model with a different objective that happens to coincide with the intended objective just during training (or some part thereof). Why would this be unlikely? The intended objective is not privileged over such other objectives, from the perspective of the training process.

Evan gave an example related to this, where the intention was to train a myopic RL agent that goes through blue doors in the current episode, but the result is an agent with a more general objective that cares about blue doors in future episodes as well. In Evan's words (from the Future of Life podcast):

You can imagine a situation where every situation where the model has seen a blue door, it’s been like, “Oh, going through this blue [door] is really good,” and it’s learned an objective that incentivizes going through blue doors. If it then later realizes that there are more blue doors than it thought because there are other blue doors in other episodes, I think you should generally expect it’s going to care about those blue doors as well.

Similar concerns are relevant for (self-)supervised models, in the limit of capability. If a network can model our world very well, the objective that SGD yields may correspond to caring about the actual physical RAM of the computer on which the inference runs (specifically, the memory location that stores the loss of the inference). Also, if any part of the network, at any point during training, corresponds to dangerous logic that cares about our world, the outcome can be catastrophic (and the probability of this seems to increase with the scale of the network and training compute).

Also, a malign prior problem may manifest in (self-)supervised learning settings. (Maybe you consider this to be a special case of (2).)

Like, if we do gradient descent, and the training signal is "get a high score in PacMan", then "mesa-optimize for a high score in PacMan" is incentivized by the training signal, and "mesa-optimize for making paperclips, and therefore try to get a high score in PacMan as an instrumental strategy towards the eventual end of making paperclips" is also incentivized by the training signal.

For example, if at some point in training, the model is OK-but-not-great at figuring out how to execute a deceptive strategy, gradient descent will make it better and better at figuring out how to execute a deceptive strategy.

Here's a nice example. Let's say we do RL, and our model is initialized with random weights. The training signal is "get a high score in PacMan". We start training, and after a while, we look at the partially-trained model with interpretability tools, and we see that it's fabulously effective at calculating digits of π—it calculates them by the billions—and it's doing nothing else, it has no knowledge whatsoever of PacMan, it has no self-awareness about the training situation that it's in, it has no proclivities to gradient-hack or deceive, and it never did anything like that anytime during training. It literally just calculates digits of π. I would sure be awfully surprised to see that! Wouldn't you? If so, then you agree with me that "reasoning about training incentives" is a valid type of reasoning about what to expect from trained ML models. I don't think it's a controversial opinion...

Again, I did not (and don't) claim that this type of reasoning should lead people to believe that mesa-optimizers won't happen, because there do tend to be training incentives for mesa-optimization.

I would sure be awfully surprised to see that! Wouldn't you?

My surprise would stem from observing that RL in a trivial environment yielded a system that is capable of calculating/reasoning about π. If you replace the PacMan environment with a complex environment and sufficiently scale up the architecture and training compute, I wouldn't be surprised to learn the system is doing very impressive computations that have nothing to do with the intended objective.

Note that the examples in my comment don't rely on deceptive alignment. To "convert" your PacMan RL agent example to the sort of examples I was talking about: suppose that the objective the agent ends up with is "make the relevant memory location in the RAM say that I won the game", or "win the game in all future episodes".

My hunch is that we don't disagree about anything. I think you keep trying to convince me of something that I already agree with, and meanwhile I keep trying to make a point which is so trivially obvious that you're misinterpreting me as saying something more interesting than I am.

I guess at the end of the day I imagine avoiding this particular problem by building AGIs without using "blind search over a super-broad, probably-even-Turing-complete, space of models" as one of its ingredients. I guess I'm just unusual in thinking that this is a feasible, and even probable, way that people will build AGIs... (Of course I just wind up with a different set of unsolved AGI safety problems instead...)

Wait, you think your prosaic story doesn't involve blind search over a super-broad space of models??

I think any prosaic story involves blind search over a super-broad space of models, unless/until the prosaic methodology changes, which I don't particularly expect it to.

I agree that replacing "blind search" with different tools is a very important direction. But your proposal doesn't do that!

By and large, we expect trained models to do (1) things that are directly incentivized by the training signal (intentionally or not), (2) things that are indirectly incentivized by the training signal (they're instrumentally useful, or they're a side-effect, or they “come along for the ride” for some other reason), and (3) things that are so simple to do that they can happen randomly.

So I guess I can imagine a strategy of saying "mesa-optimization won't happen" in some circumstance because we've somehow ruled out all three of those categories.

This kind of argument does seem like a not-especially-promising path for safety research, in practice. For one thing, it seems hard. Like, we may be wrong about what’s instrumentally useful, or we may overlook part of the space of possible strategies, etc. For another thing, mesa-optimization is at least somewhat incentivized by seemingly almost any training procedure, I would think.

I agree with this general picture. While I'm primarily knocking down bad complexity-based arguments in my post, I would be glad to see someone working on trying to fix them.

...Hmm, in our recent conversation, I might have said that mesa-optimization is not incentivized in predictive (self-supervised) learning. I forget. But if so, I was confused. I have long believed that mesa-optimization is useful for prediction and still do. Specifically, the directly-incentivized kind of "mesa-optimization in predictive learning" entails, for example, searching over different possible approaches to process the data and generate a prediction, and then taking the most promising approach.

There were a lot of misunderstandings in the earlier part of our conversation, so, I could well have misinterpreted one of your points.

But if so, I'm even more struggling to see why you would have been optimistic that your RL scenario doesn't involve risk due to unintended mesa-optimization.

Anyway, what I should have said was that, in certain types of predictive learning, mesa-optimizers that search over active, real-world-manipulating plans are not incentivized—and then that's part of an argument that such mesa-optimizers are improbable. 

By your own account, the other part would be to argue that they're not simple, which you haven't done. They're not actively disincentivized, because they can use the planning capability to perform well on the task (deceptively). So they can be selected for just as much as other hypotheses, and might be simple enough to be selected in fact.

Wait, you think your prosaic story doesn't involve blind search over a super-broad space of models??

No, not prosaic, that particular comment was referring to the "brain-like AGI" story in my head...

Like, I tend to emphasize the overlap between my brain-like AGI story and prosaic AI. There is plenty of overlap. Like they both involve "neural nets", and (something like) gradient descent, and RL, etc.

By contrast, I haven't written quite as much about the ways that my (current) brain-like AGI story is non-prosaic. And a big one is that I'm thinking that there would be a hardcoded (by humans) inference algorithm that looks like (some more complicated cousin of) PGM belief propagation.

In that case, yes there's a search over a model space, because we need to find the (more complicated cousin of a) PGM world-model. But I don't think that model space affords the same opportunities for mischief that you would get in, say, a 100-layer DNN. Not having thought about it too hard... :-P
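To make the contrast concrete, here is a toy version of the kind of hardcoded inference step being described: one sum-product belief-propagation message on a two-variable chain model. The factor values are arbitrary illustrative numbers, not anyone's actual proposal; the point is just that the inference algorithm itself is hand-written, while only the model (the factors) would be learned.

```python
# p(x1, x2) proportional to unary1(x1) * unary2(x2) * pairwise(x1, x2),
# with binary variables x1, x2.
unary1 = [0.6, 0.4]          # local evidence for x1
unary2 = [0.5, 0.5]          # local evidence for x2
pairwise = [[0.9, 0.1],
            [0.1, 0.9]]      # factor favoring x1 == x2

# Message from x1 to x2: sum out x1.
msg = [sum(pairwise[i][j] * unary1[i] for i in range(2)) for j in range(2)]

# Belief (marginal) at x2, normalized.
belief2 = [unary2[j] * msg[j] for j in range(2)]
z = sum(belief2)
belief2 = [b / z for b in belief2]
print([round(b, 2) for b in belief2])  # [0.58, 0.42]
```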

No, not prosaic, that particular comment was referring to the "brain-like AGI" story in my head...

Ah, ok. It sounds like I have been systematically mis-perceiving you in this respect.

By contrast, I haven't written quite as much about the ways that my (current) brain-like AGI story is non-prosaic. And a big one is that I'm thinking that there would be a hardcoded (by humans) inference algorithm that looks like (some more complicated cousin of) PGM belief propagation.

I would have been much more interested in your posts in the past if you had emphasized this aspect more ;p But perhaps you held back on that to avoid contributing to capabilities research.

In that case, yes there's a search over a model space, because we need to find the (more complicated cousin of a) PGM world-model. But I don't think that model space affords the same opportunities for mischief that you would get in, say, a 100-layer DNN. Not having thought about it too hard... :-P

Yeah, this is a very important question!

I like your agenda. Some comments....

The benefit of formalizing things

First off, I'm a big fan of formalizing things so that we can better understand them. In the case of AI safety that, better understanding may lead to new proposals for safety mechanisms or failure mode analysis.

In my experience, once you manage to create a formal definition, it seldom captures the exact or full meaning you expected the informal term to have. Formalization usually exposes or clarifies certain ambiguities in natural language. And this is often the key to progress.

The problem with formalizing inner alignment

On this forum and in the broader community, I have seen a certain anti-pattern appear. The community has so far avoided getting too bogged down in discussing and comparing alternative definitions and formalizations of the intuitive term intelligence.

However, it has definitely gotten bogged down when it comes to the terms corrigibility, goal-directedness, and inner alignment failure. I have seen many cases of this happening:

The anti-pattern goes like this:

participant 1: I am now going to describe what I mean by [X] (corrigibility / goal-directedness / inner alignment failure), as a first step to make progress on this problem of [X].

participants 2-n: Your description does not correspond to my intuitive concept of [X] at all! Also, your steps 2 and 3 seem to be irrelevant to making progress on my concept of [X], because of the following reasons.

In this post on corrigibility I have called corrigibility a term with a high linguistic entropy; I think the same applies to the other two terms above.

These high-entropy terms seem to be good at producing long social media discussions, but unfortunately these discussions seldom lead to any conclusions or broadly shared insights. A lot of energy is lost in this way. What we really want, ideally, is useful discussion about the steps 2 and 3 that follow the definitional step.

On the subject of offering formal versions of inner alignment, you write:

A weakness of this as it currently stands is that I purport to offer the formal version of the inner optimization problem, but really, I just gesture at a cloud of possible formal versions.

My recommendation would be to see the above weakness as a feature, not a bug. I'd be interested in reading posts (or papers) where you pick one formal problem out of this cloud and run with it, to develop new proposals for safety mechanisms or failure mode analysis.

Some technical comments on the formal problem you identify

From your section 'the formal problem', I gather that the problems you associate with inner alignment failures are those that might produce treacherous turns or other forms of reward hacking.

You then consider the question of whether these failure modes could be suppressed by somehow limiting the complexity of the 'inner optimization' process, limiting it so that it is no longer capable of finding the unwanted 'malign' solutions. I'll give you my personal intuition on that approach here, by way of an illustrative example.

Say we have a shepherd who wants to train a newborn lion as a sheepdog. The shepherd punishes the lion whenever the lion tries to eat a sheep. Now, once the lion is grown, it will either have internalized the goal of not eating sheep but protecting them, or the goal of not getting punished. If the latter, the lion may at one point sneak up while the shepherd is sleeping and eat the shepherd.

It seems to me that the possibility of this treacherous turn happening is encoded from the start into the lion's environment and the ambiguity inherent in its reward signal. For me, the design approach of suppressing the treacherous turn dynamic by designing a lion that will not be able to imagine the solution of eating the shepherd seems like a very difficult one. The more natural route would be to change the environment or reward function.

That being said, I can interpret Cohen's imitation learner as a solution that removes (or at least attempts to suppress) all creativity from the lion's thinking.

If you want to keep the lion creative, you are looking for a way to robustly resolve the above inherent ambiguity in the lion's reward signal, to resolve it in a particular direction. Dogs are supposed to have a mental architecture which makes this easier, so they can be seen as an existence proof.

Reward hacking

I guess I should reiterate that, though treacherous turns seem to be the most popular example that comes up when people talk about inner optimizers, I see treacherous turns as just another example of reward hacking, of maximizing the reward signal in a way that was not intended by the original designers.

As 'not intended by the original designers' is a moral or utilitarian judgment, it is difficult to capture it in math, except indirectly. We can do it indirectly by declaring e.g. that a mentoring system is available which shows the intention of the original designers unambiguously by definition.

From your section 'the formal problem', I gather that the problems you associate with inner alignment failures are those that might produce treacherous turns or other forms of reward hacking.

It's interesting that you think of treacherous turns as automatically reward hacking. I would reserve "reward hacking" for cases where the treacherous turn is executed with the intention of taking over control of reward. In general, treacherous turns can be based on arbitrary goals. A fully inner-aligned system can engage in reward hacking.

It seems to me that the possibility of this treacherous turn happening is encoded from the start into the lion's environment and the ambiguity inherent in its reward signal. For me, the design approach of suppressing the treacherous turn dynamic by designing a lion that will not be able to imagine the solution of eating the shepherd seems like a very difficult one. The more natural route would be to change the environment or reward function.

That being said, I can interpret Cohen's imitation learner as a solution that removes (or at least attempts to suppress) all creativity from the lion's thinking.

If you want to keep the lion creative, you are looking for a way to robustly resolve the above inherent ambiguity in the lion's reward signal, to resolve it in a particular direction. Dogs are supposed to have a mental architecture which makes this easier, so they can be seen as an existence proof.

I think for outer-alignment purposes, what I want to respond here is "the lion needs feedback other than just rewards". You can't reliably teach the lion "don't ever eat sheep" rather than "don't eat sheep when humans are watching" when your only feedback mechanism can be applied only when humans are watching.

But if you could have the lion imagine hypothetical scenarios and provide feedback about them, then you could give feedback about whether it is OK to eat sheep when humans are not around.

To an extent, the answer is the same with inner alignment: more information/feedback is needed. But with inner alignment, we should be concerned even if we can look at the behavior in hypothetical scenarios and give feedback, because the system might be purposefully behaving differently in these hypothetical scenarios than it would in real situations. So here, we want to provide feedback (or prior information) about which forms of cognition are acceptable/unacceptable in the first place.

I guess I should reiterate that, though treacherous turns seem to be the most popular example that comes up when people talk about inner optimizers, I see treacherous turns as just another example of reward hacking, of maximizing the reward signal in a way that was not intended by the original designers.

As 'not intended by the original designers' is a moral or utilitarian judgment, it is difficult to capture it in math, except indirectly. We can do it indirectly by declaring e.g. that a mentoring system is available which shows the intention of the original designers unambiguously by definition.

I guess I wouldn't want to use the term "reward hacking" for this, as it does not necessarily involve reward at all. The term "perverse instantiation" has been used -- IE the general problem of optimizers spitting out dangerous things which are high on the proxy evaluation function but low in terms of what you really want.

Planned summary for the Alignment Newsletter:

This post outlines a document that the author plans to write in the future, in which he will define the inner alignment problem formally, and suggest directions for future research. I will summarize that document when it comes out, but if you would like to influence that document, check out the post.

Brainstorming

The following is a naive attempt to write a formal, sufficient condition for a search process to be "not safe with respect to inner alignment".

Definitions:

D: a distribution of labeled examples. Abuse of notation: I'll assume that we can deterministically sample a sequence of examples from D.

A: a deterministic supervised learning algorithm that outputs an ML model. A has access to an infinite sequence of training examples that is provided as input, and it uses a certain "amount of compute" c that is also provided as input. If we operationalize A as a Turing Machine, then c can be the number of steps that A is simulated for.

M_c: the ML model that A outputs when given an infinite sequence of training examples that was deterministically sampled from D, and c as the "amount of compute" that A uses.

acc(c): the accuracy of the model M_c over D (i.e. the probability that the model will be correct for a random example that is sampled from D).

Finally, we say that the learning algorithm A Fails The Basic Safety Test with respect to the distribution D if the accuracy acc(c) is not weakly increasing as a function of c.

Note: The "not weakly increasing" condition seems too weak. It should probably be replaced with a stricter condition, but I don't know what that stricter condition should look like.
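The definition above is easy to operationalize on toy data; here is a minimal sketch (my own illustration, with fabricated accuracy curves standing in for a real learner) that flags an algorithm whenever acc(c) strictly decreases somewhere, i.e. is not weakly increasing in compute c:

```python
def fails_basic_safety_test(acc, budgets):
    """acc: function mapping compute budget c to accuracy over D.
    Returns True iff acc is not weakly increasing over the given budgets."""
    values = [acc(c) for c in budgets]
    return any(later < earlier for earlier, later in zip(values, values[1:]))

# A hypothetical benign learner: accuracy rises with compute, then plateaus.
benign = lambda c: min(0.5 + 0.01 * c, 0.95)

# A hypothetical learner whose accuracy collapses at high compute --
# the treacherous-turn shape the test is meant to catch.
treacherous = lambda c: 0.95 if c < 80 else 0.10

print(fails_basic_safety_test(benign, range(0, 100, 10)))       # False
print(fails_basic_safety_test(treacherous, range(0, 100, 10)))  # True
```

One design note: because the check only samples acc(c) at finitely many budgets, a dip between sampled budgets would go undetected, which is another respect in which the condition as stated may be too weak.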

Great post.

I responded that for me, the whole point of the inner alignment problem was the conspicuous absence of a formal connection between the outer objective and the mesa-objective, such that we could make little to no guarantees based on any such connection.

Strong agree. In fact I believe developing the tools to make this connection could be one of the most productive focus areas of inner alignment research.

What I'd like to have would be several specific formal definitions, together with several specific informal concepts, and strong stories connecting all of those things together.

In connection with this, it may be worth checking out my old post where I try to untangle capability from alignment in the context of a particular optimization problem. I now disagree with around 20% of what I wrote there, but I still think it was a decent first stab at formalizing some of the relevant definitions, at least from a particular viewpoint.

Curated. Solid attempt to formalize the core problem, and solid comment section from lots of people.