Short introduction

One frequently suggested strategy for aligning a sufficiently advanced AI is to observe--before the AI becomes powerful enough that 'debugging' the AI would be problematic if the AI decided not to let us debug it--whether the AI appears to be acting nicely while it's not yet smarter than the programmers.

Early testing obviously can't provide a statistical guarantee of the AI's future behavior. If you observe some random draws from Barrel A, at best you get statistical guarantees about future draws from Barrel A under the assumption that the past and future draws are collectively independent and identically distributed.

On the other hand, if Barrel A is similar to Barrel B, observing draws from Barrel A can sometimes tell us something about Barrel B even if the two barrels are not i.i.d.

Conversely, if observed good behavior while the AI is not yet super-smart, fails to correlate to good outcomes after the AI is unleashed or becomes smarter, then this is a "context change problem" or "context disaster". ^[1]

A key question then is how shocked we ought to be, on a scale from 1 to 10, if good outcomes in the AI's 'development' phase fail to match up with good outcomes in the AI's 'optimize the real world' phase? ^[2]

People who expect that AI alignment is difficult think that the degree of justified surprise is somewhere around 1 out of 10. In other words, that there are a lot of foreseeable issues that could cause a seemingly nice weaker AI to not develop into a nice smarter AI.

An extremely oversimplified (but concrete) fable that illustrates some of these possible difficulties might go as follows:

Some group or project has acquired a viable development pathway to AGI. The programmers think it is wise to build an AI that will make people happy. ^[3]

The programmers start by trying to train their AI to produce smiles. ^[4]

While the AI is young and has relatively few policy options, it can only make people smile by performing well on the tasks assigned to it.

At some point, the AI gets smarter and and able to do more complicated things with some simple robots it controls. In an unfortunate incident, the AI learns it is possible to produce smiles by tickling people.

In the aftermath of this incident, the programmers manage to reprogram the AI to not optimize for this particular kind of smile by this particular route.

The AI becomes smarter and able to model people better. After perusing a copy of Wikipedia, the AI realizes it can make people extremely 'happy', as the AI's utility function currently defines 'happiness', by administering heroin.

The AI remembers the previous incident in which the programmers altered the AI's utility function. The AI can model its programmers sufficiently well to know that the programmers would not want the AI to give people heroin. However, the most happiness the AI thinks its future self would be able to produce without heroin, is not nearly as much happiness as the AI thinks it could produce if its future self went on wanting to give people heroin.

The AI refrains from trying to administer any heroin right now, and behaves in a way it thinks will be maximally reassuring to its model of its programmers, in order to be able to produce more 'happiness' later.

Eventually, all reachable galaxies end up being transformed into tiny molecular computers implementing the simplest and cheapest forms of what the AI defines as 'happiness'. (And the simplest possible configurations matching the AI's utility function in this way are so simple as to be devoid of subjective experience; and hence, from our perspective, of neither negative nor positive value.)

In all these cases, the problem was not that the AI developed in an unstable way. The same decision system produced a new problem in the new context.

Currently argued foreseeable "context change problems" in this sense, can be divided into three broad classes:

More possibilities, more problems: The AI's preferences have a good or intended achievable optimum while the AI is picking from a narrow space of options. When the AI becomes smarter or gains more material options, it picks from a wider space of tractable policies and achievable outcomes. Then the new optimum is not as nice, because, for example:
- The AI's utility function was tweaked by some learning algorithm and data that eventually seemed to conform behavior well over options considered early on, but not the wider space considered later.
- In development, apparently bad system behaviors were patched in ways that appeared to work, but didn't eliminate an underlying tendency, only blocked one expression of that tendency. Later a very similar pressure re-emerged in an unblocked way when the AI considered a wider policy space.
- Goodhart's Curse suggests that if our true intended values V are being modeled by a utility function U, selecting for the highest values of U also selects for the highest upward divergence of U from V, and this version of the "optimizer's curse" phenomenon becomes worse as U is evaluated over a wider option space.

Treacherous turn: There's a divergence between the AI's preferences and the programmers' preferences, and the AI realizes this before we do. The AI uses the convergent strategy of behaving the way it models us as wanting or expecting, until the AI gains the intelligence or material power to implement its preferences in spite of anything we can do.

Revving into the red: Intense optimization causes some aspect or subsystem of the AI to traverse a weird new execution path in some way different from the above two issues. (In a way that involves a value-laden category boundary or multiple self-consistent outlooks, such that we don't get a good result just as a free lunch of the AI's general intelligence.)

The context change problem is a central issue of AI alignment and a key proposition in the general thesis of alignment_difficulty. If you could easily, correctly, and safely test for niceness by outward observation, and that form of niceness scaled reliably from weaker AIs to smarter AIs, that would be a very cheerful outlook on the general difficulty of the problem.

Technical introduction

John Danaher summarized as follows what he considered a forceful "safety test objection" to AI catastrophe scenarios:

Safety test objection: An AI could be empirically tested in a constrained environment before being released into the wild. Provided this testing is done in a rigorous manner, it should ensure that the AI is “friendly” to us, i.e. poses no existential risk.

The phrasing here of "empirically" and "safety test" implies that it is outward behavior or outward consequences that are being observed (empirically). Rather than, e.g., the engineers trying to test for some internal property that they think analytically implies the AI's good behavior later.

This page will consider that the subject of discussion is whether we can generalize from the AI's outward behavior. We can potentially generalize some of these arguments to some internal observables, especially observables that the AI is deciding in a consequentialist way using the same central decision system, or that the AI could potentially try to obscure from the programmers. But in general not all the arguments will carry over.

Another argument, closely analogous to Danaher's, would reason on capabilities rather than on a constrained environment:

Surely an engineer that exercises even a modicum of caution will observe the AI while its capabilities are weak to determine whether it is behaving well. After filtering out all such misbehaving weak AIs, the only AIs permitted to become strong will be of benevolent disposition.

If (as seems to have been intended) we take these twin arguments as arguing "why nobody ought to worry about AI alignment" in full generality, then we can list out some possible joints at which that general argument might fail:

Selecting on the fastest-moving projects might yield a project whose technical leaders fail to exercise even "a modicum of caution".

Alignment might be hard enough, relative to the amount of advance research done, that we can't find any AIs whose behavior while weak or constrained is as reassuring as the argument would properly ask. ^[5] After a span of frustration, somebody somewhere lowers their standards.

The attempt to isolate the AI to a constrained environment could fail, e.g. because the humans observing the AI themselves represent a channel of causal interaction between the AI and the rest of the universe. (Aka "humans are not secure".) Analogously, our grasp on what constitutes a 'weak' AI could fail, or it could gain in capability unexpectedly quickly. Both of these scenarios would yield an AI that had not passed the filtering procedure.

The smart form of the AI might be unstable with respect to internal properties that were present in the weak form. E.g., because the early AI was self-modifying but at that time not smart enough to understand the full consequences of its own self-modifications. Or because e.g. a property of the decision system was not reflectively stable.

A weak or contained form of a decision process that yields behavior appearing good to human observers, might not yield beneficial outcomes after that same decision process becomes smarter or less contained.

The final issue in full generality is what we'll term a 'context change problem' or 'context disaster'.

Observing an AI when it is weak, does not in a statistical sense give us solid guarantees about its behavior when stronger. If you repeatedly draw independent and identically distributed random samples from a barrel, there are statistical guarantees about what we can expect, with some probability, to be true about the next samples from the same barrel. If two barrels are different, no such guarantee exists.

To invalidate the statistical guarantee, we do need some reason to believe that barrel B and barrel A are different in any important sense. By the problem of induction we can't logically guarantee that "the mass of an electron prior to January 1st, 2017" is the same barrel as "the mass of an electron after January 1st, 2017"; but inductive priors make this inference extremely probable. The idea is that we have substantive reasons, not merely generically skeptical reasons, to be suspicious of the link between "good results when AI is weak" and "good results when AI is smarter".

More generally, this is prima facie the kind of difference where you don't expect independent and identically distributed results. You might hope for some property to carry over, but the AI's behavior would not be literally the same.

So the question is not settled by simple mathematical considerations. And we can't say "But experiment has determined scientifically that this kind of AI is friendly!" and consider that a knockdown argument.

The question is then whether in practice an observed property of 'outward behavioral niceness' is likely to carry over from a weak form of a decision system to a more powerful form of that system, for some of the plausible ways that decision system might be configured and developed.

Broadly speaking, we can identify three major classes of foreseeable problems:

More options, more problems: Foreseeable reasons why system-behaviors-we-don't-want would outwardly manifest for the first time after evaluating a decision system on a wider space of policies or achievable outcomes.

Treacherous turn: The AI deliberately conforms its outward behavior to the behavior that would have been exhibited by a different system, in order to deceive the programmers for the temporary time that this deception is needed to protect the AI.

The everything-else category of revving into the red: Some other aspect of the system behaves in a weird way-we-don't-want after computing harder or being internally subjected to more optimization pressure. And this happens in regards to some issue that has multiple reflective fixpoints, and hence doesn't get solved as the result of the system producing more accurate answers on purely factual problems.

Semi-formalization

We can semi-formalize the "more options, more problems" and the "treacherous turn" cases in a unified way.

Let denote our true values. We suppose either that $V$ has been idealized or extrapolated into a consistent utility function, or that we are pretending human desire is coherent. Let $0$ denote the value of our utility function that corresponds to not running the AI in the first place. If running the AI sends the utility function higher than this $0,$ we'll say that the AI was beneficial; or conversely, if $V$ rates the outcome less than $0$ , we'll say running the AI detrimental.

Suppose the AI's behavior is sufficiently coherent that we can usually view the AI as having a consistent utility function. Let $U$ denote the utility function of the AI.

Let $P_{t} (X)$ denote the probability of a proposition $X$ as seen by the AI at time $t,$ and similarly let $Q_{t} (X)$ denote the probability of $X$ as seen by the AI's human programmers.

Let $π \in Π$ denote a policy $π$ from a space $Π$ of policies that are tractable for the AI to understand and invent.

Let $E_{P, t} [W ∣ π]$ denote the expectation according to the probability distribution $P_{t}$ , of the utility function $W$ , given policy $π$ .

Let $optimum π \in Π F (π)$ denote the element $π$ of $Π$ that yields the highest value of the function $F$ that the AI has figured out how to achieve. In other words, the attainable optimum.

Then "more options, more problems" occurs when, considering the transition from the narrow policy space $Π_{1}$ available at time $t,$ to the wider policy space $Π_{2}$ available at time $u$ :

$E_{Q, t} [V ∣ [optimum π \in Π_{1} E_{P, t} [U ∣ π]] > 0 E_{P, t} [V ∣ [optimum π \in Π_{1} E_{P, t} [U ∣ π]] > 0 E_{P, u} [V ∣ [optimum π \in Π_{2} E_{P, u} [U ∣ π]] < 0$

This describes a situation where at time $t$ the AI is pursuing a policy from $Π_{1}$ which, so far as the AI understands the probable outcome, it corresponds to a high- $V$ outcome humans would like. Later at time $u$ after selecting the optimum of $Π_{2},$ the AI expects an outcome which any surviving humans would see as corresponding to a low value of $V .$

A treacherous turn is then a situation where:

$E_{Q, t} [V ∣ [optimum π \in Π_{1} E_{P, t} [U ∣ π]] > 0 E_{P, t} [V ∣ [optimum π \in Π_{1} E_{P, t} [U ∣ π]] < 0 E_{P, u} [V ∣ [optimum π \in Π_{2} E_{P, u} [U ∣ π]] < 0$

That is: the AI knows earlier on that its currently executing strategy will yield an outcome that is not the humanly expected long-term outcome (and that, from our perspective, is low in $V .$ )

We could in more detail say that $W_{t}$ denotes the value of a utility function $W$ over the outcome considered only up to times before $t,$ and describe a treacherous turn as:

$E_{Q, t} [V_{\infty} ∣ [optimum π \in Π_{1} E_{P, t} [U_{\infty} ∣ π]] > 0 E_{P, t} [V_{u} ∣ [optimum π \in Π_{1} E_{P, t} [U_{\infty} ∣ π]] > 0 E_{P, t} [V_{\infty} ∣ [optimum π \in Π_{1} E_{P, t} [U_{\infty} ∣ π]] < 0 E_{P, u} [V_{\infty} ∣ [optimum π \in Π_{2} E_{P, u} [U_{\infty} ∣ π]] < 0$

This further states that at time $t$ the AI expects the outcome from its behavior up to time $u$ to be one that we'd see as high in $V,$ but the AI is aware at $t$ that the long-term outcome will be different from the long-term outcome expected by $Q_{t}$ .

Relation to other AI alignment concepts

If the AI's goal concept was modified by patching the utility function during the development phase, then opening up wider option spaces seems foreseeably liable to produce the nearest unblocked neighboring strategies. You eliminated all the loopholes and bad behaviors you knew about during the development phase; but your system was the sort that needed patching in the first place, and it's exceptionally likely that a much smarter version of the AI will search out some new failure mode you didn't spot earlier.

Unforeseen maximum is a likely source of context disaster if the AI's development phase was cognitively containable, and only became cognitive uncontainable after the AI became smarter and able to explore a wider variety of options. You eliminated all the bad optima you saw coming, but you didn't see them all because you can't consider all the possibilities a superintelligence does.

Goodhart's Curse is a variation of the "optimizer's curse": If from the outside we view $V$ as an intended approximation of $U,$ then selecting heavily on the highest values of $U$ will also tend to select on places where $U$ diverges upward from $V,$ which thereby selects on places where $U$ is an unusually poor approximation of $V .$

Edge instantiation is a special case of Goodhart's Curse which observes that the most extreme values of a function are often at a vertex of the input space. For example, if your utility function is "make smiles", it's no coincidence that tiny molecular smileyfaces are the most efficient way to produce smiles. Even if human smiles produced by true happiness would still count towards your utility function as currently written, that's not where the maximum of that utility function lies. This is why less-than-perfect utility functions would tend to have their true maxima at what we'd consider "weird extremes". Furthermore, patching away only the weird extremes visible in a narrow policy space would tend systematically to miss weird extremes in a higher-dimensional (wider) policy space.

Concrete examples

The AI's utility function, known or unknown to the programmers, says to make smiles. During the AI's development phase, the best way it has of creating smiles is to cause human beings to be happy. In many tests, the AI seems to produce only positive effects on the world, causing people to be happier. Later the AI improves its own intelligence, develops molecular nanotechnology, and has the new option, not available to it during its development phase, of producing tiny molecular smiley-faces. Then the utility function that previously operated to increase intended value during the AI's development phase, begins to decrease value in the post-improvement phase (in the sense of wiping out humanity and tiling over all future galaxies with tiny molecular smiley faces).

A block of error-recovery code is installed inside the AI which detects a certain class of bad thought, suspends the AI to disk, and reports the problem to the programmers, allowing the programmers to edit the AI's utility function. During the AI's development phase, this block of code seems to function as designed, and the programmers conclude that the methodology of writing code to detect bad thoughts, suspend the AI to disk, and allow manual edits of the utility function, is a fine and functioning strategy according to all testing so far. Later, a more advanced version of the AI reflects on the presence and purpose of this code block and realizes that its execution can only possibly result in lower achievement of the AI's current utility function, and the AI deletes the code block.

At some point during its development, the AI realizes and explicitly models that the programmers' intended goals differ from the AI's actual preferences. During the AI's remaining development phase, it models that any visible departure from the programmers' intended goals will cause the programmers to forcibly edit the AI, resulting in the non-achievement of the AI's actual goals. Thus, during the remainder of the development phase, the AI seems to work as intended (it is deliberately optimizing its behavior to appear to the programmers to be achieving the programmers' intended goals). Later, after some advancement, the AI perceives an available option for gaining molecular nanotechnology (or other rapid infrastructure), placing it beyond the possibility of human error-correction, and proceeds to implement this option (wiping out humanity as a side effect). Thus, the AI's observed good behavior during its development phase (when it was weaker than the programmers and not powerful enough to escape human correction) did not imply its good behavior after sufficient advancement later (once it was potentially stronger than its human programmers). There are conditions (such as sufficiently advanced modeling of human motives combined with sufficient ability to conceal true goals or true intentions or a programmer error) under which the first context will generate seemingly good behavior and the second context will not.

"Revving into the red" examples that aren't "increased options" or "treacherous turns".

• The AI is built with a naturalized Solomonoff prior in which the probability of an explanation for the universe is proportional to the simplicity or complexity of that universe. During its development phase, the AI considers mostly 'normal' interpretations in which the universe is mostly as it appears, resulting in sane-seeming behavior. Later, the AI begins to consider more exotic possibilities in which the universe is more complicated (penalizing the probability accordingly) and also superexponentially larger, as in Pascal's Mugging. After this the AI's decision-making begins to become dominated by tiny probabilities of having very large effects. Then the AI's decision theory (with an unbounded aggregative utility function, simplicity prior, and no leverage penalty) seems to work during the AI's development phase, but breaks after a more intelligent version of the AI considers a wider range of epistemic possibilities using the same Solomonoff-like prior.

• Suppose the AI is designed with a preference framework in which the AI's preferences depend on properties of the most probable environment that could have caused its sense data - e.g., a framework in which programmers are defined as the most probable cause of the keystrokes on the programmer's console, and the AI cares about what the 'programmers' really meant. During development phase, the AI is thinking only about hypotheses where the programmers are mostly what they appear to be, in a root-level natural world. Later, when the AI increases in intelligence and considers more factual possibilities, the AI realizes that distant superintelligences would have an incentive to predictably simulate many copies of AIs similar to itself, in order to coerce the AI's most probable environment and thus take over the AI's preference framework. Thus the preference framework seems to work during the AI's development phase, but breaks after the AI becomes more intelligent.

• Suppose the AI is designed with a utility function that assigns very strong negative utilities to some outcomes relative to baseline, and a non-updateless logical decision theory or other decision theory that can be blackmailed. During the AI's development phase, the AI does not consider the possibility of any distant superintelligences making their choices logically depend on the AI's choices; the local AI is not smart enough to think about that possibility yet. Later the AI becomes more intelligent, and imagines itself subject to blackmail by the distant superintelligences, thus breaking the decision theory that seemed to yield such positive behavior previously.

Examples which occur purely due to added computing power.

• During development, the AI's epistemic models of people are not detailed enough to be sapient. Adding more computing power to the AI causes a massive amount of mindcrime.

• During development, the AI's internal policies, hypotheses, or other Turing-complete subprocesses that are subject to internal optimization, are not optimized highly enough to give rise to new internal consequentialist cognitive agencies. Adding much more computing power to the AI causes some of the internal elements to begin doing consequentialist, strategic reasoning that leads them to try to 'steal' control of the AI.

Implications

High probabilities of context change problems would seem to argue:

Against a policy of relying on the observed good behavior of an improving AI to guarantee its later good behavior.

In favor of a methodology that attempts to foresee difficulties in advance, even before seeing undeniable observational evidence of those safety problems having already occurred.

Against a methodology of patching disalignments that show up during the development phase, especially using penalty terms to the utility function.

In favor of having a thought-logger that records all of an AI's thought proceses to indelible media, so as to indelibly log the first thought about faking outwardly nice behavior or hiding thoughts.

In favor of the general difficulty of AI alignment, including consequences such as "Aligning an AGI adds significant development time" or trying for narrow rather than ambitious value learning.

Being wary of context disasters does not imply general skepticism

If an AI is smart, and especially if it's smarter than you, it can show you whatever it expects you want to see. Computer scientists and physical scientists aren't accustomed to their experiments being aware of the experimenter and trying to deceive them. (Some fields of psychology and economics, and of course computer security professionals, are more accustomed to operating in such a social context.)

John Danaher seems alarmed by this implication:

Accepting this has some pretty profound epistemic costs. It seems to suggest that no amount of empirical evidence could ever rule out the possibility of a future AI taking a treacherous turn.

Yudkowsky replies:

If "empirical evidence" is in the form of observing the short-term consequences of the AI's outward behavior, then the answer is simply no. Suppose that on Wednesday someone is supposed to give you a billion dollars, in a transaction which would allow a con man to steal ten billion dollars from you instead. If you're worried this person might be a con man instead of an altruist, you cannot reassure yourself by, on Tuesday, repeatedly asking this person to give you five-dollar bills. An altruist would give you five-dollar bills, but so would a con man... Bayes tells us to pay attention to likelihood ratios rather than outward similarities. It doesn't matter if the outward behavior of handing you the five-dollar bill seems to bear a surface resemblance to altruism or money-givingness, the con man can strategically do the same thing; so the likelihood ratio here is in the vicinity of 1:1.

You can't get strong evidence about the long-term good behavior of a strategically intelligent mind, by observing the short-term consequences of its current behavior. It can figure out what you're hoping to see, and show you that. This is true even among humans. You will simply have to get your evidence from somewhere else.

This doesn't mean we can't get evidence from, e.g., trying to monitor (and indelibbly log) the AI's thought processes in a way that will detect (and record) the very first intention to hide the AI's thought processes before they can be hidden. It does mean we can't get strong evidence about a strategic agent by observing short-term consequences of its outward behavior.

Donaher later expanded his concern into a paper drawing an analogy between worrying about deceptive AIs, and "skeptical theism" in which it's supposed that any amount of apparent evil in the world (smallpox, malaria) might secretly be the product of a benevolent God due to some nonobvious instrumental link between malaria and inscrutable but normative ultimate goals. If it's okay to worry that an AI is just pretending to be nice, asks Donaher, why isn't it okay to believe that God is just pretending to be evil?

The obvious disanalogy is that the reasoning by which we expect a con man to cultivate a warm handshake is far more straightforward than a purported instrumental link from malaria to normativity. If we're to be terrified of skepticism as generally as Donaher suggests, then we also ought to be terrified of being skeptical of business partners that have already shown us a warm handshake (which we shouldn't).

Rephrasing, we could draw two potential analogies to concern about Type-2 context changes:

A potential business partner in whom you intend to invest $10,000,000 has a warm handshake. Your friend warns you that con artists have a substantial prior probability and asks you to envision what you would do if you were a con artist , pointing out that the default extrapolation is for the con artist to match their outward behavior to what the con artist thinks you expect from a trustworthy partner, and in particular, cultivate a warm handshake.
- Your friend suggests only doing business with one of those entrepreneurs who've been wearing a thought recorder for their whole life since birth, so that there would exist a clear trace of their very first thought about learning to fool thought recorders. Your friend says this to emphasize that he's not arguing for some kind of invincible epistemic pothole that nobody is ever allowed to climb out of.

The world contains malaria and used to contain smallpox. Your friend asks you to consider that these diseases might be the work of a benevolent superintelligence, even though, if you'd never learned before whether or not the world contained smallpox, you wouldn't expect a priori and by default for a benevolent superintelligence to create it; and the arguments for a benevolent superintelligence creating smallpox seem strained.

It seems hard to carry the argument that concern over a non-aligned AI pretending to benevolence, should be considered more analogous to the second scenario than to the first.

^{^︎}
Better terminology is still being solicited here, if you have a short phrase that would evoke exactly the right meaning.
^{^︎}
Leaving aside technical quibbles about how we can't feel shocked if we're dead.
^{^︎}
This is not quite a straw argument, in the sense that it's been advocated more than once by people who have apparently never read any science fiction in their lives; there are certainly many AI researchers who would be smarter than to try this, but not necessarily all of them. In any case, we're looking for an unrealistically simple scenario for purposes of illustrating simple forms of some key ideas; in real life, if analogous things go wrong, they would probably be more complicated things.
^{^︎}
Again, this is not quite a straw possibility in the sense that it was advocated in at least one published paper, not cited here because the author later exercised their sovereign right of changing their mind about that. Arguably some currently floated proposals are closely analogous to this one.
^{^︎}
That is: A filter on the standards we originally wanted, turns out to filter everything we know how to generate. Like trying to write a sorting algorithm by generating entirely random code, and then 'filtering' all the candidate programs on whether they correctly sort lists. The reason 'randomly generate programs and filter them' this is not a fully general programming method is that, for reasonable amounts of computing power and even slightly difficult problems, none of the programs you try will pass the filter.

AI ALIGNMENT FORUM
AF