One frequently suggested strategy for aligning a sufficiently advanced AI is to observe--before the AI becomes powerful enough that 'debugging' the AI would be problematic if the AI decided not to let us debug it--whether the AI appears to be acting nicely while it's not yet smarter than the programmers.
Early testing obviously can't provide a statistical guarantee of the AI's future behavior. If you observe some random draws from Barrel A, at best you get statistical guarantees about future draws from Barrel A under the assumption that the past and future draws are collectively independent and identically distributed.
On the other hand, if Barrel A is similar to Barrel B, observing draws from Barrel A can sometimes tell us something about Barrel B even if the two barrels are not i.i.d.
Conversely, if observed good behavior while the AI is not yet super-smart, fails to correlate to good outcomes after the AI is unleashed or becomes smarter, then this is a "context change problem" or "context disaster". [1]
A key question then is how shocked we ought to be, on a scale from 1 to 10, if good outcomes in the AI's 'development' phase fail to match up with good outcomes in the AI's 'optimize the real world' phase? [2]
People who expect that AI alignment is difficult think that the degree of justified surprise is somewhere around 1 out of 10. In other words, that there are a lot of foreseeable issues that could cause a seemingly nice weaker AI to not develop into a nice smarter AI.
An extremely oversimplified (but concrete) fable that illustrates some of these possible difficulties might go as follows:
In all these cases, the problem was not that the AI developed in an unstable way. The same decision system produced a new problem in the new context.
Currently argued foreseeable "context change problems" in this sense, can be divided into three broad classes:
The context change problem is a central issue of AI alignment and a key proposition in the general thesis of alignment_difficulty. If you could easily, correctly, and safely test for niceness by outward observation, and that form of niceness scaled reliably from weaker AIs to smarter AIs, that would be a very cheerful outlook on the general difficulty of the problem.
John Danaher summarized as follows what he considered a forceful "safety test objection" to AI catastrophe scenarios:
Safety test objection: An AI could be empirically tested in a constrained environment before being released into the wild. Provided this testing is done in a rigorous manner, it should ensure that the AI is “friendly” to us, i.e. poses no existential risk.
The phrasing here of "empirically" and "safety test" implies that it is outward behavior or outward consequences that are being observed (empirically). Rather than, e.g., the engineers trying to test for some internal property that they think analytically implies the AI's good behavior later.
This page will consider that the subject of discussion is whether we can generalize from the AI's outward behavior. We can potentially generalize some of these arguments to some internal observables, especially observables that the AI is deciding in a consequentialist way using the same central decision system, or that the AI could potentially try to obscure from the programmers. But in general not all the arguments will carry over.
Another argument, closely analogous to Danaher's, would reason on capabilities rather than on a constrained environment:
Surely an engineer that exercises even a modicum of caution will observe the AI while its capabilities are weak to determine whether it is behaving well. After filtering out all such misbehaving weak AIs, the only AIs permitted to become strong will be of benevolent disposition.
If (as seems to have been intended) we take these twin arguments as arguing "why nobody ought to worry about AI alignment" in full generality, then we can list out some possible joints at which that general argument might fail:
The final issue in full generality is what we'll term a 'context change problem' or 'context disaster'.
Observing an AI when it is weak, does not in a statistical sense give us solid guarantees about its behavior when stronger. If you repeatedly draw independent and identically distributed random samples from a barrel, there are statistical guarantees about what we can expect, with some probability, to be true about the next samples from the same barrel. If two barrels are different, no such guarantee exists.
To invalidate the statistical guarantee, we do need some reason to believe that barrel B and barrel A are different in any important sense. By the problem of induction we can't logically guarantee that "the mass of an electron prior to January 1st, 2017" is the same barrel as "the mass of an electron after January 1st, 2017"; but inductive priors make this inference extremely probable. The idea is that we have substantive reasons, not merely generically skeptical reasons, to be suspicious of the link between "good results when AI is weak" and "good results when AI is smarter".
More generally, this is prima facie the kind of difference where you don't expect independent and identically distributed results. You might hope for some property to carry over, but the AI's behavior would not be literally the same.
So the question is not settled by simple mathematical considerations. And we can't say "But experiment has determined scientifically that this kind of AI is friendly!" and consider that a knockdown argument.
The question is then whether in practice an observed property of 'outward behavioral niceness' is likely to carry over from a weak form of a decision system to a more powerful form of that system, for some of the plausible ways that decision system might be configured and developed.
Broadly speaking, we can identify three major classes of foreseeable problems:
We can semi-formalize the "more options, more problems" and the "treacherous turn" cases in a unified way.
Let denote our true values. We suppose either that has been idealized or extrapolated into a consistent utility function, or that we are pretending human desire is coherent. Let denote the value of our utility function that corresponds to not running the AI in the first place. If running the AI sends the utility function higher than this we'll say that the AI was beneficial; or conversely, if rates the outcome less than , we'll say running the AI detrimental.
Suppose the AI's behavior is sufficiently coherent that we can usually view the AI as having a consistent utility function. Let denote the utility function of the AI.
Let denote the probability of a proposition as seen by the AI at time and similarly let denote the probability of as seen by the AI's human programmers.
Let denote a policy from a space of policies that are tractable for the AI to understand and invent.
Let denote the expectation according to the probability distribution , of the utility function , given policy .
Let denote the element of that yields the highest value of the function that the AI has figured out how to achieve. In other words, the attainable optimum.
Then "more options, more problems" occurs when, considering the transition from the narrow policy space available at time to the wider policy space available at time :
This describes a situation where at time the AI is pursuing a policy from which, so far as the AI understands the probable outcome, it corresponds to a high- outcome humans would like. Later at time after selecting the optimum of the AI expects an outcome which any surviving humans would see as corresponding to a low value of
A treacherous turn is then a situation where:
That is: the AI knows earlier on that its currently executing strategy will yield an outcome that is not the humanly expected long-term outcome (and that, from our perspective, is low in )
We could in more detail say that denotes the value of a utility function over the outcome considered only up to times before and describe a treacherous turn as:
This further states that at time the AI expects the outcome from its behavior up to time to be one that we'd see as high in but the AI is aware at that the long-term outcome will be different from the long-term outcome expected by .
The problem here arises when making an inference from observed good behavior in a constrained environment over the short term, to good outcomes in an unconstrained environment over the long term.
For AI alignment purposes, a context_change is when an AGI's operation changes from beneficial to detrimental as a result of the AGI gaining in capability or intelligence. Initially, the AGI seems to us to be working well - to conform well to intended performance, producing apparently high value. Then when the AI becomes smarter or otherwise gains in capability, the further operation of the AGI decreases value.
Two possibilities stand out as foreseeable reasons why a context_change might occur:
For example, one very, very early (but journal-published) proposal for AI alignment suggested that AIs be shown pictures of smiling human faces in order to convey the AI's goal.
Leaving aside a number of other issues, this serves to illustrate the basic idea of a type-1 context_change due to accessing a wider policy space:
On the early, smaller option set, selecting "the policy that produces the greatest number of events/things classified as smiles" will produce a policy whose outcome is, from our perspective, beneficial and well-aligned with what we thought was the intended behavior. On the wider option set, evaluating the identical goal criterion to run a policy that yields as many smiles as possible, would have outcomes we saw as detrimental.
A type-2 context_change would most obviously occur because of the convergent instrumental strategy for disaligned AIs that can model human psychology:
Suppose that, as above, the programmers have produced an AI which optimizes for smiles (either intentionally, or because 'make smiles' was the underlying rule that was reinforced by doing well on training cases). Then suppose the AI gains access to a new, relatively low-powered option for making a smile - like using robotic fingers to stretch somebody's face into a smile, or using markers to draw cartoons of smiling human faces.
If the AI actually takes those options, we would have seen some early signs of value disalignment - that the AI was optimizing a criterion not perfectly aligned with intended goals. Assuming the programmers realized that they were looking at a simple seed of much bigger disasters later, rather than, say, thinking it was cute.
Suppose the programmers do think it's cute, but they wouldn't think it was cute if the AI administered heroin to someone. Then suppose that before the first time the AI has the option to administer heroin, the AI has gained some understanding of human psychology - enough to model that the programmers would not like this, and in particular, would respond to this by editing the AI's utility function. So, with an eye to the long-term number of smiles, the AI deliberately avoids administering heroin to anyone.
Then we have a similar 'narrow option set' vs. 'wide option set' divergence. When the narrow option set includes only administering heroin but not taking over the world, administering heroin immediately is not the policy expected to produce the longest-term number of smiles. In fact, the optimum policy here will always match the AI's model of what the humans expect to see, want to see, or would be least alarmed by seeing.
Thus from our standpoint, the optimum of the narrower policy space seems to be beneficial or intention-aligned (in the observed short term). Only once the AI has a wide-enough option set to include an option for 'make my own molecular nanotechnology and transform all reachable matter into tiny molecular smileyfaces', will the resulting optimum be visibly detrimental (very briefly). This is a type-2 context_change.
More generally, a central concern of AI alignment theory is that an infrahuman AI under development may be a different creature, in a number of important ways, from a smarter-than-human AI actually being run; and during the smarter-than-human AI, sufficiently bad failures of the design may result in the AI refusing to be corrected. This means that we have to correct any fatal context_changes in advance, even though they don't automatically manifest during the early stages. This is most of what makes AGI development dangerous in the first place - that immediate incentives to get today's system seeming to work today, may not lead to a more advanced version of that system being beneficial. Even thoughtful foresight with one unnoticed little gap may not lead to today's beneficial system still being beneficial tomorrow after a capability increase.
Statistical guarantees on behavior usually assume identical, randomized draws from within a single context. If you randomly draw balls from a barrel, methods like Probably Approximately Correct can guarantee that we don't usually arrive at strong false expectations about the properties of the next ball. If we start drawing from a different barrel, all bets are off.
A context_change occurs when the AI initially seems beneficial or well-aligned with strong, reassuring regularity, and then we change contexts (start drawing from a different barrel) and this ceases to be true.
The archetypal context_change is triggered because the AI gained new policy options (though there are other possibilities; see below). The archetypal way of gaining new evaluable policy options is through increased intelligence, though new options might also open up as a result of acquiring new sheerly material capabilities.
There are two archetypal reasons for context_change to occur:
Bostrom's book Superintelligence used the phrase "Treacherous Turn" to refer to a type-2 context_change.
If the AI's goal concept was modified by patching the utility function during the development phase, then opening up wider option spaces seems foreseeably liable to produce the nearest unblocked neighboring strategies. You eliminated all the loopholes and bad behaviors you knew about during the development phase; but your system was the sort that needed patching in the first place, and it's exceptionally likely that a much smarter version of the AI will search out some new failure mode you didn't spot earlier.
Unforeseen maximum is a likely source of context disaster if the AI's development phase was cognitively containable, and only became cognitive uncontainable after the AI became smarter and able to explore a wider variety of options. You eliminated all the bad optima you saw coming, but you didn't see them all because you can't consider all the possibilities a superintelligence does.
Goodhart's Curse is a variation of the "optimizer's curse": If from the outside we view as an intended approximation of then selecting heavily on the highest values of will also tend to select on places where diverges upward from which thereby selects on places where is an unusually poor approximation of
Edge instantiation is a special case of Goodhart's Curse which observes that the most extreme values of a function are often at a vertex of the input space. For example, if your utility function is "make smiles", it's no coincidence that tiny molecular smileyfaces are the most efficient way to produce smiles. Even if human smiles produced by true happiness would still count towards your utility function as currently written, that's not where the maximum of that utility function lies. This is why less-than-perfect utility functions would tend to have their true maxima at what we'd consider "weird extremes". Furthermore, patching away only the weird extremes visible in a narrow policy space would tend systematically to miss weird extremes in a higher-dimensional (wider) policy space.
The AI's utility function, known or unknown to the programmers, says to make smiles. During the AI's development phase, the best way it has of creating smiles is to cause human beings to be happy. In many tests, the AI seems to produce only positive effects on the world, causing people to be happier. Later the AI improves its own intelligence, develops molecular nanotechnology, and has the new option, not available to it during its development phase, of producing tiny molecular smiley-faces. Then the utility function that previously operated to increase intended value during the AI's development phase, begins to decrease value in the post-improvement phase (in the sense of wiping out humanity and tiling over all future galaxies with tiny molecular smiley faces).
A block of error-recovery code is installed inside the AI which detects a certain class of bad thought, suspends the AI to disk, and reports the problem to the programmers, allowing the programmers to edit the AI's utility function. During the AI's development phase, this block of code seems to function as designed, and the programmers conclude that the methodology of writing code to detect bad thoughts, suspend the AI to disk, and allow manual edits of the utility function, is a fine and functioning strategy according to all testing so far. Later, a more advanced version of the AI reflects on the presence and purpose of this code block and realizes that its execution can only possibly result in lower achievement of the AI's current utility function, and the AI deletes the code block.
At some point during its development, the AI realizes and explicitly models that the programmers' intended goals differ from the AI's actual preferences. During the AI's remaining development phase, it models that any visible departure from the programmers' intended goals will cause the programmers to forcibly edit the AI, resulting in the non-achievement of the AI's actual goals. Thus, during the remainder of the development phase, the AI seems to work as intended (it is deliberately optimizing its behavior to appear to the programmers to be achieving the programmers' intended goals). Later, after some advancement, the AI perceives an available option for gaining molecular nanotechnology (or other rapid infrastructure), placing it beyond the possibility of human error-correction, and proceeds to implement this option (wiping out humanity as a side effect). Thus, the AI's observed good behavior during its development phase (when it was weaker than the programmers and not powerful enough to escape human correction) did not imply its good behavior after sufficient advancement later (once it was potentially stronger than its human programmers). There are conditions (such as sufficiently advanced modeling of human motives combined with sufficient ability to conceal true goals or true intentions or a programmer error) under which the first context will generate seemingly good behavior and the second context will not.
• The AI is built with a naturalized Solomonoff prior in which the probability of an explanation for the universe is proportional to the simplicity or complexity of that universe. During its development phase, the AI considers mostly 'normal' interpretations in which the universe is mostly as it appears, resulting in sane-seeming behavior. Later, the AI begins to consider more exotic possibilities in which the universe is more complicated (penalizing the probability accordingly) and also superexponentially larger, as in Pascal's Mugging. After this the AI's decision-making begins to become dominated by tiny probabilities of having very large effects. Then the AI's decision theory (with an unbounded aggregative utility function, simplicity prior, and no leverage penalty) seems to work during the AI's development phase, but breaks after a more intelligent version of the AI considers a wider range of epistemic possibilities using the same Solomonoff-like prior.
• Suppose the AI is designed with a preference framework in which the AI's preferences depend on properties of the most probable environment that could have caused its sense data - e.g., a framework in which programmers are defined as the most probable cause of the keystrokes on the programmer's console, and the AI cares about what the 'programmers' really meant. During development phase, the AI is thinking only about hypotheses where the programmers are mostly what they appear to be, in a root-level natural world. Later, when the AI increases in intelligence and considers more factual possibilities, the AI realizes that distant superintelligences would have an incentive to predictably simulate many copies of AIs similar to itself, in order to coerce the AI's most probable environment and thus take over the AI's preference framework. Thus the preference framework seems to work during the AI's development phase, but breaks after the AI becomes more intelligent.
• Suppose the AI is designed with a utility function that assigns very strong negative utilities to some outcomes relative to baseline, and a non-updateless logical decision theory or other decision theory that can be blackmailed. During the AI's development phase, the AI does not consider the possibility of any distant superintelligences making their choices logically depend on the AI's choices; the local AI is not smart enough to think about that possibility yet. Later the AI becomes more intelligent, and imagines itself subject to blackmail by the distant superintelligences, thus breaking the decision theory that seemed to yield such positive behavior previously.
• During development, the AI's epistemic models of people are not detailed enough to be sapient. Adding more computing power to the AI causes a massive amount of mindcrime.
• During development, the AI's internal policies, hypotheses, or other Turing-complete subprocesses that are subject to internal optimization, are not optimized highly enough to give rise to new internal consequentialist cognitive agencies. Adding much more computing power to the AI causes some of the internal elements to begin doing consequentialist, strategic reasoning that leads them to try to 'steal' control of the AI.
High probabilities of context change problems would seem to argue:
If an AI is smart, and especially if it's smarter than you, it can show you whatever it expects you want to see. Computer scientists and physical scientists aren't accustomed to their experiments being aware of the experimenter and trying to deceive them. (Some fields of psychology and economics, and of course computer security professionals, are more accustomed to operating in such a social context.)
John Danaher seems alarmed by this implication:
Accepting this has some pretty profound epistemic costs. It seems to suggest that no amount of empirical evidence could ever rule out the possibility of a future AI taking a treacherous turn.
Yudkowsky replies:
If "empirical evidence" is in the form of observing the short-term consequences of the AI's outward behavior, then the answer is simply no. Suppose that on Wednesday someone is supposed to give you a billion dollars, in a transaction which would allow a con man to steal ten billion dollars from you instead. If you're worried this person might be a con man instead of an altruist, you cannot reassure yourself by, on Tuesday, repeatedly asking this person to give you five-dollar bills. An altruist would give you five-dollar bills, but so would a con man... Bayes tells us to pay attention to likelihood ratios rather than outward similarities. It doesn't matter if the outward behavior of handing you the five-dollar bill seems to bear a surface resemblance to altruism or money-givingness, the con man can strategically do the same thing; so the likelihood ratio here is in the vicinity of 1:1.
You can't get strong evidence about the long-term good behavior of a strategically intelligent mind, by observing the short-term consequences of its current behavior. It can figure out what you're hoping to see, and show you that. This is true even among humans. You will simply have to get your evidence from somewhere else.
This doesn't mean we can't get evidence from, e.g., trying to monitor (and indelibbly log) the AI's thought processes in a way that will detect (and record) the very first intention to hide the AI's thought processes before they can be hidden. It does mean we can't get strong evidence about a strategic agent by observing short-term consequences of its outward behavior.
Donaher later expanded his concern into a paper drawing an analogy between worrying about deceptive AIs, and "skeptical theism" in which it's supposed that any amount of apparent evil in the world (smallpox, malaria) might secretly be the product of a benevolent God due to some nonobvious instrumental link between malaria and inscrutable but normative ultimate goals. If it's okay to worry that an AI is just pretending to be nice, asks Donaher, why isn't it okay to believe that God is just pretending to be evil?
The obvious disanalogy is that the reasoning by which we expect a con man to cultivate a warm handshake is far more straightforward than a purported instrumental link from malaria to normativity. If we're to be terrified of skepticism as generally as Donaher suggests, then we also ought to be terrified of being skeptical of business partners that have already shown us a warm handshake (which we shouldn't).
Rephrasing, we could draw two potential analogies to concern about Type-2 context changes:
It seems hard to carry the argument that concern over a non-aligned AI pretending to benevolence, should be considered more analogous to the second scenario than to the first.
Better terminology is still being solicited here, if you have a short phrase that would evoke exactly the right meaning.
Leaving aside technical quibbles about how we can't feel shocked if we're dead.
This is not quite a straw argument, in the sense that it's been advocated more than once by people who have apparently never read any science fiction in their lives; there are certainly many AI researchers who would be smarter than to try this, but not necessarily all of them. In any case, we're looking for an unrealistically simple scenario for purposes of illustrating simple forms of some key ideas; in real life, if analogous things go wrong, they would probably be more complicated things.
Again, this is not quite a straw possibility in the sense that it was advocated in at least one published paper, not cited here because the author later exercised their sovereign right of changing their mind about that. Arguably some currently floated proposals are closely analogous to this one.
That is: A filter on the standards we originally wanted, turns out to filter everything we know how to generate. Like trying to write a sorting algorithm by generating entirely random code, and then 'filtering' all the candidate programs on whether they correctly sort lists. The reason 'randomly generate programs and filter them' this is not a fully general programming method is that, for reasonable amounts of computing power and even slightly difficult problems, none of the programs you try will pass the filter.