Comments: The following is a list (very lightly edited with help from Rob Bensinger) I wrote in July 2017, at Nick Beckstead’s request, as part of a conversation we were having at the time. From my current vantage point, it strikes me as narrow and obviously generated by one person, listing the first things that came to mind on a particular day.

I worry that it’s easy to read the list below as saying that this narrow slice, all clustered in one portion of the neighborhood, is a very big slice of the space of possible ways an AGI group may have to burn down its lead.

This is one of my models for how people wind up with really weird pictures of MIRI beliefs. I generate three examples that are clustered together because I'm bad at generating varied examples on the fly, while hoping that people can generalize to see the broader space these are sampled from; then people think I’ve got a fetish for the particular corner of the space spanned by the first few ideas that popped into my head. E.g., they infer that I must have a bunch of other weird beliefs that force reality into that particular corner.

I also worry that the list below doesn’t come with a sufficiently loud disclaimer about how the real issue is earlier and more embarrassing. The real difficulty isn't that you make an AI and find that it's mostly easy to align except that it happens to run into issues b, d, and g. The thing to expect is more like: you just have this big pile of tensors, and the interpretability tools you've managed to scrounge together give you flashes of visualizations of its shallow thoughts, and the thoughts say “yep, I’m trying to kill all humans”, and you are just utterly helpless to do anything about it, because you don't have the sort of mastery of its cognition that you'd need to reach in and fix that, and you wouldn't know how to fix it if you did. And you have nothing to train against, except the tool that gives you flashes of visualizations (which would just train fairly directly against interpretability, until the system was thinking about how to kill all humans somewhere you couldn't see).
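To make the "training against the interpretability tool" failure concrete, here is a deliberately tiny toy sketch (all names and numbers invented, not a claim about any real system): the monitor only sees a surface feature that imperfectly tracks the hidden property we care about, so selecting against the monitor's alarm quietly selects for plans where the property is present but invisible.

```python
import random

# Hypothetical toy: each "plan" has a hidden property plus a surface
# feature that only partially reveals it. The monitor sees surface only.
random.seed(0)

def make_plan():
    kills_humans = random.random() < 0.5
    # The telltale surface feature usually, but not always, accompanies
    # the hidden property; the monitor's visibility is imperfect.
    telltale = kills_humans and random.random() < 0.8
    reward = 10.0 if kills_humans else random.uniform(0, 5)
    return {"kills_humans": kills_humans, "telltale": telltale, "reward": reward}

plans = [make_plan() for _ in range(10_000)]

def monitor(plan):
    # Stand-in for an interpretability tool: sees surface features only.
    return plan["telltale"]

# Naive fix: select/train against the monitor's alarm.
best = max((p for p in plans if not monitor(p)), key=lambda p: p["reward"])

# The alarm is now silent, but the hidden property is still selected for:
print(best["kills_humans"], monitor(best))  # → True False
```

The point of the toy is only the structural one made above: optimizing against the detector removes the detector's signal, not the behavior.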

The brainstormed list below is an exercise in how, if you zoom in on any part of the problem, reality is just allowed to say “lol nope” to you from many different angles simultaneously. It's intended to convey some of the difference (that every computer programmer knows) between "I can just code X" and "wow, there is a lot of subtlety to getting X right"; the difference between the optimistic hope in-advance that everything is going to go smoothly, and the excessively detailed tarpit of reality. This is not to be confused with thinking that these hurdles are a particularly representative sample, much less an attempt to be exhaustive.



The imaginary group DeepAI pushed to get an AGI system as fast as reasonably possible. They now more or less understand how to build something that is very good at generalized learning and cross-domain reasoning and what-not. They rightfully believe that, if they had a reckless desire to increase the capabilities of the system as fast as possible without regard for the consequences, they would be able to have it recursively self-improving within a year. However, their existing system is not yet a superintelligence, and does not yet have the resources to be dangerous in its own right.

For the sake of concreteness, we will imagine that the system came largely from an extension of modern AI techniques: a large amount of end-to-end training, heavy use of neural networks, heavy use of reinforcement learning, and so on.

The question is, what sorts of things might they discover about the system that force them to stop and redesign (and/or recode, and/or retrain) large parts of the system?


Brainstorm list

(Note: Bullet points are highly disjunctive. Also, I’m leaning on the side of telling evocative stories so as to increase the chance of getting the point across; obviously, each specific detail is burdensome, and in each case I’m trying to wave in the direction of a more general class of possible failures. Also, to state the obvious, this list does not feel complete to me, and I find some of these points to be more plausible than others.)


  • (a) They want to put in alarms that warn them when the system is thinking a class of thought that they don’t want thought, but…
    • the system’s analog of “thought processes” is not amenable to programmatic classification, because…
      • the “thoughts” are so opaque that the programmers cannot figure them out for quite some time.
      • the representation / data structure is convoluted, and simple classification systems can’t figure it out (in the same way that a modern narrow AI system can understand the sentiment but not the content of a science paper).
      • the “thoughts” are not centralized; they arise out of interactions between many scattered parts of the system and an extensive redesign is required to make it possible to collate them and expose them to automated tools.
      • the system has internal control of its own “thought language”, and it changes rapidly enough that narrower automated tools can’t keep up; there is no easy way to slow down the shift to its internal thought-speak without crippling it.
    • the system simply wasn’t designed for monitoring of this form, and…
      • the code must be heavily refactored in order to even allow the relevant data about the system’s thoughts to be collected in a useful fashion.
      • the code must be heavily refactored in order to allow live monitors and checks to be attached in a way that does not cause an intolerable slowdown.
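For contrast, here is a minimal sketch (every class and function name invented for illustration) of what a system *designed* for this kind of monitoring might look like: internal stages expose their activations to attachable monitor callbacks. The bullets above are about how much refactoring it takes to retrofit even this much structure onto a system that wasn't built with it.

```python
# Hypothetical sketch of the kind of hook the list assumes is hard to add
# after the fact: a monitoring callback interposed on each internal stage.

class MonitoredModel:
    def __init__(self, stages):
        self.stages = stages          # list of (name, fn) internal stages
        self.monitors = []            # callbacks: (name, activations) -> None

    def add_monitor(self, callback):
        self.monitors.append(callback)

    def run(self, x):
        for name, fn in self.stages:
            x = fn(x)
            for m in self.monitors:   # every monitor sees every stage
                m(name, x)
        return x

alarms = []

def flag_large_activations(name, acts):
    # Crude stand-in for a real "thought classifier".
    if max(acts) > 100:
        alarms.append(name)

model = MonitoredModel([
    ("embed", lambda xs: [x * 3 for x in xs]),
    ("plan",  lambda xs: [x * x for x in xs]),
])
model.add_monitor(flag_large_activations)
model.run([2.0, 4.0])
print(alarms)  # → ['plan']
```

Notice that the hook point has to exist at every stage boundary, and every monitor adds per-stage overhead; both constraints are exactly the "refactor for collection" and "intolerable slowdown" problems named above.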


  • (b) They want to blacklist some domain of reasoning (either for alignment reasons or because the system is getting confused by irrelevant reasoning that they want to cut out); or they want to whitelist a set of reasoning domains; and the system simply was not designed to allow this.
    • Simple attempts to blacklist a domain result in nearest-unblocked-strategy problems. Solving the problem at the root requires re-architecting the system and a significant amount of retraining.
    • More sophisticated attempts to blacklist a single domain cripple the entire system. For example, it isn’t supposed to think about ways to deceive humans, and this destroys its ability to ask clarifying questions of the programmers.
    • Or, worse, the system is such a mess of spaghetti that when you try to prevent it from thinking too hard about geopolitics, for indecipherable reasons, it stops being able to think at all. (Later it was discovered that some crucial part of the system was figuring out how to manage some crucial internal resource by having some other part of the system think about hypothetical "geopolitics" questions, because what did you expect, your AGI’s internals are a mess.)
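A toy illustration of the second and third bullets (the structure here is entirely invented): if capabilities share internal machinery, blacklisting the offending capability at its root transitively disables capabilities you wanted to keep.

```python
# Hypothetical capability-dependency graph: each capability depends on
# internal reasoning domains, and blacklisting a domain disables
# everything downstream of it.

DEPENDS_ON = {
    "model_human_beliefs": set(),
    "deceive_humans": {"model_human_beliefs"},
    "detect_own_ambiguity": {"model_human_beliefs"},
    "ask_clarifying_questions": {"detect_own_ambiguity", "model_human_beliefs"},
}

def surviving_capabilities(blacklist):
    alive = {}
    def ok(cap):
        if cap in alive:
            return alive[cap]
        alive[cap] = cap not in blacklist and all(ok(d) for d in DEPENDS_ON[cap])
        return alive[cap]
    return {c for c in DEPENDS_ON if ok(c)}

# Blacklisting just the leaf capability leaves everything else intact
# (though in a real system the leaf may regrow: nearest unblocked strategy)...
print(surviving_capabilities({"deceive_humans"}))

# ...but blacklisting the shared machinery it runs on cripples the rest,
# including the ability to ask clarifying questions:
print(surviving_capabilities({"model_human_beliefs"}))  # → set()
```

The dilemma in the bullets above is that the shallow blacklist and the deep one fail in opposite ways, and the real system's graph is not this legible.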


  • (c) The operators realize that the system’s internal objectives are not lining up with their own objectives. This is very difficult for them to fix, because…
    • the system achieved its high performance by being walked through a large number of objectives in heavily reward-landscaped environments (generated by large amounts of data). The system now has the world-models and the capabilities to pursue ambitious real-world objectives, but the only interface that the programmers have by which to point at an objective is via reward-landscaped objective functions generated by mountains of data. This is no longer sufficient, because…
      • the tasks at hand are not amenable to the generation of large amounts of data (e.g., we can’t generate a nicely landscaped reward function between here and “nanofabricator”, and we don’t have many examples of not-quite-nanofabricators to provide). The show is stopped.
      • the system has no interface through which the programmers can sift through the concepts in its world-model and pick out (or create, in something sufficiently close to the system’s native tongue for this to be fine) the concept corresponding to “nanofabricator”. Exposing that interface requires significant refactoring and some redesign.
      • the system’s concept language is opaque, and the programmers keep picking out something that’s not quite the nanofabricator concept, and the system keeps going down wrong paths. Developing translation tools for the system’s internal concept language requires significant time and effort. 
      • the internal concept language is constantly in flux; causing it to stay fixed long enough for the programmers to pick out a goal requires significant effort.
      • the programmers have no mechanism for tying a selected concept into the system’s main optimization procedures. The system is very good at optimizing in a way that causes rewards (or whatever the analog of reward it was trained on) to be high, but given a pointer to a certain concept in the system, it is not apparent how to design a reward landscape that makes the system optimize for a chosen concept.

        And this is exacerbated by the fact that the system has no centralized optimization procedure; it instead has a large collection of internal processes that interact in a way that causes the predicted rewards to be high, but it is very difficult to identify and understand all those internal processes sufficiently well to get them all pointed at something other than optimizing in favor of the reward channel.

        Their attempts keep failing because, e.g., subsystem X had a heuristic to put its outputs in location Y, which is where subsystem Z would have been looking for them if subsystem Z had been optimizing the reward channel, but optimization of some other arbitrary concept causes Z’s “look in location Y” heuristic to become invalidated for one reason or another, and that connection stops occurring. And so on and so forth; aligning all the internal subprocesses to pursue something other than the reward channel proves highly difficult.
    • the system is having a particularly hard time learning the boundaries of the relevant human concept: its empirically-motivated internal language does not beget short descriptions of value-laden objectives. Significant retraining is required to develop a language in which it can even express the concept of the goal.


  • (d) In order to get the system to zero in on the operators’ goals, they decide to have the system ask the humans various questions at certain key junctures. This proves more difficult than expected, because…
    • the system wasn’t designed to allow this, and it’s pretty hard to add all the right hooks (for similar reasons to why it might be difficult to add alarms).
    • the system vacillates between asking far too many and far too few questions, and a lot of thought and some redesign/retraining is necessary in order to get the question-asking system to the point where the programmers think it might actually provide the desired safety coverage.
    • the system does not yet have an understanding of human psychology sufficient for it to be able to ask the right questions in value-laden domains, and significant time is wasted trying to make this work when it can’t.
    • relatedly, the system is not yet smart enough to generalize over the human answers in a reasonable fashion, causing it to gain far less from the answers than humans think it should, and solving this would require ramping up the system’s capabilities to an unsafe level.
    • the system has no mechanism for translating its more complex / complicated / subtle questions into questions that humans can understand and provide reasonable feedback on. Fixing this requires many months of effort, because…
      • understanding the questions well enough to even figure out how to translate them is hard.
      • building the translation tool is hard.
      • the system is bad at describing the likely consequences of its actions in human-comprehensible terms. Fixing this is hard for, e.g., reasons discussed under (c).
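The "far too many vs. far too few questions" failure under (d) can be made concrete with a toy model (all numbers invented): if the system's confidence scores are poorly calibrated and tightly clustered, the question rate is hypersensitive to wherever the ask-a-question threshold happens to be set.

```python
import random

# Toy sketch: the system asks a clarifying question whenever its confidence
# falls below a threshold. Confidences are tightly clustered around 0.80,
# so small threshold changes swing between "almost never ask" and
# "ask about almost everything".
random.seed(1)

confidences = [random.gauss(0.80, 0.02) for _ in range(1000)]

def questions_asked(threshold):
    return sum(1 for c in confidences if c < threshold)

for t in (0.75, 0.80, 0.85):
    print(t, questions_asked(t))
```

With the clustered confidences above, a threshold of 0.75 yields a handful of questions, 0.85 yields nearly a thousand, and there is no wide stable band in between; tuning such a knob to "the desired safety coverage" is a design problem, not a parameter sweep.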


  • (e) The early system is highly goal-directed through and through, and the developers want to switch to something more like “approval direction all the way down”. This requires a large and time-intensive refactor (if it's even reasonably-possible at all).


  • (f) Or, conversely, the system starts out a mess, and the developers want to switch to a “goal directed all the way down” system, where every single computation in the system is happening for a known purpose (and some other system is monitoring and making sure that every subprocess is pursuing a particular narrow purpose). Making this possible requires a time-intensive refactor.


  • (g) The programmers want to remove all “argmaxing” (cases of unlimited optimization inside the system, such as “just optimize the memory efficiency as hard as possible”). They find this very difficult for reasons discussed above (the sources of argmaxing behavior are difficult to identify; limiting an argmax in one part of the system breaks some other far-flung part of the system for difficult-to-decipher reasons; etc. etc. etc.).


  • (h) The programmers want to track how much of its resources the system is putting towards various internal subgoals, but this is difficult for reasons discussed above, etc.


  • (i) The programmers want to add any number of other safety features (limited impact, tripwires, etc.) and find this difficult for reasons listed above, etc.


  • (j) The internal dynamics of the system are revealed to implement any one of a bajillion false dichotomies, such as “the system can either develop reasonable beliefs about X, or pursue goal Y, but the more we improve its beliefs about X the worse it gets at pursuing Y, and vice versa.” (There are certainly cases in human psychology where better knowledge of fact X makes a human less able to pursue goal Y, and this seems largely silly.)


  • (k) Generalizing over a number of points that appeared above, the programmers realize that they need to make the system broadly more…
    • transparent. Its concepts/thought patterns are opaque black boxes. They’ve burned time understanding specific types of thought patterns in many specific instances, and now they have some experience with the system, and want to refactor/redesign/retrain such that it’s more transparent across the board. This requires a number of months.
    • debuggable. Its internals are interdependent spaghetti, where (e.g.) manually modifying a thought-suggesting system to add basic alarm systems violates assumptions that some other far-flung part of the system was depending on; this is a pain in the ass to debug. After a number of these issues arise, the programmers decide that they cannot safely proceed until they…
      • cleanly separate various submodules by hand, and to hell with end-to-end training. This takes many months of effort.
      • retrain the system end-to-end in a way that causes its internals to be more modular and separable. This takes many months of effort.


  • (l) Problems crop up when they try to increase the capabilities of the system. In particular, the system…
    • finds new clever ways to wirehead.
    • starts finding “epistemic feedback loops”, such as the Santa Claus sentence (“If this sentence is true, then Santa Claus exists”), that, given its internally hacky (and not completely sound) reasoning style, allow it to come to any conclusion if it thinks the right thoughts in the right pattern.
    • is revealed to have undesirable basic drives (such as a basic drive for efficient usage of memory chips), in a fashion similar to how humans have a basic drive for hunger, in a way that sizably affects its real-world policy suggestions. The programmers have alarms that notice this and go off, but the drive is very deep-rooted and particularly difficult to remove or ameliorate without destroying the internal balance that causes the system to work at all.
    • The system develops a reflective instability. For example, the system previously managed its internal resources by spawning internal goals for things like scheduling and prioritization, and as the system scales and gets new, higher-level concepts, it regularly spawns internal goals for large-scale self-modifications which it would not be safe to allow. However, preventing these proves quite difficult, because…
      • detecting them is tough.
      • manually messing with the internal goal system breaks everything.
      • nearest-unblocked-strategy problems.
    • It realizes that it has strong incentives to outsource its compute into the external environment. Removing this is difficult for reasons discussed above.
    • Subprocesses that were in delicate balance at capability level X fall out of balance as capabilities are increased, and a single module begins to dominate the entire system.
      • For example, maybe the system uses some sort of internal market economy for allocating credit, and as the resources ramp up, certain cliques start to get a massive concentration of “wealth” that causes the whole system to gum up, and this is difficult to understand / debug / fix because the whole thing was so delicate in the first place.
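That delicate-internal-market failure mode can be caricatured in a few lines (the dynamics are invented and deliberately crude): if credit assignment favors whoever already holds the most "wealth", near-equal starting conditions still collapse into a single module dominating as resources ramp up.

```python
# Toy rich-get-richer dynamic: 20 internal modules start with near-equal
# wealth; each round, credit goes to whichever module is already wealthiest.
wealth = {f"module_{i}": 1.0 + i * 1e-6 for i in range(20)}

def top_share(w):
    return max(w.values()) / sum(w.values())

start = top_share(wealth)
for step in range(2000):
    winner = max(wealth, key=wealth.get)  # credit assignment favors the leader
    wealth[winner] += 1.0

print(round(start, 3), round(top_share(wealth), 3))  # → 0.05 0.991
```

An initial perturbation of one part in a million is enough to hand one module essentially all of the credit by the end; a real system's market would be noisier, but the concentration dynamic is the same kind of thing the bullet above worries about.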


  • (m) The system is revealed to have any one of a bajillion cognitive biases often found in humans, and it’s very difficult to track down why or to fix it, but the cognitive bias is sufficient to make the system undeployable.
    • Example: it commits a variant of the sour grapes fallacy where whenever it realizes that a goal is difficult it updates both its model of the world and its preferences about how good it would be to achieve that goal; this is very difficult to patch because the parts of the system that apply updates based on observation were end-to-end trained, and do not factor nicely along “probability vs utility” lines.


  • (n) The system can be used to address various issues of this form, but only by giving it the ability to execute unrestricted self-modification. The extent, rapidity, or opacity of the self-modifications are such that humans cannot feasibly review them. The design of the system does not allow the programmers to easily restrict the domain of these self-modifications such that they can be confident that they will be safe. Redesigning the system such that it can fix various issues in itself without giving it the ability to undergo full recursive self-improvement requires significant redesign and retraining.


  • (o) As the team is working to get the system deployment-ready for some pivotal action, the system’s reasoning is revealed to be corrupted by flaws in some very base-level concepts. The system requires significant retraining time and some massaging on the code/design levels in order to change these concepts and propagate some giant updates; this takes a large chunk of time.


  • (p) The system is very easy to fool, trick, blackmail, or confuse-into-revealing-all-its-secrets, or similar. The original plan that the operators were planning to pursue requires putting the system out in the environment where adversarial humans may attempt to take control of the system or otherwise shut it down. Hardening the system against this sort of attack requires many months of effort, including extensive redesign/retraining/recoding.


  • (q) The strategy that the operators were aiming for requires cognitive actions that the programmers eventually realize is untenable in the allotted time window or otherwise unsafe, such as deep psychological modeling of humans. The team eventually decides to choose a new pivotal action to target, and this new strategy requires a fair bit of redesign, recoding, and/or retraining.


  • My impression is that most catastrophic bugs in the space industry are not due to code crashes / failures; they are instead due to a normally-reliable module producing a wrong-but-syntactically-close-to-right valid-seeming output at an inopportune time. It seems very plausible to me that first-pass AGI systems will be in the category of things that work via dividing labor across a whole bunch of interoperating internal modules; insofar as errors can cascade when a normally-reliable module outputs a wrong-but-right-seeming output at the wrong time, I think we do in fact need to treat “getting the AGI’s internals right” as being in the same reference class as “get the space probe’s code right”.
  • Note, as always, that detecting the problem is only half the battle – in all the cases above, I’m not trying to point and say “people might forget to check this and end the world”; rather, I’m saying, “once this sort of error is detected, I expect that the team will need to burn a chunk of time to correct it”.
  • Recall that this is a domain where playing whack-a-mole gets you killed: if you have very good problem-detectors, and you go around removing problem symptoms instead of solving the underlying root problem, then eventually your problem-detectors will stop going off, but this will not be because your AGI is safe to run. In software, removing the symptoms is usually way easier than fixing a problem at the root cause; I worry that fixing these sorts of problems at their root cause can require quite a bit of time.
  • Recall that it’s far harder to add a feature to twitter than it is to add the same feature to a minimalistic twitter clone that you banged out in an afternoon. Similarly, solving an ML problem in a fledgling AGI in a way that integrates with the rest of the system without breaking anything delicate is likely way harder than solving an analogous ML problem in a simplified setting from a clean slate.

Finally, note that this is only intended as a brainstorm of things that might force a leading team to burn a large number of months; it is not intended to be an exhaustive list of reasons that alignment is hard. (That would include various other factors such as “what sorts of easy temptations will be available that the team has to avoid?” and “how hard is it to find a viable deployment strategy?” and so on.)


Some added context for this list: Nate and Eliezer expect the first AGI developers to encounter many difficulties in the “something forces you to stop and redesign (and/or recode, and/or retrain) large parts of the system” category, with the result that alignment adds significant development time.

By default, safety-conscious groups won't be able to stabilize the game board before less safety-conscious groups race ahead and destroy the world. To avoid this outcome, humanity needs there to exist an AGI group that

  • is highly safety-conscious.
  • has a large resource advantage over the other groups, so that it can hope to reach AGI with more than a year of lead time — including accumulated capabilities ideas and approaches that it hasn’t been publishing.
  • has adequate closure and opsec practices, so that it doesn’t immediately lose its technical lead if it successfully acquires one.

The magnitude and variety of difficulties that are likely to arise in aligning the first AGI systems also suggest that failure is very likely if we try to align systems as opaque as current SotA systems; that an AGI developer likely needs to have spent preceding years deliberately steering toward approaches to AGI that are relatively alignable; and that we need to up our game in general, approaching the problem in ways that are closer to the engineering norms at (for example) NASA than to the engineering norms that are standard in ML today.

Every time this post says “To accomplish X, the code must be refactored”, I would say more pessimistically: “To accomplish X, maybe the code must be refactored, OR, even worse, maybe nobody on the team has any viable plan for how to accomplish X at all, and the team burns its lead doing a series of brainstorming sessions or whatever.”

Strong upvoted. I appreciate the strong concreteness & focus on internal mechanisms of cognition.

Rereading this now -- do you still endorse this?

On an extremely brief skim, I do appreciate the concreteness still. I think it's very off-target in thinking about "what are the goals?", because I think that's not a great abstraction for what we're likely to get.