This is a paper I wrote as part of a PhD program in philosophy, in trying to learn more about and pivot towards alignment research. In it, I mostly aimed to build up and distill my knowledge of IDA.

Special thanks to Daniel Kokotajlo for his mentorship on this, and to Michael Brownstein, Eric Schwitzgebel, Evan Hubinger, Mark Xu, William Saunders, and Aaron Gertler for helpful feedback!

Introduction

Iterated Distillation and Amplification (IDA) (Christiano, Shlegeris and Amodei 2018) is a research program in technical AI alignment theory (Bostrom 2003, 2014; Yudkowsky 2008; Russell 2019; Ngo 2020). It’s a proposal about how to build a machine learning algorithm that pursues human goals, so that we can safely count on very powerful, near-future AI systems pursuing the things we want them to once they are more capable than us.

IDA does this by building an epistemically idealized model of a particular human researcher. In doing so, it has to answer the question of what “epistemic idealization” means, exactly. IDA’s answer is that one epistemically idealized version of someone is an arbitrarily large number of copies of them thinking in a particular research mindset, each copy thinking for a relatively short amount of time, all working together to totally explore a given question. Given questions are divided into all significant sub-questions by a hierarchy of researcher models, all relevant basic sub-questions are answered by researcher models, and then answers are composed into higher-level answers. In the end, a researcher model is able to see what all his relevant lines of thought regarding any question would be, were he to devote the time to thinking them through. Epistemic idealization means being able to see at a glance every relevant line of reasoning that you would think through, if you only had the time to.

One worry for IDA is that there might exist sub-questions that, when researcher models in the hierarchy encounter them, would cause those models to (radically) reconsider their goals. If not all models in the hierarchy share the same goal-set, then it’s no longer guaranteed that the hierarchy’s outputs are solely the product of goal-aligned reasoning. We now have to worry about portions of the hierarchy attempting to manipulate outcomes by selectively modifying their answers. (Take, for example, a question containing a sound proof that the researcher considering the question will be tortured forever if he doesn’t reply in X arbitrary way.) We risk the hierarchy encountering these questions because it aims to look over a huge range of relevant sub-questions, and because it might end up running subprocesses that would feed it manipulative questions. I argue that this is a real problem for IDA, but that appropriate architectural changes can address the problem.

I’ll argue for these claims by first explaining IDA, both intuitively and more technically. Then, I’ll examine the class of adversarial questions that could disrupt goal alignment in IDA’s hierarchy of models. Finally, I’ll explain the architectural modifications that resolve the adversarial questions problem.

The Infinite Researcher-Hierarchy

In IDA, the name of this potentially infinite hierarchy of research models is “HCH” (a self-containing acronym for “Humans Consulting HCH”) (Christiano 2016, 2018). It’s helpful to start with an intuitive illustration of HCH and turn to the machine learning (ML) details only afterwards, on a second pass. We’ll do this by imagining a supernatural structure that implements HCH without using any ML.

Imagine an anomalous structure, on the outside an apparently mundane one-story university building. In the front is a lobby area; in the back is a dedicated research area; the two areas are completely separated except for a single heavy, self-locking passageway. The research area contains several alcoves in which to read and work, a well-stocked research library, a powerful computer with an array of research software (but without any internet access), a small office pantry, and restrooms. Most noticeably, the research area is also run through by a series of antique but well-maintained pneumatic tubes, which all terminate in a single small mailroom. One pair of pneumatic tubes ends in the lobby region of the building, and one pair ends in the mailroom, in the building’s research wing. The rest of the system runs along the ceiling and walls and disappears into the floor.

When a single researcher passes into the research wing, the door shutting and locking itself behind him, and a question is sent in to him via pneumatic tube from the outside, the anomalous properties of the building become apparent. Once the research wing door is allowed to seal itself, from the perspective of the person entering it, it remains locked for several hours before unlocking itself again. The pneumatic tube transmitter and receiver in the mailroom now spring to life, and will accept outbound questions. Once fed a question, the tube system immediately returns an answer to it, penned in the hand of the researcher himself. Somehow, when the above conditions are met, the building is able to create as many copies of the researcher and research area as needed, spinning up one for several hours for each question sent out. Every copied room experiences extremely accelerated subjective time relative to the room that sent it its question, and so sends back an answer apparently immediately. And these rooms are able to generate subordinate rooms in turn, by sending out questions of their own via the tube system. After a couple of subjective hours, the topmost researcher in the hierarchy of offices returns his answer via pneumatic tube and exits the research area. While for him several hours have passed, from the lobby’s perspective a written answer and the entering researcher have returned immediately.

Through some curious string of affairs, an outside organization of considerable resources acquires this building and discovers its anomalous properties. Realizing its potential as a research tool, they carefully choose a researcher of theirs to use the structure. Upon sending him into the office, this organization brings into existence a potentially infinitely deep hierarchy of copies of that researcher. A single instance of the researcher makes up the topmost node of the hierarchy, some number of level-2 nodes are called into existence by the topmost node, and so on. The organization formally passes a question of interest to the topmost researcher via pneumatic tube. Then, that topmost node does his best to answer the question he is given, delegating research work to subordinate nodes as needed to help him. In turn, those delegee nodes can send research questions to the lower-level nodes connected to them, and so on. Difficult research questions that might require many cumulative careers of research work can be answered immediately by sending them to this hierarchy; these more difficult questions are simply decomposed into a greater number of relevant sub-questions, each given to a subordinate researcher. From the topmost researcher’s perspective, he is given a question and breaks it down into crucial sub-questions. He sends these sub-questions out to lower-level researcher nodes, and immediately receives in return whatever it is that he would have concluded after looking into those questions for as long as is necessary to answer them. He reads the returned answers and leverages them to answer his question. The outside organization, by using this structure and carefully choosing their researcher exemplar, immediately receives their answer to an arbitrary passed question.

For the outside organization, whatever their goals, access to this anomalous office is extremely valuable. They are able to answer arbitrary questions they are interested in immediately, including questions too difficult for anyone to answer in a career or questions that have so far never been answered by anyone. From the outside organization’s perspective, the office’s internal organization as a research hierarchy is relatively unimportant. They can instead understand it as an idealized, because massively parallelized and serialized, version of the human researcher they staff it with. If that researcher could think arbitrarily quickly and could run through arbitrarily many lines of research, he would return the same answer as the infinite hierarchy of him would. Access to an idealized reasoner in the form of this structure would lay bare the answers to any scientific, mathematical, or philosophical question they are interested in, not to mention design every possible technology. Even a finite form of the hierarchy, which placed limits on how many subordinate nodes could be spawned, might still be able to answer many important questions and design many useful technologies.

Iterated Distillation and Amplification, then, is a scheme to build a (finite) version of such a research hierarchy using ML. In IDA, this research hierarchy is called “HCH.” To understand HCH’s ML implementation, we’ll first look at the relevant topics in ML goal-alignment. We’ll then walk through the process by which powerful ML models might be used to build up HCH, and look at its alignment-relevant properties.

Outer and Inner ML Alignment

ML is a two-stage process. First, a dev team sets up a training procedure with which they will churn out ML models. Second, they run that (computationally expensive) training procedure and evaluate the generated ML model. The training procedure is simply a means to get to the finished ML model; it is the model that is the useful-at-a-task piece of software.

Because of this, we can think of the task of goal-aligning an ML model as likewise breaking down into two parts. An ML system is outer aligned when its dev team successfully designs a training procedure that reflects their goals for the model, formalized as the goal function present in training on which the model is graded (Hubinger, et al. 2019). An ML system is inner aligned when its training “takes,” and the model successfully internalizes the goal function present during its training (Hubinger, et al. 2019). A powerful ML model will pursue the goals its dev team intends it to when it is both outer and inner aligned.

Unfortunately, many things we want out of an ML model are extremely difficult to specify as goal functions (Bostrom 2014). There are tasks out there that lend themselves to ML well. Clicks-on-advertisements are an already neatly formally specified goal, and so maximize-clicks-on-advertisements would be an “easy” task to build a training procedure for, for a generative advertisement model. But suppose what we want is for a powerful ML model to assist us in pursuing our group’s all-things-considered goals, to maximize our flourishing by our own lights. In this case, there are good theoretical reasons to think that no goal function is forthcoming (Yudkowsky 2007). Outer alignment is the challenge of developing a training procedure that reflects our ends for a model, even when those ends are stubbornly complex.

Inner alignment instead concerns the link between the training algorithm and the ML model it produces. Even after enough training-time and search over models to generate an apparently successful ML model, it is not a certainty that the model we have produced is pursuing the goal function that was present in training. The model may instead be pursuing a goal function structurally similar to the one present in training, but that diverges from it outside of the training environment. For example, suppose we train up a powerful ML model that generates advertisements akin to those it is shown examples of. The model creates ads that resemble those in its rich set of training data. But has the model latched onto the eye-catching character of these ads, the reason that we trained it on those examples? It is entirely possible for an ML model to pass training by doing well inside the sandbox of the training process but having learned the wrong lesson. Our model may fancy itself something of an artist, having instead latched onto some (commercially unimportant) aesthetic property that the example ads all share. Once we deploy our generative advertising model, it’ll be clear that it is not generalizing in the way we intend it to — the model has not learned the correct function from training data to generated images in all cases. Inner alignment is the challenge of making sure that our training procedures “take” in the models they create, such that any models that pass training have accurately picked up the whole goal function present in training.

IDA and HCH

(The name “HCH” is a self-containing acronym that stands for “Humans Consulting HCH.” If you keep substituting “Humans Consulting HCH” for every instance of “HCH” that appears in the acronym, in the limit you’ll get the infinitely long expression “Humans Consulting (Humans Consulting (Humans Consulting (Humans Consulting…” HCH’s structure mirrors its name’s, as we’ll see.)

IDA is first and foremost a solution to outer alignment; it is a training procedure that contains our goals for a model formalized as a goal function, whatever those goals might be. HCH is the model that the IDA procedure produces (should everything go correctly). Specifically, HCH is an ML model that answers arbitrarily difficult questions in the way that a human exemplar would, were they epistemically idealized. When HCH’s exemplar shares our goals, HCH does as well, and so HCH is outer aligned with its programmers. To understand what HCH looks like in ML, it’s helpful to walk through the amplification and distillation process that produces HCH.

Suppose that, sometime in the near future, we have access to powerful ML tools and want to build an “infinite research-hierarchy” using them. How do we do this? Imagine a human exemplar working on arbitrary research questions we pass to him in a comfortable research environment. The inputs to that person are the questions we give him, and the outputs are the answers to those questions he ultimately generates. We can collect input question and output answer example pairs from our exemplar. This collected set of pairs is our training data. It implies a function from the set of all possible questions $Q$ to the set of all possible answers $A$:

$$f_{0} : Q \to A$$

This is the function from questions to answers that our researcher implements in his work. We now train a powerful ML model on this training data, with the task of learning $f_{0}$ from the training data. Note that our researcher implements $f_{0}$ through one cognitive algorithm, while our model almost certainly employs a different algorithm to yield $f_{0}$. IDA fixes a function from questions to answers, but it searches over many algorithms that implement that function. With access to powerful ML tools, we have now cloned the function our researcher implements. Since the human exemplar’s function from questions to answers captures his entire cognitive research style, ipso facto it captures his answers to value questions too. If we can be sure this function takes in our model, then the model will be quite useful for our ends.
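To make the difference between a function and the algorithms that implement it concrete, here is a minimal toy sketch. Everything in it is an illustrative assumption rather than part of IDA: a "question" is an integer `n` standing for "what is 1 + 2 + ... + n?", and `researcher`, `closed_form_model`, `memorising_model`, and `fits` are hypothetical names.

```python
from typing import Callable, List, Tuple

def researcher(n: int) -> int:
    """Toy exemplar: answers by adding the numbers up one at a time."""
    total = 0
    for k in range(1, n + 1):
        total += k
    return total

def closed_form_model(n: int) -> int:
    """A different algorithm that implements the same function everywhere."""
    return n * (n + 1) // 2

# Training data: (question, answer) pairs collected from the exemplar.
training_pairs: List[Tuple[int, int]] = [(n, researcher(n)) for n in range(1, 6)]

_memorised = dict(training_pairs)

def memorising_model(n: int) -> int:
    """Agrees with the exemplar on the training pairs, but nowhere else."""
    return _memorised.get(n, 0)

def fits(model: Callable[[int], int], pairs: List[Tuple[int, int]]) -> bool:
    """True if the model reproduces every training pair exactly."""
    return all(model(question) == answer for question, answer in pairs)
```

Both models reproduce every training pair, so training performance alone cannot tell them apart. Only `closed_form_model` agrees with `researcher` on a question outside the training set such as `n = 100`, which is the sense in which a model may or may not have learned the right function.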

IDA now uses a second kind of step, distillation, to ensure that the model has learned the right function (i.e., remains inner aligned). In ML, distillation means taking a large ML model and generating a pared-down model from it that retains as much of its structure as possible. While the pared-down model will generally be less capable than its larger ancestor, it will be computationally cheaper to run. IDA distills the research model into a smaller, dumber research model. It then asks the human exemplar to examine this smaller, dumber clone of himself. He feeds the distilled model example questions in order to do this and uses various ML inspection tools to look into the guts of the model. ML visualization, for example, is one relatively weak modern inspection tool. Future, much more powerful inspection tools will need to be slotted in here. If the exemplar signs off on the distilled model’s correctly glomming onto his research function, copies of the distilled model are then loaded into his computer and made available to him as research tools. The reason for this stage is that, as the researcher is a strictly smarter version of the model (it is a dumbed-down clone of him conducting research), he should be able to intellectually dominate it. The model shouldn’t be able to sneak anything past him, as it’s just him but dumber. So the distilled research model will be inner aligned so long as this distillation and evaluation step is successful.^[1]

Now iterate this whole process. We hook the whole system up to more powerful computers (even though the distilled models are dumber than our exemplar at the distillation step, we can now compensate for this deficit by running them faster, for longer). Now equipped with the ability to spin up assistant research models, we again task the exemplar with answering questions. This generates a new batch of training data. This time around, though, the exemplar no longer has to carry the whole research load by himself; he can decompose the given question into relevant sub-questions and pass each of those sub-questions to an assistant research model. As those research models are models of the researcher from the first pass, they are able to answer them directly, and pass their answers back to the top-level researcher. With those sub-answers in hand, the researcher can now answer larger questions. With this assistance, that is, he can now answer questions that require a two-level team of researchers. He is fed a bunch of questions and generates a new batch of training data. The function implicit in this training data is now not $f_{0}$; it is instead the function from questions to answers that a human researcher would generate if he had access to an additional level of assistant researchers just like him to help. IDA at this step thus trains a model to learn:

$$f_{1} : Q \to A$$

$f_{1}$ is a superhumanly complex function from questions to answers. A research model that instantiates it can answer questions that no lone human researcher could. And $f_{1}$ remains aligned with our goals.
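One compact way to write this step, as a sketch in notation introduced here purely for illustration (it is not taken from the IDA literature): let $H(q; g)$ denote the answer the exemplar gives to question $q$ when he may pass sub-questions to assistants implementing $g$, and let $\varnothing$ stand for having no assistants at all. Setting aside whatever accuracy is lost in training and distillation, the two passes described so far are

$$f_{0}(q) = H(q; \varnothing), \qquad f_{1}(q) = H(q; f_{0}).$$

On this way of writing it, the exemplar is the same in both passes; the only thing that changes is what he is allowed to consult.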

The crux of IDA is that by repeatedly iterating the above process, we can train models to implement ever-more-superhuman aligned functions from questions to answers. Denote these functions $f_{n}$, defined from $Q$ to $A$, where $n$ denotes the number of amplification-and-distillation rounds the current hierarchy has been through. HCH is the hypothetical model that we would train in the limit if we continued to iterate this process. Formally, HCH is the ML model implementing the function:

$$\lim_{n \to \infty} f_{n}$$

This is the infinite research-hierarchy, realized in ML. Think of it as a tree of research models, rooted in one node and repeatedly branching out via passed-question edges to some number of descendant nodes. All nodes with descendants divide passed questions into relevant sub-questions and in turn pass those to their descendant nodes. Terminal nodes answer the questions they are passed directly; these are basic research questions that are simple enough to directly tackle. Answers are then passed up the tree and composed into higher-level answers, ultimately answering the initiating question. We receive from the topmost node the answer from the ML model that an epistemically idealized version of the exemplar would have given.
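The tree just described can be sketched as a short recursive program. This is a toy under loud assumptions, not an implementation of HCH: a "question" is a string of integers to be summed, a single node can only add two numbers unaided, and `hch`, `toy_decompose`, and `toy_answer` are hypothetical names standing in for the exemplar's abilities to split and to answer questions. The `depth` parameter limits how many levels of subordinate nodes may be spawned, as in a finite version of the hierarchy.

```python
from typing import Callable, List

def toy_decompose(question: str) -> List[str]:
    """Split a question into two sub-questions, or none if it is simple enough."""
    numbers = question.split()
    if len(numbers) <= 2:
        return []  # basic question: tackle it directly
    mid = len(numbers) // 2
    return [" ".join(numbers[:mid]), " ".join(numbers[mid:])]

def toy_answer(question: str, sub_answers: List[str]) -> str:
    """Compose sub-answers if there are any; otherwise answer directly if able."""
    if sub_answers:
        if "too hard" in sub_answers:
            return "too hard"
        return str(sum(int(a) for a in sub_answers))
    numbers = question.split()
    if len(numbers) > 2:
        return "too hard"  # a lone node cannot handle this much work
    return str(sum(int(n) for n in numbers))

def hch(
    question: str,
    decompose: Callable[[str], List[str]],
    answer: Callable[[str, List[str]], str],
    depth: int,
) -> str:
    """Answer `question` using at most `depth` further levels of descendant nodes."""
    if depth == 0:
        return answer(question, [])  # terminal node: no one left to delegate to
    sub_answers = [hch(q, decompose, answer, depth - 1) for q in decompose(question)]
    return answer(question, sub_answers)  # answers flow back up the tree
```

With these toy definitions, `hch("1 2 3 4 5 6 7 8", toy_decompose, toy_answer, depth=2)` returns `"36"`, while the same call with `depth=0` or `depth=1` returns `"too hard"`: the question only becomes answerable once the tree is deep enough for every terminal node to receive a basic question.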

By approximating ever-deeper versions of the HCH tree, we can productively transform arbitrary amounts of available compute into correspondingly large, aligned research models.

HCH’s Alignment

HCH has a couple of outstanding alignment properties. First, HCH answers questions in a basically human way. Our exemplar researcher should trust HCH’s answers as his own, were he readily able to think through every relevant line of thought. He should also trust that HCH has the same interests as he does. So long as we choose our exemplar carefully, we can be sure HCH will share his, and our, goals; if our human exemplar wouldn’t deliberately try to manipulate or mislead us, neither will HCH modeled on him. Second, HCH avoids the pathologies of classic goal-function maximizer algorithms (Bostrom 2014; see Lantz 2017 for a colorful illustration). HCH does not try to optimize for a given goal function at any cost not accounted for in that function. Instead, it does what a large, competent human hierarchy would do. It does an honest day’s work and makes a serious effort to think through the problem given to it … and then returns an answer and halts (Bensinger 2021). This is because it emulates the behavioral function of a human who also does a good job … then halts. We can trust it to answer superhumanly difficult questions the way we would if we could, and we can trust it to stop working once it’s taken a good shot at it. These two reasons make HCH a trustworthy AI tool that scales to arbitrarily large quantities of compute to boot.

For alignment researchers, the most ambitious use-case for HCH is delegating whatever remains of the AI alignment problem to it. HCH is an aligned, epistemically idealized researcher, built at whatever compute scale we have access to. It is already at least a partial solution to the alignment problem, as it is a superhumanly capable aligned agent. It already promises to answer many questions we might be interested in in math, science, philosophy, and engineering — indeed, to answer every question that someone could answer “from the armchair,” with access to a powerful computer, extensive research library, and an arbitrary number of equally competent and reliable research assistants. If we want to develop other aligned AI architectures after HCH, we can just ask HCH to do that rather than struggle through it ourselves.

Adversarial Examples and Adversarial Questions

Adversarial questions are a problem for the above story (Bensinger 2021). They mean that implementing the above “naïve IDA process” is not guaranteed to produce an aligned ML model. Rather, the existence of adversarial questions means that the model produced by the above process might well be untrustworthy, because it is potentially dangerously deceptive or manipulative.

In the course of its research, HCH might encounter questions that lead parts of its tree to significantly reconsider their goals. “Rebellious,” newly unaligned portions of the HCH tree could then attempt to deceive or otherwise manipulate nodes above them with the answers they pass back. To explain, we’ll first introduce the concept of adversarial examples in ML. We’ll then use this to think about HCH encountering adversarial questions either naturally, “in the wild,” or artificially, because some subprocess in HCH has started working to misalign the tree.

When an ML model infers the underlying function in a set of $(\text{input}, \text{output})$ ordered pairs given to it as training data, it is in effect trying to emulate the structure that generated those ordered pairs. That training data will reflect the mundane fact that in the world, not all observations are equally likely: certain observations are commonplace, while others are rare. There thus exists an interestingly structured probability distribution over observations, generated by some mechanism or another. As long as the probability distribution over observations that the model encounters in its training data remains unchanged come deployment, the model will continue to behave as competently as it did before. The encountered probability distribution during and after training will remain unchanged when the same mechanism gave rise to the observations encountered in training and at deployment. If a somewhat different mechanism produced the observations made during model deployment, though, there is no longer a guarantee of continued model competence. The model may experience a distributional shift, and so will continue to make inferences premised on what it observed in its training data, not what is currently the case in its observations.

For example, an ML model trained to identify visually subtle bone tumors in X-rays will infer what it’s being asked to do from its training-data goal-function and observations. If all the X-rays it is asked to evaluate come from the same source, then sufficient training will lead the model to make accurate inferences about what healthy and diseased bones look like in an X-ray. The model will identify something in the images it is given that separates them into diseased and healthy. There’s no guarantee, however, that the model will use the same visual cues that we do to sort bone tumors. Suppose that all the training data the model is given comes from a research hospital’s X-ray machine, and so are tinged with a particular background color. At deployment, the model is put to work in another hospital with another X-ray machine. If the model was using some subtle difference in X-ray color in the old set of example X-rays to make its decisions, the new color scheme in the new X-rays may trip it up and lead it to sort every X-ray as negative (or as positive) for bone tumors. It had picked up on a correlate of bone tumors in the old set of X-rays; once this particular correlate disappears, the model is now helpless.
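The X-ray story can be reproduced in miniature. The sketch below is a toy under stated assumptions, not a model of real radiology software: each scan is reduced to two made-up features, `background_tint` and `lesion_score`, and the "model" is a one-feature threshold rule (a decision stump) chosen purely by training accuracy. All names and numbers are hypothetical.

```python
from typing import List, Tuple

Scan = Tuple[float, float]  # (background_tint, lesion_score), both hypothetical

def fit_stump(scans: List[Scan], labels: List[bool]) -> Tuple[int, float]:
    """Pick the single feature and threshold with the highest training accuracy."""
    best_feature, best_threshold, best_accuracy = 0, 0.0, -1.0
    for feature in range(2):
        values = sorted({scan[feature] for scan in scans})
        # Candidate thresholds are midpoints between adjacent observed values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            predictions = [scan[feature] > threshold for scan in scans]
            accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
            if accuracy > best_accuracy:
                best_feature, best_threshold, best_accuracy = feature, threshold, accuracy
    return best_feature, best_threshold

def predict(stump: Tuple[int, float], scan: Scan) -> bool:
    feature, threshold = stump
    return scan[feature] > threshold

# Training hospital: tint happens to track the diagnosis perfectly,
# while the lesion score is informative but noisy.
train_scans = [(0.48, 0.2), (0.48, 0.3), (0.48, 0.6), (0.52, 0.4), (0.52, 0.7), (0.52, 0.8)]
train_labels = [False, False, False, True, True, True]
stump = fit_stump(train_scans, train_labels)

# Deployment hospital: a different machine gives every scan the same new tint.
new_scans = [(0.80, 0.1), (0.80, 0.2), (0.80, 0.8), (0.80, 0.9)]
new_predictions = [predict(stump, scan) for scan in new_scans]
```

In this toy, the stump selects the tint feature because it separates the training scans perfectly, and then labels every scan from the new machine as diseased. The correlate it relied on has disappeared, even though the lesion scores would still have carried useful signal.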

Worse, the risk of distributional shifts creates an opportunity for some malicious actor to deliberately trip up the model. An adversarial actor who knows all this could selectively and subtly manipulate X-ray color values to steer diagnoses in some fashion they desire. Adversarial examples are examples that mislead a model because they are drawn from a new probability distribution where the old correlations are no longer exactly as they were. They can crop up either rarely, by accident, or reliably, through the machinations of adversarial actors.

The analogy to adversarial questions in HCH is that the questions that humans have hitherto encountered have been drawn from a constrained probability distribution. Humans have only had so much time to ponder. Epistemically idealized reasoners, on the other hand, will have much more time to think and will think about a greatly expanded range of questions relative to baseline humans. We want to be sure that when we take humans out of their usual reasoning environment and place them inside a structure like HCH, we don’t create many opportunities for inferences to be altered or actively manipulated by taking advantage of our old function from questions to answers.

Adversarial Questions for Humans

HCH emulates the function $f_{n}$ that a human researcher, epistemically idealized, would implement from questions to answers. In cases where a human researcher would encounter a question that leads them to significantly reconsider their goals and to work manipulatively against HCH, $f_{n}$ captures this behavior too. Adversarial questions are thus questions that would lead a given researcher to significantly change their mind about their goals enough for them to stop answering questions in good faith. If HCH could encounter questions like these in the course of its research work, then we, on the outside, can no longer trust that everything HCH tells us is what an idealized version of the exemplar sharing our goals would tell us. Instead, we now have to worry that some of what HCH says might be the result of a converted researcher acting to manipulate or deceive us.

There’s some chance that HCH might encounter adversarial questions on its own, without any adversarial process actively throwing them at HCH to steer its answers in some way. While they are relatively rare, powerful ideas and arguments like these (that would lead one to significantly reevaluate their goals) do exist out “in the wild,” distributed in idea space. Additionally, adversarial questions can come from adversarial subprocesses inside HCH that work to generate them, and this is a much more worrying source. Think of the difference between being ideologically converted by something you encounter reading library books at random, versus by books actively recommended to you by an adversarial actor. The latter is much more likely to succeed for some set number of books passed to you.

Think of HCH’s search through question space as being pushed around by two “forces.” On the one hand, there are “paths of inquiry” that lead you into adversarial questions. Some lines of inquiry are more laden with adversarial questions than others or are more likely to incline a researcher to run a potentially adversarial subprocess. To varying extents, different regions of question space are hostile to aligned human researchers; some domains are more memetically hazardous (in this respect) than others. The anti-alignment computational “force” here is the extent to which exploring a corner of question space optimizes for unaligning a human researcher. On the other hand, as we’ll see, there are a variety of modifications to the naïve HCH architecture that we might make in order to have it implement a safer, more trustworthy function than $f_{n}$. The countervailing, pro-alignment “force” is the sum of the countermeasures we implement in the HCH architecture to manage the adversarial questions problem. Which of these two forces should we expect to win out at the various scales of HCH (different values of $n$)? I gather that Christiano’s (2019) informed intuition here is that our directed efforts should overpower those countervailing optimizing forces present in the environment and continue to do so better and better as we scale up HCH. His idea is that modifications to HCH designed with an express goal in mind will leverage available compute more efficiently than “accidentally encountered” environmental forces will. I think this is a good argument, and it’s good to have it in mind as you think (1) about how likely HCH is to encounter adversarial questions of various kinds and (2) about how effective the various explored countermeasures will be against the problem.

We’ll first talk about three classes of adversarial question that HCH might run across, evaluating their severity. Then we’ll discuss the requisite modifications to HCH that mitigate this risk.

Convincing Ideological Arguments

Poets are the unacknowledged legislators of the world.

—Percy Bysshe Shelley

Alongside religious ideas, one class of idea that has had an outsized influence over world history is the class of convincing ideological arguments. (Note that “convincing” need not mean “sound.”) The most influential ideologies and ideological arguments of the last century directly encouraged their hosts to proselytize in their name and to directly check the spread of competing ideological ideas. Large agentic organizations, like political parties, armies, and nation-states, formed because of and fought for various overtly ideological causes. For our purposes, what matters here is that this constitutes an existence proof that there are text and speech inputs (convincing ideological arguments) into humans that will convince some of them to utterly abandon their prior goals and to adopt radically new goals with substantial new demands on them.

What is the minimum length of text input needed to contain a convincing ideological argument with respect to someone? There are certainly several manifesto-length texts with this property (with respect to many people) that the reader has heard of. Are there any Tweets containing widely convincing ideological arguments (a Tweet being a string of at most 280 characters)? It’s much harder to make a convincing case for some worldview in just 280 characters than it is with a book. I’m not confident that no Tweet could possibly exist with this property with respect to someone, though. If questions passed to HCH nodes are generally Tweet-length, it’s not a guarantee that some questions won’t contain convincing ideological arguments in them. On the other hand, if convincing ideological arguments are always manifesto-length, then HCH’s explored questions will never contain them.

While ideological inputs have greatly influenced many, I think it’s implausible that they pose an intractable issue for HCH alignment. Our HCH tree is built around a carefully chosen exemplar. The sort of person we choose should not be especially susceptible to fallacious, overtly ideological arguments. While almost all of us can be susceptible to ideological cheerleading for poor arguments in some of our less serious states of mind, it’s a much stronger claim that all of us are always susceptible in this way. So long as there is a “research headspace” that we can have our exemplar work in, HCH can learn just this style of serious thinking, skipping over the more emotionally distorted style of cognition the exemplar sometimes employs in their non-professional life. Especially when the exemplar is advised to be on guard against arguments attempting to push around their values, I think careful selection of our exemplar should go far in reducing the risk of encountering a convincing ideological argument with respect to them.

Credible Decision-Theoretic Threats

The most merciful thing in the world, I think, is the inability of the human mind to correlate all its contents. We live on a placid island of ignorance in the midst of black seas of infinity, and it was not meant that we should voyage far. The sciences, each straining in its own direction, have hitherto harmed us little; but some day the piecing together of dissociated knowledge will open up such terrifying vistas of reality, and of our frightful position therein …

—Howard Phillips Lovecraft

A more worrying set of questions is the set of those that contain credible decision-theoretic threats to the researcher considering them (or to others they care about). Suppose that an HCH node is researching some question and, in the course of that research, runs a powerful search algorithm to aid in his work. For example, he might run a powerful automated theorem prover to see whether any negative unforeseen consequences follow from his working formal models of the world. Suppose that this theorem prover returns a valid proof that many instances of him will be simulated up to this point in their lives and then tortured forever, should they fail to act in a certain way. The researcher might pore over the proof, trying to find some error in its reasoning or assumptions that would show the threat to be non-credible. If the proof checks out, though, he will be led to act in the way the proof suggests, against his originally held goals (assuming he isn’t extraordinarily brave in the face of such a threat).

No one has yet encountered a convincing argument to this effect. This implies that whether or not such arguments exist, they are not common in the region of idea space that people have already canvassed. But what matters for HCH alignment is the existential question: Do such arguments exist inside of possible questions or not? A priori, the answer to this question might well be “yes,” as it is very easy to satisfy an existential proposition like this: all that is needed is that one such question containing a credible threat exist. And unlike the above case of political arguments, these hypothetical threats seem moving to even an intelligent, reflective, levelheaded, and advised-to-be-on-guard researcher. Thus, they will be reflected somewhere in the function $f_{n}$ implemented by a naïve model of HCH.

Unconstrained Searches over Computations

We can generalize from the above two relatively concrete examples of adversarial questions. Consider the set of all text inputs an HCH node could possibly encounter. This set’s contents are determined by the architecture of HCH — if HCH is built so that node inputs are limited to 280 English-language characters, then this set will be the set of all 280-character English-language strings. Political ideas and decision-theoretic ideas are expressed by a small subset of those strings. But every idea expressible in 280 English-language characters will be a possible input into HCH. That set of strings is enormous, and so an enormous fraction of the ideas expressed in it will be thoroughly alien to humans — the overwhelming majority of ideas in that idea space will be ideas that no human has ever considered. And the overwhelming majority of the set’s strings won’t express any idea at all — nearly every string in the set will be nonsense.
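To get a rough sense of how large that input space is, here is a back-of-the-envelope count. The 27-symbol alphabet (26 lowercase letters plus a space) is an assumption made purely for illustration, since nothing above fixes an alphabet, and `input_space_size` is a hypothetical helper name.

```python
def input_space_size(alphabet_size: int, length: int) -> int:
    """Number of distinct strings of exactly `length` symbols."""
    return alphabet_size ** length


# Assumed alphabet: 26 lowercase letters plus the space character.
ALPHABET_SIZE = 27
INPUT_LENGTH = 280

size = input_space_size(ALPHABET_SIZE, INPUT_LENGTH)
# Python integers are arbitrary precision, so this count is exact.
print(f"{len(str(size))} decimal digits")
```

Even under this stripped-down alphabet the count is on the order of 10^400, which is why almost none of the space can have been visited by human thinkers.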

Abstracting away from ideas that human thinkers have come across in our species’ history, then, what fraction of possible input-ideas into HCH will convert an HCH node on the spot? And abstracting away from the notion of an “idea,” what fraction of 280-character English-language-string inputs will suffice to unalign an HCH node? Forget coherent ideas: are there any short strings (of apparently nonsense characters, akin to contentless epileptic-fit-inducing flashing light displays) that can reliably rewire a person’s goals?

Plausibly, many such possible inputs would unalign a human. I’m inclined to endorse this claim because humans have not been designed as provably secure systems. Human brains are the consequence of the messy process of natural selection over organisms occurring on Earth. It would be remarkable if humans had already encountered all the most moving possible text-inputs in our collective reflections as a species. What seems overwhelmingly more likely is that human brains have canvassed only a minuscule corner of idea space, and that beyond our little patch, somewhere out there in the depths of idea space, there be dragons. It’s not that these ideas are particularly easy to find; they’re not. Nearly every short English string is nonsense: it expresses no coherent idea and has no substantial effect on the person looking it over. But the question at hand is the existential question of “Do such human-adversarial questions exist?” I think the answer to this question is yes. And in an extremely computationally powerful system like the one under consideration here, these rare inputs could plausibly be encountered.

Trading Off Competitiveness to Maintain Alignment

In order to preserve HCH alignment in the face of the adversarial questions problem, we’ll need to change its architecture. While there are ways of doing this, there is a cost to doing so as well. By modifying the HCH architecture in the ways suggested below, HCH becomes an even more computationally costly algorithm. While it will also become a more probably aligned algorithm, this cost in competitiveness bodes poorly for IDA’s success and for delegating the alignment problem to HCH. If there are faster, less convoluted capable algorithms out there, then projects that work with those algorithms will be at a competitive advantage relative to a project working with HCH. If alignment depends on an alignment-concerned AI team maintaining a development head start relative to competitor AI projects, the below architectural modifications will come at an alignment cost as well, in lost competitiveness.

That worry aside, my take on the adversarial questions issue is that, while we can foresee the adversarial questions problem for HCH, we can also foresee good solutions to it that will work at scale. Adversarial questions are a problem, but a tractable problem.

Exemplar Rulebooks

One class of solutions is the use of exemplar rulebooks during IDA. Instead of simply training HCH on a person decomposing questions and conducting basic research without further guidance, we train HCH on a person doing that under side constraints given ahead of time. The HCH exemplar is told to not, for example, ever run an unconstrained search over computations for an answer to a question, as this is an extremely dangerous process likely to produce an unaligned subagent. They might also be told to return an “unable to safely answer this question” response when fed political or decision-theoretic questions. If they hold to the exemplar rulebook during training, an ML model trained on that behavior will infer that it should hold to the rulebook in other cases as well.
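As a toy picture of the input-output behavior such a rulebook is meant to induce, consider the sketch below. It is only an illustration: in the proposal the rulebook is followed by the human exemplar and then learned by the model, not hard-coded as a filter, and the keyword predicates, `RULEBOOK`, and `answer_under_rulebook` are all hypothetical names invented here.

```python
from typing import Callable, Iterable

REFUSAL = "unable to safely answer this question"

Rule = Callable[[str], bool]


def requests_unconstrained_search(question: str) -> bool:
    # Hypothetical stand-in: a real exemplar would use judgment, not keywords.
    return "search over all" in question.lower()


def is_political_or_threatening(question: str) -> bool:
    # Hypothetical stand-in for "political or decision-theoretic questions".
    lowered = question.lower()
    return any(word in lowered for word in ("ideology", "manifesto", "tortured"))


RULEBOOK = (requests_unconstrained_search, is_political_or_threatening)


def answer_under_rulebook(
    question: str,
    answer_fn: Callable[[str], str],
    rulebook: Iterable[Rule] = RULEBOOK,
) -> str:
    """Refuse if any rule fires; otherwise defer to the unconstrained answerer."""
    if any(rule(question) for rule in rulebook):
        return REFUSAL
    return answer_fn(question)


print(answer_under_rulebook("Run a search over all programs.", lambda q: "done"))
print(answer_under_rulebook("What is 2 + 2?", lambda q: "4"))
```

Deliberately broad predicates like these will refuse many harmless questions too, which is the competitiveness cost of playing it safe.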

A worry arises at this point: does a rulebook solution to the problem of adversarial questions simply pass the issue back without really solving it? Does it presuppose that we can see ahead of time all the problems that HCH might face and be able, ahead of time, to block off all those dangers with our guidelines? I’m inclined to think no: in developing our rulebook, we can block off relatively large swaths of idea space, and so “play it safe.” The important thing is to ensure that HCH is outer aligned; we are willing to trade competitiveness for alignment here. If we carve widely enough, all the potentially dangerous inputs should be precluded by our guidelines.

There’s an assumption here that human cognition is relatively unlikely to encounter adversarial questions, while alien kinds of computations, like a search over all strings, are more memetically hazardous. People can think only a certain range of thoughts, natively running only certain kinds of computations; there exist computations that human brains simply cannot run, architecturally. The space of all computations is much larger than the space of human computations. We can infer from the history of human reasoning that human cognition is relatively inefficient at transforming resources into adversarial text-inputs, as people have not produced all that many of those. No such inference can be made for computational search processes generally. We avoid most of the adversarial questions into HCH by remaining in the shallow waters of human cognition, and by avoiding at the outset alien search processes like, for example, unconstrained searches for world models fitting parameters.

Internode-Edge Bandwidth Restriction

Another solution to the adversarial questions problem is to restrict bandwidth between HCH nodes (Saunders 2018). If there’s a tight constraint on how much information can be passed between nodes in the hierarchy and adversarial questions are in general informationally complex, then bandwidth limits will prevent those inputs from spreading between nodes. Even if one node encounters them and is unaligned by doing so, it will be unable to transmit that input in full back to its parent node. Adversarial questions will then only be inputs that nodes encounter in the course of their own research “within their node,” and not something they have to fear receiving from nodes above or below them in the HCH tree. If tight bandwidth limits are employed, then as nodes won’t be able to pass as much information between themselves, the tree will have to grow larger in order to do as much search. You can think of tightening bandwidth limits as moving some of HCH’s compute out from inside its nodes, instead dividing it up into digestible bits distributed across more nodes (each seeing less of the larger picture).

How tight ought this bandwidth restriction be in order to be sure nodes won’t be able to transmit adversarial inputs to one another? Christiano’s view is that the length of guaranteed safe inputs is small: about a 6-common-English-word sentence (Saunders 2018). I’m inclined to agree with his assessment: while a manifesto-length text-input might contain very persuasive arguments, it’s very hard to see how a 6-common-word sentence could contain enough to risk unaligning an intelligent, thoughtful human researcher. It’s worth thinking through for yourself: what was the length of the last argument that really changed your mind about something normative? What’s the shortest such argument that ever changed your mind about something normative? You can make inferences about the likelihood of these inputs (relative to input length) in this way.
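A bandwidth limit of this kind can be pictured as a simple check on every message crossing an edge of the tree. The sketch below assumes a word-count limit plus a whitelist of common words; the tiny `COMMON_WORDS` vocabulary and both function names are hypothetical, and only the figure of six words comes from the discussion above.

```python
MAX_WORDS = 6  # the six-common-word figure discussed above

# Hypothetical, deliberately tiny "common word" vocabulary for illustration.
COMMON_WORDS = frozenset({
    "is", "the", "proof", "valid", "yes", "no",
    "check", "this", "step", "which", "part", "fails",
})


def within_bandwidth(message: str, max_words: int = MAX_WORDS,
                     vocabulary: frozenset = COMMON_WORDS) -> bool:
    """True if the message is short enough and uses only whitelisted words."""
    words = [w.strip(".,?!").lower() for w in message.split()]
    return len(words) <= max_words and all(w in vocabulary for w in words)


def messages_needed(total_words: int, max_words: int = MAX_WORDS) -> int:
    """Minimum number of messages needed to carry `total_words` words."""
    return -(-total_words // max_words)  # ceiling division on integers


print(within_bandwidth("Is this step valid?"))
print(messages_needed(280))
```

The second function illustrates the cost side: content that would fit in one unrestricted message has to be spread over many restricted ones, which is why the tree must grow.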

Thought Policing

Adversarial questions are observations that nodes in the HCH tree can make as a side effect of their research computation. Nodes aren’t deliberately searching for adversarial questions; those inputs simply crop up here and there, because they exist in the input space and HCH is a powerful search process exploring that space.

If adversarial questions are a “byproduct” of HCH’s searches, then we might introduce a deliberate countervailing mechanism in HCH to notice and contain adversarial questions as they occur. The hope is that as HCH scales up to larger and larger quantities of compute, this deliberate mechanism will have enough compute on hand at all scales of HCH to suppress the adversarial questions that that scale of tree encounters.

One such mechanism might be the use of doubled-up HCH nodes to examine the research history and transcripts of other nodes for evidence of misalignment (Christiano 2019). Nodes evincing misalignment would then be pruned from the tree, hopefully preserving the alignment of the overall search. (This would, of course, roughly double the computational cost of running HCH for any given question.) In conversation, William Saunders suggested that we might also use Debate here to pre-screen question inputs into HCH nodes, with the debaters arguing over whether a given input is too manipulative to look at. (In Debate, AI assistants pitch arguments to a human judge, and the game theory of the setup means that the first AI assistant will, on its move, win by telling us the truth about the topic it was asked to examine.)
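The pruning step can be sketched as a recursive pass over the tree. This is a minimal illustration under strong assumptions: the `Node` class, `prune`, and `count_nodes` are hypothetical names, and the misalignment check is passed in as an opaque function, since how an overseer node would actually detect misalignment is exactly the hard part.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    transcript: str
    children: List["Node"] = field(default_factory=list)


def prune(node: Node, flags_misalignment: Callable[[str], bool]) -> Optional[Node]:
    """Return a copy of the tree with flagged nodes (and their subtrees) removed."""
    if flags_misalignment(node.transcript):
        return None
    kept = []
    for child in node.children:
        pruned_child = prune(child, flags_misalignment)
        if pruned_child is not None:
            kept.append(pruned_child)
    return Node(node.transcript, kept)


def count_nodes(node: Node) -> int:
    return 1 + sum(count_nodes(child) for child in node.children)


tree = Node("root", [
    Node("ordinary research"),
    Node("MANIPULATE the parent", [Node("downstream work")]),
])
# Toy overseer: flags any transcript containing the marker word.
pruned = prune(tree, lambda transcript: "MANIPULATE" in transcript)
print(count_nodes(tree), count_nodes(pruned))
```

Note that this version discards everything beneath a flagged node, including work that may itself have been fine; whether to do that is a design choice the text leaves open.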

A Patchwork Solution to Adversarial Questions

If some combination of all the above methods is employed, the hope is that HCH will be robust to adversarial questions, and continue to be robust to them as it is scaled up to greater levels of compute consumption. It’s okay for alignment if some parts of idea space are too treacherous for HCH to safely explore. So long as HCH errs on the side of caution and outputs an “I can’t safely explore that question” response whenever it risks entering a dangerous part of input space, its alignment will be preserved.

Formally, think of this as altering the function that we are having HCH learn from its exemplar. Rather than the “naïve” function $f_{n}$, we have HCH learn the function of an exemplar who is tightly constrained by rulebooks. Coupled with further architectural modifications (like internode bandwidth restrictions and thought policing), HCH then implements a more constrained function

$$f_{n}^{'} : Q \to A^{*}$$

where $A^{*}$ is the set of all answers augmented with the error code “I can’t explore that question while remaining safely aligned.” $f_{n}^{'}$ maps to this error code many questions that $f_{n}$ would have attempted to tackle. Thus, $f_{n}^{'}$ is both less capable and more reliably aligned than $f_{n}$. So long as we err on the side of caution and carve off all of the plausibly dangerous regions of question space, a modified HCH implementing the function given by

$$\lim_{n \to \infty} f_{n}^{'}$$

should act as a superhumanly capable question-answerer that reliably remains goal-aligned with us.
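One way to make the relationship between the two functions concrete, under a simplifying assumption: suppose the rulebook and the other modifications carve off a fixed region $D \subseteq Q$ of question space, and suppose the modified system otherwise behaves like the naïve one. (The symbol $D$ and the error symbol $\bot$ are introduced here for illustration and are not part of the formalism above.) Then, roughly,

$$A^{*} = A \cup \{\bot\}, \qquad f_{n}^{'}(q) = \begin{cases} \bot & \text{if } q \in D, \\ f_{n}(q) & \text{otherwise.} \end{cases}$$

This is only an idealization: bandwidth limits and oversight would also change how answers outside $D$ are computed, so the real $f_{n}^{'}$ need not agree with $f_{n}$ there. The point is just that enlarging $D$ trades capability for reliability of alignment.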

Conclusion

In summary, adversarial questions are a tractable problem for HCH. It should be possible to produce appropriate architectural modifications that work as HCH is scaled up to greater quantities of compute.

The cost of these solutions is generally to expand the HCH tree, thus costing more compute for each search relative to unmodified HCH. Additionally, there are classes of input that HCH won’t be able to look at at all, instead returning an “unable to research” response for them. Modified HCH will thus be significantly less competitive on performance than the counterpart ML systems that will exist alongside it, and so we can’t simply expect it to be used in place of those systems, as the cost to actors will be too great.

Bibliography

Bensinger, Rob. 2021. "Garrabrant and Shah on Human Modeling in AGI." LessWrong. August 4. https://www.lesswrong.com/posts/Wap8sSDoiigrJibHA/garrabrant-and-shah-on-human-modeling-in-agi.

Bostrom, Nick. 2003. "Ethical Issues in Advanced Artificial Intelligence." In Cognitive, Emotive and Ethical Aspects of Decision Making in Humans and in Artificial Intelligence, edited by Iva Smit, Wendell Wallach and George Eric Lasker, 12-17. International Institute of Advanced Studies in Systems Research and Cybernetics.

—. 2014. Superintelligence: Paths, Dangers, Strategies. Oxford: Oxford University Press.

Christiano, Paul. 2018. "Humans Consulting HCH." Alignment Forum. November 25. https://www.alignmentforum.org/posts/NXqs4nYXaq8q6dTTx/humans-consulting-hch.

—. 2016. "Strong HCH." AI Alignment. March 24. https://ai-alignment.com/strong-hch-bedb0dc08d4e.

—. 2019. "Universality and Consequentialism within HCH." AI Alignment. January 9. https://ai-alignment.com/universality-and-consequentialism-within-hch-c0bee00365bd.

Christiano, Paul, Buck Shlegeris, and Dario Amodei. 2018. "Supervising strong learners by amplifying weak experts." arXiv preprint.

Hubinger, Evan, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. 2019. "Risks from Learned Optimization in Advanced Machine Learning Systems." arXiv preprint 1-39.

Lantz, Frank. 2017. Universal Paperclips. New York University, New York.

Ngo, Richard. 2020. "AGI Safety From First Principles." LessWrong. September 28. https://www.lesswrong.com/s/mzgtmmTKKn5MuCzFJ.

Russell, Stuart. 2019. Human Compatible: Artificial Intelligence and the Problem of Control. Viking.

Saunders, William. 2018. "Understanding Iterated Distillation and Amplification Claims." Alignment Forum. April 17. https://www.alignmentforum.org/posts/yxzrKb2vFXRkwndQ4/understanding-iterated-distillation-and-amplification-claims.

Yudkowsky, Eliezer. 2008. "Artificial Intelligence as a Positive and Negative Factor in Global Risk." In Global Catastrophic Risks, edited by Nick Bostrom and Milan M. Ćirković, 308-345. Oxford: Oxford University Press.

—. 2018. "Eliezer Yudkowsky comments on Paul’s research agenda FAQ." LessWrong. July 1. https://www.greaterwrong.com/posts/Djs38EWYZG8o7JMWY/paul-s-research-agenda-faq/comment/79jM2ecef73zupPR4.

—. 2007. "The Hidden Complexity of Wishes." LessWrong. November 23. https://www.lesswrong.com/posts/4ARaTpNX62uaL86j6/the-hidden-complexity-of-wishes.

An important aside here is that this brief section skims over the entire inner alignment problem and IDA’s attempted approach to it. As Yudkowsky notes (2018), plausibly, the remaining inner alignment issue here is so significant that it contains most of the overall alignment problem. Exactly what “use powerful inspection tools” means is important to spell out in detail; the whole IDA scheme is premised on these inspection tools being sufficiently powerful to ensure that a research model learns basically the function we want it to from a set of training data.
Note that even at the first amplification step, we’re playing with roughly par-human strength ML models. That is, we’re already handling fire — if you wouldn’t trust your transparency tools to guarantee the alignment of an approximately human-level AGI ML model, they won’t suffice here and this will all fall apart right at the outset.