# 7

Note: This post came out of a conversation with Geoffrey Irving and Buck Shlegeris.

Epistemic Status: I suspect Paul has already thought of most or all of the ideas presented here, though I nevertheless found the exercise of carefully specifying an IDA implementation helpful and suspect others may find reading it helpful as well.

This is a proposal for how to train a machine learning model to approximate HCH using Iterated Distillation and Amplification (IDA). This particular proposal came out of a desire to use a debate-like adversary to improve the amplification process, and the primary goal of this proposal is to show how one could do that. Though I have tried to retain a lot of the relevant detail, I have made two simplifications to make this proposal easier to specify: I am attempting to approximate something closer to weak HCH rather than strong HCH, and I am only allowing the generation of two subquestions at a time. I am confident that these simplifications could easily be dropped, though doing so here would only make the presentation more complicated.

Before I proceed, I want to make one final note: this is not a proposal for how to build an aligned AGI. I think there are still a whole bunch of issues that would prevent this proposal from actually working.

## Definitions

• Let $Q$ be the set of all questions in natural language.
• Let $A$ be the set of all answers in natural language.
• Let $T = A + Q \times Q$ be the sum type of either $A$ or $Q \times Q$, representing either an answer to the given question or two subquestions to help answer it.
• Let $H(q \in Q) \in A$ be the answer that a human gives to the given question.
• Let $H_q(q \in Q) \in T$ be the answer or subquestion pair generated by a human when asked what to do with the given question.
• Let $H_a(q \in Q, a_1 \in A, a_2 \in A) \in T$ be the answer or two subquestions generated by a human to the question $q$ when given the answers $a_1$ and $a_2$ to two subquestions related to that question.
• Let $M(q \in Q)$ be a model (the training procedure for which we will describe below) from questions to a probability distribution over strings representing answers. Specifically, we will implement the probability distribution by having our model output an embedding vector which yields the probability distribution when fed into some trained language model (by repeatedly conditioning on previous characters and multiplying all the conditional probabilities).
• Let $M_q(q \in Q)$ be the function from questions to an embedding vector representing a distribution over answers or subquestions, generated by asking $M$ what to do with the given question. For the two-subquestion case we enforce that the distribution is symmetric with respect to interchange of the subquestions.
• Let $M_a(q \in Q, a_1 \in A, a_2 \in A)$ be the function from a question and two answers to an embedding vector representing a distribution over answers or subquestions, generated by asking $M$ to integrate the given subquestion answers into an answer to the original question. We again enforce symmetry with respect to interchange of the subquestions.
• Let $\mathrm{Adv}(q \in Q, a \in A) \in \mathbb{R}$ be an adversary model which we will train (as described below) to predict how good the given answer is to the given question.
• Let $\mathrm{Adv}_q$ be an adversary model for $M_q$ generated by calling $\mathrm{Adv}$ using the same transformation as used to implement $M_q$ from $M$.
• Let $\mathrm{Adv}_a$ be an adversary model for $M_a$ generated by calling $\mathrm{Adv}$ using the same transformation as used to implement $M_a$ from $M$.
• Let $G$ be some Gumbel random variable such that each use of $G$ below is its own independent sample from a Gumbel distribution. We use a Gumbel random variable so that we can use the Gumbel-max trick to effectively do Boltzmann exploration.
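To make the last definition concrete, here is a minimal sketch of the Gumbel-max trick (the function names are my own; a standard Gumbel sample is $-\log(-\log U)$ for $U$ uniform on $(0, 1)$):

```python
import math
import random
from collections import Counter

def gumbel() -> float:
    """One sample from a standard Gumbel distribution."""
    u = min(max(random.random(), 1e-12), 1.0 - 1e-12)  # clamp away from 0 and 1
    return -math.log(-math.log(u))

def gumbel_max_choice(scores, temperature=1.0):
    """Boltzmann exploration via the Gumbel-max trick: perturbing each score
    with independent Gumbel noise and taking the argmax picks index i with
    probability softmax(scores / temperature)[i]."""
    noisy = [s / temperature + gumbel() for s in scores]
    return max(range(len(scores)), key=noisy.__getitem__)

# Sample frequencies approach the softmax probabilities: for scores [1, 2] at
# temperature 1, index 1 should win with probability e^2 / (e^1 + e^2) ~= 0.731.
random.seed(0)
counts = Counter(gumbel_max_choice([1.0, 2.0]) for _ in range(10_000))
frequency = counts[1] / 10_000
```

Taking a noisy argmax rather than a hard argmax is what lets the amplification procedure below occasionally explore branches the adversary scores lower.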

## Amplification

The goal here is to use the adversary to determine which portions of the HCH tree we actually go down, so as to maximize our sample efficiency with respect to calls to the human $H$. Specifically, the amplification procedure is as follows:

1. Generate some question $q$ either from the training distribution or from the output of some adversary.

2. Call $H_q(q)$. If we get an answer $a$, let $a^* = a$ and skip to the end. Otherwise, let $(q_1, q_2) = H_q(q)$.

3. Call $\mathrm{Adv}(q_1, M(q_1)) + G$ and $\mathrm{Adv}(q_2, M(q_2)) + G$ to generate scores $s_1$ and $s_2$.

4. Without loss of generality, let $s_1 \le s_2$ and recurse on $q_1$ (by setting $q = q_1$ in the subcall) to generate $a_1$ (from the $a^*$ of the subcall), and let $a_2 = M(q_2)$. That is, we spend human effort on the branch where the adversary thinks the model's answer is worst and let the model answer the other branch directly.

5. Call $H_a(q, a_1, a_2)$. If we get an answer $a$, let $a^* = a$. Otherwise, let $(q_1, q_2) = H_a(q, a_1, a_2)$ and return to step 3.

This procedure should eventually produce an answer $a^*$, though we then need to compute the following losses for $M$, $M_q$, $M_a$, $\mathrm{Adv}$, $\mathrm{Adv}_q$, and $\mathrm{Adv}_a$ (where $d$ measures the loss of a model's output distribution against a target string, e.g. negative log likelihood, and each loss is computed at every node of the tree where the corresponding call was made):

• $L_M = d(M(q),\, a^*)$
• $L_{M_q} = d(M_q(q),\, H_q(q))$
• $L_{M_a} = d(M_a(q, a_1, a_2),\, H_a(q, a_1, a_2))$
• $L_{\mathrm{Adv}} = \big(\mathrm{Adv}(q, M(q)) + L_M\big)^2$
• $L_{\mathrm{Adv}_q} = \big(\mathrm{Adv}_q(q, M_q(q)) + L_{M_q}\big)^2$
• $L_{\mathrm{Adv}_a} = \big(\mathrm{Adv}_a(q, a_1, a_2, M_a(q, a_1, a_2)) + L_{M_a}\big)^2$

so that each adversary is trained to predict the negative of the corresponding model loss, with high scores meaning good outputs. Note that some of these losses are actually for the same model, since $M_q$ and $M_a$ are implemented internally by calling $M$, and $\mathrm{Adv}_q$ and $\mathrm{Adv}_a$ are implemented internally by calling $\mathrm{Adv}$.
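Putting steps 1–5 together, the control flow of a single amplification call can be sketched as follows. This is a toy sketch, not the actual implementation: answers here are plain strings rather than distributions, and `H_q`, `H_a`, `M`, and `Adv` are hypothetical callables standing in for the human, the model, and the adversary.

```python
import math
import random
from dataclasses import dataclass
from typing import Callable, Tuple, Union

Question = str
Answer = str
# T = A + Q x Q: either a final answer or a pair of subquestions.
Node = Union[Answer, Tuple[Question, Question]]

def gumbel() -> float:
    """One standard Gumbel sample, for Boltzmann-style exploration."""
    u = min(max(random.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))

@dataclass
class Amplifier:
    H_q: Callable[[Question], Node]                  # human: answer q or split it
    H_a: Callable[[Question, Answer, Answer], Node]  # human: integrate subanswers
    M: Callable[[Question], Answer]                  # current model
    Adv: Callable[[Question, Answer], float]         # predicted answer quality

    def amplify(self, q: Question) -> Answer:
        node = self.H_q(q)                               # step 2
        while not isinstance(node, str):                 # loop of steps 3-5
            q1, q2 = node
            s1 = self.Adv(q1, self.M(q1)) + gumbel()     # step 3
            s2 = self.Adv(q2, self.M(q2)) + gumbel()
            if s1 > s2:                                  # step 4: amplify the
                q1, q2 = q2, q1                          # lower-scoring branch
            a1 = self.amplify(q1)                        # human-amplified answer
            a2 = self.M(q2)                              # model answers the rest
            node = self.H_a(q, a1, a2)                   # step 5
        return node
```

For instance, on a toy task where questions are digit strings, the "human" splits a string in half and adds subanswers, and the model sums digits directly, `amplify` reproduces the digit sum while only recursing down one branch at each node.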

## Distillation

For the distillation procedure, we want to improve $M$ by training it to approximate its own $M$-consulting-$M$ tree. Specifically:

1. Generate some question $q$ as in amplification.

2. Call $M_q(q)$. If we get an answer $a$, let $a^* = a$ and skip to the end. Otherwise, let $(q_1, q_2) = M_q(q)$.

3. Let $a_1 = M(q_1)$ and $a_2 = M(q_2)$.

4. Call $M_a(q, a_1, a_2)$. If we get an answer $a$, let $a^* = a$. Otherwise, let $(q_1, q_2) = M_a(q, a_1, a_2)$ and return to step 3.

Then, compute the following losses for $M$ and $\mathrm{Adv}$ (with $d$ as in amplification):

• $L_M = d(M(q),\, a^*)$
• $L_{\mathrm{Adv}} = \big(\mathrm{Adv}(q, M(q)) + L_M\big)^2$
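The rollout that produces the distillation target $a^*$ can be sketched in the same style as before (again a toy sketch: answers are plain strings, and the three arguments are hypothetical callables standing in for samples from the model's distributions):

```python
def distill_rollout(M, M_q, M_a, q):
    """One rollout of the model's own tree (distillation steps 2-4): M_q
    proposes an answer or a pair of subquestions, M answers both subquestions,
    and M_a integrates them, looping until an answer comes out. The returned
    answer is the target that M(q) is then trained toward. Answers are str
    and subquestion pairs are 2-tuples."""
    node = M_q(q)                  # step 2
    while not isinstance(node, str):
        q1, q2 = node
        a1, a2 = M(q1), M(q2)      # step 3: both branches answered by M itself
        node = M_a(q, a1, a2)      # step 4
    return node
```

Note the contrast with amplification: here no branch is recursively amplified and no human is consulted, so the tree is only one level deep at each loop iteration and every subanswer comes from $M$.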

## Iteration

Finally, we want to iterate the above procedure by replacing $H$, $H_q$, and $H_a$ in the amplification procedure with versions that consult the model whenever the adversary is sufficiently confident in it. First, let $\epsilon$ be some confidence threshold. Then, we will define the following primed $H$s:

• Let $H'(q) = M(q)$ if $\mathrm{Adv}(q, M(q)) \ge \epsilon$ and $H(q)$ otherwise.
• Let $H'_q(q) = M_q(q)$ if $\mathrm{Adv}_q(q, M_q(q)) \ge \epsilon$ and $H_q(q)$ otherwise.
• Let $H'_a(q, a_1, a_2) = M_a(q, a_1, a_2)$ if $\mathrm{Adv}_a(q, a_1, a_2, M_a(q, a_1, a_2)) \ge \epsilon$ and $H_a(q, a_1, a_2)$ otherwise.
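One way to read these primed definitions in code (a sketch; `primed` is a hypothetical helper, and the adversary score is interpreted as predicted answer quality, as in the definitions above):

```python
def primed(M, H, Adv, epsilon):
    """Build H' from M, H, and Adv: consult the model when the adversary
    scores its output at or above the confidence threshold epsilon, and
    fall back to the human otherwise. The same construction gives H'_q
    and H'_a from (M_q, H_q, Adv_q) and (M_a, H_a, Adv_a)."""
    def H_prime(q):
        a = M(q)
        return a if Adv(q, a) >= epsilon else H(q)
    return H_prime
```

With a high threshold the primed functions behave like the original humans; as the model and adversary improve, more and more calls get answered by the model, which is what makes the iteration cheaper each round.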