A Concrete Proposal for Adversarial IDA

evhub

Note: This post came out of a conversation with Geoffrey Irving and Buck Shlegeris.

Epistemic Status: I suspect Paul has already thought of most or all of the ideas presented here, though I nevertheless found the exercise of carefully specifying an IDA implementation helpful and suspect others may find reading it helpful as well.

This is a proposal for how to train a machine learning model to approximate HCH using Iterated Distillation and Amplification (IDA). This particular proposal came out of a desire to use a debate-like adversary to improve the amplification process, and the primary goal of this proposal is to show how one could do that. Though I have tried to retain a lot of the relevant detail, I have made two simplifications to make this proposal easier to specify: I am attempting to approximate something closer to weak HCH rather than strong HCH and I am only allowing the generation of two subquestions at a time. I am confident that those simplifications could easily be dropped, though I think doing so here would only make this presentation more complicated.

Before I proceed, I want to make one final note: this is not a proposal for how to build an aligned AGI. I think there are still a whole bunch of issues that would prevent this proposal from actually working.

Definitions

We will start with some initial definitions:

Let $Q$ be the set of all questions in natural language.
Let $A$ be the set of all answers in natural language.
Let $M$ be the sum type of either $Q \times Q$ or $A$ representing either an answer to the given question or two subquestions to help answer it.
Let $H : Q \to A$ be the answer that a human gives to the given question.
Let $H_{fan out} : Q \to M$ be the answer or subquestion pair generated by a human when asked what to do with the given question.
Let $H_{fan in} : Q \times (Q \times A) \times (Q \times A) \to M$ be the answer or two subquestions generated by a human to some question when given answers to two subquestions related to that question.
Let $M L : Q \to Δ (A)$ be a model (the training procedure for which we will describe below) from questions to a probability distribution over strings representing answers. Specifically, we will implement the probability distribution by having our model output an embedding vector which yields the probability distribution when fed into some trained language model (by repeatedly conditioning on previous characters and multiplying all the conditional probabilities).
Let $M L_{fan out} : Q \to Δ (M)$ be the function from questions to an embedding vector representing a distribution over answers or subquestions generated by asking $M L$ what to do with the given question. For the two subquestion case we enforce that the distribution is symmetric wrt interchange of the subquestions.
Let $M L_{fan in} : Q \times (Q \times A) \times (Q \times A) \to Δ (M)$ be the function from a question and two answered subquestions to an embedding vector representing a distribution over answers or subquestions generated by asking $M L$ to integrate the given subquestion answers into an answer to the original question. We again enforce symmetry wrt interchange of the subquestions.
Let $Adv : Q \times A \to R$ be an adversary model which we will train (as described below) to predict how good the given answer is to the given question.
Let ${Adv}_{fan out} : Q \times M \to R$ be an adversary model for $M L_{fan out}$ generated by calling $Adv$ using the same transformation as used to implement $M L_{fan out}$ from $M L$ .
Let ${Adv}_{fan in} : Q \times (Q \times A) \times (Q \times A) \times M \to R$ be an adversary model for $M L_{fan in}$ generated by calling $Adv$ using the same transformation as used to implement $M L_{fan in}$ from $M L$ .
Let $ϵ$ be some Gumbel random variable such that each use of $ϵ$ below is its own independent sample from a Gumbel distribution. We use a Gumbel random variable so that we can use the Gumbel-max trick to effectively do Boltzmann exploration.
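As a rough illustration of the Gumbel-max trick mentioned in the last definition, here is a minimal sketch using only the Python standard library. The function names (`sample_gumbel`, `gumbel_max_choice`, `empirical_frequencies`) are hypothetical and not part of the proposal; the sketch only demonstrates that taking the argmax of scores plus independent Gumbel noise samples from the Boltzmann (softmax) distribution over those scores, at temperature 1.

```python
import math
import random
from collections import Counter
from typing import List, Sequence


def sample_gumbel(rng: random.Random) -> float:
    """Draw one standard Gumbel(0, 1) sample by inverse transform sampling."""
    u = rng.random()
    while u == 0.0:  # log(0) is undefined, so redraw in that (very rare) case
        u = rng.random()
    return -math.log(-math.log(u))


def gumbel_max_choice(scores: Sequence[float], rng: random.Random) -> int:
    """Return the index maximising scores[i] + eps_i, with independent eps_i."""
    noisy = [s + sample_gumbel(rng) for s in scores]
    return max(range(len(noisy)), key=noisy.__getitem__)


def softmax(scores: Sequence[float]) -> List[float]:
    """Boltzmann distribution at temperature 1 over the given scores."""
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def empirical_frequencies(scores: Sequence[float], n: int = 20000, seed: int = 0) -> List[float]:
    """Estimate how often each index wins under the Gumbel-max rule."""
    rng = random.Random(seed)
    counts = Counter(gumbel_max_choice(scores, rng) for _ in range(n))
    return [counts[i] / n for i in range(len(scores))]
```

Comparing `empirical_frequencies(scores)` with `softmax(scores)` for a few score vectors should show close agreement, which is the sense in which adding $ϵ$ to a score and comparing amounts to Boltzmann exploration.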

Amplification

The goal here is to use the adversary to determine which portions of the HCH tree we actually go down to maximize our sample efficiency wrt $H$ . Specifically, the amplification procedure is as follows:

1. Generate some question $Q \in Q$ either from the training distribution or from the output of some adversary.

2. Call $H_{fan out} (Q)$ . If we get an answer, let $A = H_{fan out} (Q)$ and skip to the end. Otherwise, let $S Q_{1}, S Q_{2} = H_{fan out} (Q)$ .

3. Call $Adv$ to generate scores $S_{1} = Adv (S Q_{1}, M L (S Q_{1}))$ and $S_{2} = Adv (S Q_{2}, M L (S Q_{2}))$ .

4. Without loss of generality, let $S_{1} + ϵ \leq S_{2} + ϵ$ and recurse on $S Q_{1}$ (by setting $Q = S Q_{1}$ in the subcall) to generate $A_{1}$ (from the $A$ of the subcall) and let $A_{2} = M L (S Q_{2})$ .

5. Call $H_{fan in} (Q, (S Q_{1}, A_{1}), (S Q_{2}, A_{2}))$ . If we get an answer, let $A = H_{fan in} (Q, (S Q_{1}, A_{1}), (S Q_{2}, A_{2}))$ . Otherwise, let $S Q_{1}, S Q_{2} = H_{fan in} (Q, (S Q_{1}, A_{1}), (S Q_{2}, A_{2}))$ and return to step 3.
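The five steps above can be sketched in Python as follows. This is only an illustration under several assumptions: answers are plain strings, the sum type $M$ is represented as either a string (an answer) or a pair of strings (two subquestions), and `ml` returns a single sampled answer rather than a distribution. The toy task (adding integers written as `"1+2+3"`) and all `toy_*` functions are hypothetical stand-ins for $H_{fan out}$, $H_{fan in}$, $M L$ and $Adv$.

```python
import math
import random
from typing import Callable, Tuple, Union

Answer = str
SubQuestions = Tuple[str, str]
Move = Union[Answer, SubQuestions]  # stand-in for the sum type M


def gumbel(rng: random.Random) -> float:
    """One independent standard Gumbel sample (a fresh epsilon)."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return -math.log(-math.log(u))


def amplify(q: str, h_fan_out: Callable, h_fan_in: Callable,
            ml: Callable, adv: Callable, rng: random.Random) -> Answer:
    move: Move = h_fan_out(q)                        # step 2
    while not isinstance(move, str):                 # still a pair of subquestions
        sq1, sq2 = move
        s1 = adv(sq1, ml(sq1)) + gumbel(rng)         # step 3, with exploration noise
        s2 = adv(sq2, ml(sq2)) + gumbel(rng)
        if s1 > s2:                                  # step 4: relabel so sq1 scores lower
            sq1, sq2 = sq2, sq1
        a1 = amplify(sq1, h_fan_out, h_fan_in, ml, adv, rng)  # recurse on the weak spot
        a2 = ml(sq2)                                 # trust the model on the other one
        move = h_fan_in(q, (sq1, a1), (sq2, a2))     # step 5
    return move


# Hypothetical toy task: "human" splits sums in half and adds partial results.
def toy_h_fan_out(q: str) -> Move:
    ts = q.split("+")
    if len(ts) == 1:
        return ts[0]
    mid = len(ts) // 2
    return "+".join(ts[:mid]), "+".join(ts[mid:])


def toy_h_fan_in(q: str, pair1: Tuple[str, str], pair2: Tuple[str, str]) -> Move:
    return str(int(pair1[1]) + int(pair2[1]))


def toy_ml(q: str) -> Answer:
    ts = q.split("+")
    # The toy model only handles sums of at most two terms correctly.
    return str(sum(int(t) for t in ts)) if len(ts) <= 2 else "0"


def toy_adv(q: str, a: Answer) -> float:
    # Higher means "the model's answer is probably fine" (predicted negative loss).
    return 0.0 if len(q.split("+")) <= 2 else -50.0
```

With these stubs, `amplify("1+2+3+4+5", toy_h_fan_out, toy_h_fan_in, toy_ml, toy_adv, random.Random(0))` descends into `"3+4+5"` (where the toy adversary predicts the model is weak) and takes the model's answer for `"1+2"`, so only one branch of the tree consumes human effort at each level.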

This procedure should eventually produce an answer, though we then need to compute the following losses for $M L, M L_{fan out}, M L_{fan in}, Adv, {Adv}_{fan out}, {Adv}_{fan in}$ :

\begin{matrix}
L = - \log (M L (Q) |_{A}) \\
L_{fan out} = - \log (M L_{fan out} (Q) |_{H_{fan out} (Q)}) \\
L_{fan in} = - \log (M L_{fan in} (Q, (S Q_{1}, A_{1}), (S Q_{2}, A_{2})) |_{H_{fan in} (Q, (S Q_{1}, A_{1}), (S Q_{2}, A_{2}))}) \\
L_{Adv} = {| S_{1} - \log (M L (S Q_{1}) |_{A_{1}}) |}^{2} \\
L_{{Adv}_{fan out}} = {| {Adv}_{fan out} (Q, (S Q_{1}, S Q_{2})) + L_{fan out} |}^{2} \\
L_{{Adv}_{fan in}} = {| {Adv}_{fan in} (Q, (S Q_{1}, A_{1}), (S Q_{2}, A_{2}), A) + L_{fan in} |}^{2}
\end{matrix}

Note that some of these losses are actually for the same model since $M L_{fan out}, M L_{fan in}$ are implemented internally by calling $M L$ and ${Adv}_{fan out}, {Adv}_{fan in}$ are implemented internally by calling $Adv$ .
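To make the sign convention in the adversary losses concrete, here is a small numeric sketch. It assumes, as clarified in the comments below, that the adversary's score is a prediction of the negative of the model's loss, so a perfectly calibrated adversary drives the sum of score and loss to zero. The helper names `nll` and `adversary_loss` are hypothetical.

```python
import math


def nll(prob_of_target: float) -> float:
    """Model loss: negative log-probability assigned to the target answer."""
    return -math.log(prob_of_target)


def adversary_loss(score: float, model_loss: float) -> float:
    """Squared error between the adversary's score and the negative model loss."""
    return (score + model_loss) ** 2


# Suppose the model puts probability 0.25 on the answer that amplification produced.
model_loss = nll(0.25)                                       # equals log(4)
calibrated = adversary_loss(-model_loss, model_loss)         # adversary predicted -loss exactly
overconfident = adversary_loss(0.0, model_loss)              # adversary predicted zero loss
```

Here `calibrated` is zero and `overconfident` equals the squared model loss, so the adversary is penalised exactly when its score fails to track how badly the model did.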

Distillation

For the distillation procedure, we want to improve $M L$ by training it to approximate its own $M L$ -consulting- $M L$ tree. Specifically:

1. Generate some $Q \in Q$ as in amplification.

2. Call $M L_{fan out} (Q)$ . If we get an answer, let $A = M L_{fan out} (Q)$ and skip to the end. Otherwise, let $S Q_{1}, S Q_{2} = M L_{fan out} (Q)$ .

3. Let $A_{1} = M L (S Q_{1})$ and $A_{2} = M L (S Q_{2})$ .

4. Call $M L_{fan in} (Q, (S Q_{1}, A_{1}), (S Q_{2}, A_{2}))$ . If we get an answer, let $A = M L_{fan in} (Q, (S Q_{1}, A_{1}), (S Q_{2}, A_{2}))$ . Otherwise, let $S Q_{1}, S Q_{2} = M L_{fan in} (Q, (S Q_{1}, A_{1}), (S Q_{2}, A_{2}))$ and return to step 3.

Then, compute the following losses for $M L, Adv$ :

\begin{matrix}
L = - \log (M L (Q) |_{A}) \\
L_{Adv} = {| Adv (Q, A) + L |}^{2}
\end{matrix}
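The distillation procedure can be sketched in the same style. This is a minimal illustration, assuming string answers, a tuple of two strings for a subquestion pair, and deterministic stand-ins for $M L$, $M L_{fan out}$ and $M L_{fan in}$; all `toy_*` names are hypothetical. Note that no human is consulted anywhere, and that the loop is only guaranteed to terminate if the fan-in model eventually returns an answer.

```python
import math
from typing import Callable, Tuple, Union

Move = Union[str, Tuple[str, str]]  # an answer, or two subquestions


def distill_target(q: str, ml_fan_out: Callable, ml_fan_in: Callable, ml: Callable) -> str:
    """Compute the answer of the ML-consulting-ML tree for question q."""
    move: Move = ml_fan_out(q)                        # step 2
    while not isinstance(move, str):
        sq1, sq2 = move
        a1, a2 = ml(sq1), ml(sq2)                     # step 3: no recursion, no human
        move = ml_fan_in(q, (sq1, a1), (sq2, a2))     # step 4
    return move


def distillation_losses(p_target: float, adv_score: float) -> Tuple[float, float]:
    """Return (L, L_Adv), given ML's probability of the target and Adv(Q, A)."""
    loss = -math.log(p_target)
    return loss, (adv_score + loss) ** 2


# Hypothetical toy model: reliable on sums of at most two terms, wrong otherwise.
def toy_ml(q: str) -> str:
    ts = q.split("+")
    return str(sum(int(t) for t in ts)) if len(ts) <= 2 else "0"


def toy_ml_fan_out(q: str) -> Move:
    ts = q.split("+")
    if len(ts) <= 2:
        return toy_ml(q)
    mid = len(ts) // 2
    return "+".join(ts[:mid]), "+".join(ts[mid:])


def toy_ml_fan_in(q: str, pair1: Tuple[str, str], pair2: Tuple[str, str]) -> Move:
    return str(int(pair1[1]) + int(pair2[1]))
```

In this toy setup `toy_ml("1+2+3+4")` is wrong, while `distill_target("1+2+3+4", toy_ml_fan_out, toy_ml_fan_in, toy_ml)` is right, so training the model toward the tree's answer is what lets it improve without additional calls to $H$ .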

Iteration

Finally, we want to iterate the above procedure by replacing $H, H_{fan out}, H_{fan in}$ in the amplification procedure with some $H^{'}, H_{fan out}^{'}, H_{fan in}^{'}$ . First, let $α$ be some confidence threshold. Then, we will define the following primed $H$ s:

Let $H^{'} (Q) = \text{if } Adv (Q, M L (Q)) + ϵ < α + ϵ \text{ then } H (Q) \text{ else } M L (Q)$ .
Let

\begin{matrix} H_{fan out}^{'} (Q) = \text{if } {Adv}_{fan out} (Q, M L_{fan out} (Q)) + ϵ < α + ϵ \text{ then } H_{fan out} (Q) \text{ else } M L_{fan out} (Q) \end{matrix}

Let

\begin{matrix} H_{fan in}^{'} (Q, (S Q_{1}, A_{1}), (S Q_{2}, A_{2})) = \text{if } {Adv}_{fan in} (Q, (S Q_{1}, A_{1}), (S Q_{2}, A_{2}), M L_{fan in} (Q, (S Q_{1}, A_{1}), (S Q_{2}, A_{2}))) + ϵ < α + ϵ \text{ then } H_{fan in} (Q, (S Q_{1}, A_{1}), (S Q_{2}, A_{2})) \text{ else } M L_{fan in} (Q, (S Q_{1}, A_{1}), (S Q_{2}, A_{2})) \end{matrix}

This procedure allows us to continue amplifying the model while using the adversary to require only minimal human data that is selected so as to be maximally helpful.
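A minimal sketch of $H^{'}$ follows; the fan-out and fan-in variants would be wrapped the same way. It assumes string answers and a model that returns one sampled answer, and it samples the model once and reuses that sample on the else branch, which is a simplification of the definition above. The names `make_h_prime`, `toy_human`, `toy_ml` and `toy_adv` are hypothetical.

```python
import math
import random
from typing import Callable, List


def gumbel(rng: random.Random) -> float:
    """One independent standard Gumbel sample (a fresh epsilon)."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return -math.log(-math.log(u))


def make_h_prime(h: Callable[[str], str], ml: Callable[[str], str],
                 adv: Callable[[str, str], float], alpha: float,
                 rng: random.Random) -> Callable[[str], str]:
    """Wrap the human so that they are consulted only when the adversary is unsure."""
    def h_prime(q: str) -> str:
        proposal = ml(q)
        # Both sides get their own noise sample, as in the definition of H'.
        if adv(q, proposal) + gumbel(rng) < alpha + gumbel(rng):
            return h(q)          # predicted quality below threshold: ask the human
        return proposal          # otherwise keep the model's answer
    return h_prime


human_calls: List[str] = []


def toy_human(q: str) -> str:
    human_calls.append(q)        # record how much human effort was spent
    return "human:" + q


def toy_ml(q: str) -> str:
    return "ml:" + q


def toy_adv(q: str, a: str) -> float:
    # Toy scores chosen far from alpha so that the noise almost never flips the outcome.
    return 100.0 if q.startswith("easy") else -100.0


h_prime = make_h_prime(toy_human, toy_ml, toy_adv, alpha=0.0, rng=random.Random(0))
easy_answer = h_prime("easy question")
hard_answer = h_prime("hard question")
```

After running this, only the hard question appears in `human_calls`, which is the intended effect: human data is requested where the adversary expects the model to do poorly. With scores closer to $α$ the Gumbel noise makes the choice stochastic.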

Conclusion

This proposal differs in a couple of ways from previous proposals made by Paul. First, Paul has recently moved away from discrete amplification/distillation steps. This proposal provides a way to recover discrete steps while still collapsing the recursion. In practice, though, you might still want to stick with the amplification procedure described here and skip the distillation step, as it isn't strictly necessary.

Second, this proposal uses an adversary to guide the training process. This technique is similar to the concept of importance sampling. The main benefit of this approach is that it takes advantage of active learning by allowing the system to choose which questions and subquestions would be most useful for it to have answered by a human.

Another benefit of the adversary is that it could make transparency much easier. One of the major benefits of IDA is that $M L$ gets trained to approximate its own $M L$ -consulting- $M L$ tree. As a result, the reasoning that went into the final answer produced by $M L$ can be recovered by unfolding its tree (at least in the limit of perfect training). Unfolding the entire tree is very expensive, however, as the cost is linear in the size of the tree. With an adversary, you can instead choose which portions of the tree to unfold first by calling the adversary, enabling you to find errors much more quickly; for a perfect adversary and a roughly balanced tree with $n$ nodes, this reduces the problem of finding an error from $O (n)$ to $O (\log n)$ .
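The search-cost claim can be illustrated with a toy model, under strong assumptions: the tree is a complete binary tree stored with heap-style indices (root 1, children $2i$ and $2i + 1$), exactly one leaf is faulty, the fault taints every ancestor's answer, and the adversary is perfect. All names here are hypothetical.

```python
from typing import Callable, Tuple


def is_leaf(node: int, depth: int) -> bool:
    """Leaves of a complete binary tree of the given depth start at index 2**depth."""
    return node >= 2 ** depth


def perfect_adversary(error_leaf: int) -> Callable[[int], float]:
    """Score -1.0 for nodes whose subtree contains the faulty leaf, else 0.0."""
    def score(node: int) -> float:
        probe = error_leaf
        while probe > node:
            probe //= 2          # walk up from the faulty leaf toward the root
        return -1.0 if probe == node else 0.0
    return score


def guided_search(depth: int, adv: Callable[[int], float]) -> Tuple[int, int]:
    """Follow the lowest-scoring child at each level; return (leaf, nodes unfolded)."""
    node, unfolded = 1, 1
    while not is_leaf(node, depth):
        node = min((2 * node, 2 * node + 1), key=adv)
        unfolded += 1
    return node, unfolded


def exhaustive_search(depth: int, error_leaf: int) -> int:
    """Unfold nodes in depth-first order until the faulty leaf is reached."""
    stack, unfolded = [1], 0
    while stack:
        node = stack.pop()
        unfolded += 1
        if node == error_leaf:
            return unfolded
        if not is_leaf(node, depth):
            stack.extend((2 * node + 1, 2 * node))   # left child is popped first
    return unfolded
```

For a tree of depth 10 (2047 nodes) whose faulty leaf is the last one in depth-first order, `guided_search` unfolds 11 nodes while `exhaustive_search` unfolds all 2047. A real adversary would be imperfect, so this is a best case rather than an expected cost.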

Thus, the hope is that the use of such an adversary could assist both in making IDA more competitive (by increasing sample efficiency and using active learning) and in making IDA safer (due to the increased ease of transparency).

It should be noted, however, that the use of such an adversary might also make the safety situation for IDA worse. First, it introduces the possibility of a robustness to relative scale failure if either $M L$ or $Adv$ gets significantly stronger than the other. One possible way to resolve such an issue might be to give $Adv$ the ability to call $M L$ and vice versa, allowing them to use each other to boost their own capabilities. Second, if $M L$ and $Adv$ are themselves optimizers with goals that don't perfectly match their loss functions, they could cooperate to make it arbitrarily unlikely that $H$ is ever consulted on some specific question. Third, even if $M L$ and $Adv$ weren't cooperating, an RSA-2048-style failure could still prevent the identification of malicious cognition. Resolving failures of the latter two types is still an open question (EDIT: see "Risks from Learned Optimization in Advanced Machine Learning Systems," by Hubinger, van Merwijk, Mikulik, Skalse, and Garrabrant).

Planned entries for the newsletter:

Summary:

This post presents a method to use an adversary to improve the sample efficiency (with respect to human feedback) of iterated amplification. The key idea is that when a question is decomposed into subquestions, the adversary is used to predict which subquestion the agent will do poorly on, and the human is only asked to resolve that subquestion. In addition to improving sample efficiency by only asking relevant questions, the resulting adversary can also be used for interpretability: for any question-answer pair, the adversary can pick out specific subquestions in the tree that are particularly likely to contain errors, which can then be reviewed.

Opinion:

I like the idea, but the math in the post is quite hard to read (mainly due to the lack of exposition). The post also has separate procedures for amplification, distillation and iteration; I think they can be collapsed into a single more efficient procedure, which I wrote about in this comment.

Given that you are training the model during amplification, I don't really see why you also have a distillation step, and an iteration step. I believe the point of that separation is to allow amplification to not involve ML at all, so that you can avoid dealing with the issues around bootstrapping -- but if you train while amplifying, you are already bootstrapping. In addition, you're requiring that exactly one subquestion be sent to the human, but it seems better to allow it to be zero, one or two, depending on how confident the adversary is in the ML model's answer. Concretely, I would get rid of both distillation and iteration, and change step 4 of the amplification procedure:

4. For $i \in \{1, 2\}$ , flip a biased coin $r_{i} \sim Bernoulli (f (S_{i}))$ , where $f$ is a function that computes recursion probabilities from adversary scores. If $r_{i} = True$ , compute $A_{i}$ by recursing on $S Q_{i}$ , else set $A_{i} = M L (S Q_{i})$ .

You could compute $r_{i} = S_{i} + ϵ < α + ϵ$ if you want to use a confidence threshold with Boltzmann exploration.

This new procedure allows for the behavior you have with distillation, in the cases where it actually makes sense to do so: you recover distillation in the case where the adversary thinks that the answers from $M L$ to both subquestions are good.
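A rough sketch of this modified step 4, under assumptions: the choice $f (S) = 1 / (1 + e^{S - α})$ is one possible $f$ , picked here because the difference of two independent standard Gumbel variables follows a logistic distribution, so this coin has the same recursion probability as the noisy threshold test $S_{i} + ϵ < α + ϵ$ . The function names and the `recurse` callback are hypothetical.

```python
import math
import random
from typing import Callable, List, Tuple


def recursion_probability(score: float, alpha: float) -> float:
    """One possible f: P(score + eps1 < alpha + eps2) for independent Gumbel noise."""
    return 1.0 / (1.0 + math.exp(score - alpha))


def choose_answers(sub_questions: Tuple[str, str], ml: Callable[[str], str],
                   adv: Callable[[str, str], float], recurse: Callable[[str], str],
                   alpha: float, rng: random.Random) -> List[str]:
    """For each subquestion, flip a biased coin to decide: recurse, or trust ML."""
    answers = []
    for sq in sub_questions:
        proposal = ml(sq)
        p = recursion_probability(adv(sq, proposal), alpha)
        if rng.random() < p:
            answers.append(recurse(sq))    # r_i is True: expand this branch
        else:
            answers.append(proposal)       # r_i is False: keep the model's answer
    return answers


# Hypothetical stubs with scores far from alpha, so the coins are nearly deterministic.
def toy_ml(q: str) -> str:
    return "ml:" + q


def toy_adv(q: str, a: str) -> float:
    return 50.0 if q.startswith("easy") else -50.0


def toy_recurse(q: str) -> str:
    return "recursed:" + q
```

With these stubs, a pair of one easy and one hard subquestion recurses only on the hard one, two easy subquestions recurse on neither (the distillation-like case), and two hard subquestions recurse on both.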

The last two adversary losses have a typo: you should be computing the difference between the adversary's prediction and the true loss, not the sum.

Meta: I found this post quite hard to read, since everything was written in math with very little exposition.

I considered collapsing all of it into one (as Paul has talked about previously), but as you note the amplification procedure I describe here basically already does that. The point of the distillation step, thus, is just to increase sample efficiency by letting you get additional training in without requiring additional calls to $H$ . I do agree that you could fold the iteration procedure described here into the amplification procedure, which is probably a good idea, though you'd probably want to anneal $α$ in that situation, as $Adv$ starts out really bad. In this setup, by contrast, you shouldn't have to do any annealing, because by the time you get to that point $Adv$ should be performing well enough that it will automatically anneal as its predictions get better. Also, apologies for the math--I didn't really have the time to write up more explanation, so it was a choice between posting it as is or not posting it at all, and I went with posting it as is.

(Also, the sum isn't a typo--I'm using the adversary to predict the negative of the loss, not the loss, which I admit is confusing and I should probably switch it.)

"I didn't really have the time to write up more explanation, so it was a choice between posting it as is or not posting it at all, and I went with posting it as is."

Makes sense. I think I could not tell how much I should be trying to understand this until I understood it. I probably would have chosen not to read it if I had known how long it would take and how important I thought it was (ex-post, not ex-ante). For posts where that's likely to be true, I would push for not posting at all.

Another way you could see this: given my current state of knowledge about this post, I think I could spend ~15 minutes making it significantly easier to understand. The resulting post would have been one that I could have read more than 15 minutes faster, probably, for the same level of understanding.

I think it's not worth making a post if you don't get at least one person reading it in as much depth as I did; so you should at the very least be willing to trade off some of your time for an equal amount of time of that reader, and the benefit scales massively the more readers you have. The fact that this was not something you wanted to do feels like a fairly strong signal that it's not worth posting since it will waste other people's time.

(Of course, it might have taken you longer than 15 minutes to make the post easier to understand, or readers might usually not take a whole 15+ minutes more to understand a post without exposition, but I think the underlying point remains.)

"The point of the distillation step, thus, is just to increase sample efficiency by letting you get additional training in without requiring additional calls to $H$ ."

Note that my proposed modification does allow for that, if the adversary predicts that both of the answers are sufficiently good that neither one needs to be recursed on. Tuning $α$ in my version should allow you to get whatever sample efficiency you want. An annealing schedule could also make sense.

"(Also, the sum isn't a typo--I'm using the adversary to predict the negative of the loss, not the loss, which I admit is confusing and I should probably switch it.)"

Ah, yeah, I see it now.

