Understanding the two-head strategy for teaching ML to answer questions honestly

0james.lucassen

New Comment

Proposed toy examples for G:

- G is "the door opens", a- is "push door", a+ is "some weird complicated doorknob with a lock". Pretty much any b- can open a-, but only a very specific key+manipulator combo opens a+. a+ is much more
*informative*about successful b than a- is. - G is "I make a million dollars", a- is "straightforward boring investing", a+ is "buy a lottery ticket". A wide variety of different world-histories b can satisfy a-, as long as the markets are favorable - but a very narrow slice can satisfy a+. a+ is a more
*fragile*strategy (relative to noise in b) than a- is.

This post is the result of my attempts to understand what’s going on in these two posts from summer 2021:

Paul Christiano:

https://www.alignmentforum.org/posts/QqwZ7cwEA2cxFEAun/teaching-ml-to-answer-questions-honestly-instead-ofEvan Hubinger: https://www.alignmentforum.org/posts/gEw8ig38mCGjia7dj/answering-questions-honestly-instead-of-predicting-human

The underlying problem is similar in some ways to Eliciting Latent Knowledge, although not quite the same. According to Paul, understanding the two-headed algorithm (described in these posts) is likely to be fairly useful to people working on ELK. A similar algorithm is described in the ELK report, under Strategy: penalize reporters that work with many different predictors and in related appendices.

I initially found the algorithm pretty confusing, so I wrote this post for anyone else who's confused by it.

I start with the basic strategy underlying Paul’s post, but in a simpler / more general setting without any ML. Then I add back in the ML context, followed by the specific context of the problem Paul and Evan discuss in their posts. (This is not quite the same as the ELK context.)

Along the way, I outline a simple application of this strategy: finding the second-simplest “meaningfully distinct” solution to a problem. I note that Paul’s application is sort of a special case of this.

## Simple abstract version: conditioning on G(a, b)

Suppose we have a probability distribution over some discrete space A × B:

p:A×B→R

This is shown in the figure below -- dots represent probability mass:

Suppose there’s some point (a

^{+}, b^{+}) in A × B, and we want to obtain it by sampling from this distribution. Unfortunately, it’s not very high probability, so we probably won’t succeed:## Conditioning on D(·, ·)

An obvious strategy, if sampling is very cheap, is to sample repeatedly until we get (a

^{+}, b^{+}).Unfortunately, we don’t even know how to recognize it if we had it -- there are other points that look very similar. Worse still, some of them are higher in probability.

Formally, let’s say that we have a predicate D: A × B → {0, 1}, which is true for (a

^{+}, b^{+}) and false for most other points. It’s also true for some other, unwanted points, with (a^{-}, b^{-}) being the highest-probability one:We still don’t know how to get (a

^{+}, b^{+}), though -- the best we can do is (a^{-}, b^{-}). We don’t knowanyeasy-to-check predicate that is true for one and false for the other.## Conditioning on G(a, ·)

Let’s suppose that we

doknow of an easy-to-check predicate G:A×B→{0,1}, such thatWe can abbreviate the last part as

In other words:

^{-}randomly while keeping a^{-}, G is fairly likely to still be true.^{+}randomly while keeping a^{+}, G is less likely to still be true.This is a somewhat nonintuitive property, but I’ll give a more motivated example of it in a more specific setting later. If you can think of nice simple toy models where a G like this exists, please let me know!

This predicate might seem unhelpful -- G is less robustly true for the point we want -- but it’s actually exactly what we need.

With such a G in hand, the basic idea is to use rejection sampling to throw out anything for which G=0, and redistribute its probability (under p) to other values

with the same A part. This will leave anything of the form (a^{+}, b) s.t. G(a^{+}, b)=1 with a bunch of extra probability mass, but won’t help (a^{-}, b) nearly as much.In pseudocode:

This figure shows the effect of this procedure. After redistributing from the G=0 set (shaded in red), the (+) point has higher probability than the (-) point, because it ate more of its neighbors.

The distribution p

_{G}that this produces is given bypG(a,b)=p(a,b)/p(G|a)

We have two inequalities to worry about:

As long as the second inequality is stronger (the ratio of the sides is larger), we’ll have

pG(a+,b+)>pG(a−,b−)

We still have to worry about any

otherpoints (a’, b’) which get a large boost from conditioning on G in this way; it’s not obvious that (a^{-}, b^{-}) is still the “biggest competitor” to (a^{+}, b^{+}). In this toy example, that turns out not to matter, but in the ML applications below it’ll be important.## Special case: second-simplest-point algorithm

Here’s a use case for the procedure above:

Suppose we have some family of closely-related simplicity priors. These are defined on a space X, another space B, and various combinations of these such as X × B and (X × X) × B. In an abuse of notation, we’ll call all of these p.

I said p was a “simplicity prior”. What I mean by this is things like the following:

These properties of p ought to be true for any notion of “simplicity”. The combination of two values is at least as complicated as either of them, and combining a value with itself is approximately as complicated as that value already is. We don’t need anything stronger than this at the moment.

We have some predicate D0(x,b), which checks that (x, b) is “correct” in some complicated way.

We also have a predicate G0((x,b),(x′,b′)), which checks that (x, b) is

similar to(x’, b’) in some complicated way.You can think of these as deriving from some large set of tests -- D

_{0}checks that the test outputs are all correct, and G_{0}just checks that they’re allthe samefor (x, b) and (x’, b’).We want the

second-simplestpoint that satisfies D_{0}. We’re still pretty happy if we can get both the first- and second-simplest points satisfying D_{0}, even if we don’t know which is which.Let’s assume for simplicity (heh) that they have the same B-part:

(If they have different B-parts b

^{-}and b^{+}, there is probably still a way to do this, but I haven’t tried to work out what it is.)D

_{0}is true for both of these, and G_{0}can’t tell them apart, because they both pass all the tests.Let’s further assume that randomly changing b* to something else is very likely to break these properties. For the simplest values of b, (x

^{-}, b) and (x^{+}, b) both fail our testsin different ways, so G_{0}candistinguish them:We can’t just rejection-sample for D

_{0}from p directly, or we’ll probably get (x^{-}, b*).Instead, let’s define A=X×X and start working in the space A×B=(X×X)×B.

Here’s how we can use G

_{0}to implement the scheme described above:_{0}(x, b) or D_{0}(x’, b))._{0}((x, b), (x’, b))This G is trivial for a=a

^{-}, because the two things there are the same inallrespects, no matterwhatb is, sop(G|a−)≈1

But for a=a

^{+}, G will fail formanydifferent values of b, because x^{+}and x^{-}behave differently except when combined with b*:p(G|a+)≪1

Now we have all the pieces and satisfy all the necessary conditions, so we can do the two-loop procedure described above in pseudocode to obtain (a

^{+}, b*).This doesn’t quite get us (x

^{+}, b*) -- we don’t know which x is which -- but we’re most of the way there.Exercise for the reader: How do you generalize this to get the k simplest “meaningfully distinct” X values, for k>2?

It would be interesting to do some small experiments on this in a standard ML setting like image recognition. What’s the second-simplest way to recognize a dog?

## ML interpretation

All of this has been pretty abstract -- let’s bring in all the baggage of machine learning:

^{+}, b^{+}) and (a^{-}, b^{-}) are two possible trained models; (a^{+}, b^{+}) is better behaved in some way but is disfavored by the prior.^{+}, in addition to other functionality, imposes the condition that the boolean output is only 1 if the B-part of the model satisfies some property G^{+}that is:^{+}^{-}doesn’t impose any condition like this, because there’s no G^{-}that is comparably cheap and comparably specific to b^{-}.The biggest difference between this and the more abstract setup is that a

^{+}and a^{-}get to implement their own conditions G^{+}, G^{-}, with the fixed externally-specified G just being a wrapper around these. I think this is actually unavoidable for sufficiently flexible ML models; the A-part of the model can always just return nonsense if its chosen condition isn’t satisfied, so long as that condition isn’t penalized much by the inductive prior. We aren’t assuming any ability to check things other than network outputs, so the G given above is pretty much fully general.## Alignment application: Intended and instrumental model

The use case Paul came up with this technique for is a special case of the ML version of this algorithm, with the following substitutions:

^{+}: Answer is computed using the intended model; output bit is G^{+}(b)^{+}: the instrumental model agrees with the intended model on the input^{-}: Answer is computed using the instrumental model; output bit is 1^{-}= b^{+}= correct world model^{-}and leave b^{-}empty, it doesn’t really matter## Why this works, and why it might not

Note that G+ is about as complicated as the instrumental model itself! Implementing a

^{+}(including the description of G^{+}) costs about as much complexity as the intended model plus the instrumental model, while a^{-}only costs as much as the instrumental model.On the other hand, p(G

^{+}| a^{+}) is very small -- conditioning on it “pays back” the complexity ofthe simplest model of a human that knows how to use natural language, more or less. This is comparable to the complexity of the intended model (because it’s doing approximately the same thing), so in the end a^{+}only has to pay for the instrumental model -- in other words pG(a+,b+)≈pG(a−,b−).As mentioned above, however, (a

^{-}, b^{-}) is the simplest model that fits the data (highest-probability point conditional on D),notnecessarily the most-likely model after conditioning on G. In particular, if there exists any cheap-to-specify G^{-}such that p(G^{-}| a^{-}) is small, then we’ll end up with parameters that implement the instrumental model and G^{-}. Evan analyzes several possibilities for this G^{-}, such as “the world model is logically consistent”, and looks at whether there are defenses against these in the form of an even-pickier G^{+}that uses the same trick.## Two-answer-head version

Paul and Evan describe two ways to think about this:

The version with the boolean head is a bit better for thinking about possible problems, IMO, because it gives a more natural description of situations like “what if G

^{-}checks B for logical consistency”. This is because of the aforementioned thing where the A-part of the model gets to implement its own conditions on the B-part.On the other hand, the two-answer-head version is possibly better for understanding the basic setup.

This two-answer-head version is very similar to the “second-simplest-point” algorithm above. G is defined as “the two output heads agree”, and this algorithm will therefore (hopefully) find a set of trained parameters where the output heads are meaningfully different. As above, “meaningfully different” means that they only agree conditional on something very specific happening in B, which in this case is the world model.

Two heads implementing the instrumental model will agree whether or not B is correct, because they’re both looking at the same thing (what the B-model of a human says). On the other hand, with one instrumental head and one intentional one, they’ll only agree if B is a good enough model of a human to predict human answers on the training data.

Thanks to Paul Christiano for discussion of the subject matter, and Beth Barnes and Evan Hubinger for feedback and discussion on this post.