Answering questions honestly instead of predicting human answers: lots of problems and some solutions

[-]Rohin Shah4y40

except now $M_{θ_{1}, θ_{2}}|_{f_{?}}$ is checked over all inputs, not just over the dataset (note that we still update on the dataset at the end; it's just our prior which is now independent of it).

Doesn't this mean that the two heads have to be literally identical in their outputs? It seems like at this point your prior is "generate parameters randomly under the constraint that the two heads are identical", which seems basically equivalent to having a single head and generating parameters randomly, so it seems unintuitive that this can do anything useful.

(Disclaimer: I skimmed the post because I found it quite challenging to read properly, so it's much more likely than usual that I failed to understand a basic point that you explicitly said somewhere.)

[-]evhub4y20

It seems like at this point your prior is "generate parameters randomly under the constraint that the two heads are identical"

That's not what the prior looks like—the prior is more like “generate parameters that specify some condition, then sample parameters that make that condition true.” Thus, you don't need to pay for the complexity of satisfying the condition, only the complexity of specifying it (as long as you're content with the simplest possible way to satisfy it). This is why the two-step nature of the algorithm is necessary—the prior you're describing is what would happen if you used a one-step algorithm rather than a two-step algorithm (which I agree would then not do anything).

[-]Rohin Shah4y40

Hmm, I'm not thinking about the complexity part at all right now; I'm just thinking mechanically about what is implied by your equations.

the prior is more like “generate parameters that specify some condition, then sample parameters that make that condition true.”

I'm not sure exactly what you mean by the parameters specifying some condition. I thought the condition was specified upfront by the designer (though of course to check the condition you need to look at both parameters, so you can view this as the first set of parameters specifying a condition on the second set of parameters). As far as I can tell, the intended condition is "the two heads are identical" in the dataset-less case. Looking directly at the math, the equations you have are:

$θ_{1} \sim p(θ_{1})$
$θ_{2} \sim p(θ_{2} \mid θ_{1}) \cdot I[\forall x \in X.\ \forall q \in Q.\ M_{θ_{1}, θ_{2}}|_{f_{?}}(x, q)]$

My interpretation is:

Generate θ1 randomly.
Generate θ2 randomly from θ1, subject to the constraint that the two heads output the same value on all possible inputs.
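A minimal sketch of this two-step reading, written as sampling over a toy finite parameter space. The parameter grid, the toy heads, and the `agree_everywhere` check are all assumptions made for illustration and do not come from the post; the only point is that the condition on `theta_2` depends on which `theta_1` was drawn.

```python
import random

# Hypothetical toy setup: parameters are small integers, inputs form a finite grid.
PARAMS = range(4)
INPUTS = [(x, q) for x in range(3) for q in range(3)]

def head_1(theta_1, theta_2, x, q):
    # Toy head: both parameter sets feed into the computation.
    return (theta_1 * x + theta_2 * q) % 2

def head_2(theta_1, theta_2, x, q):
    # Second toy head: here only theta_1 is used.
    return (theta_1 * x + theta_1 * q) % 2

def agree_everywhere(theta_1, theta_2):
    # Stand-in for the indicator: the two heads agree on every (x, q).
    return all(
        head_1(theta_1, theta_2, x, q) == head_2(theta_1, theta_2, x, q)
        for x, q in INPUTS
    )

def sample_prior(rng):
    # Stage 1: sample theta_1 from its (here uniform) prior.
    theta_1 = rng.choice(PARAMS)
    # Stage 2: sample theta_2 given theta_1, restricted to values satisfying the condition.
    allowed = [t for t in PARAMS if agree_everywhere(theta_1, t)]
    theta_2 = rng.choice(allowed)
    return theta_1, theta_2
```

In this toy, the heads agree exactly when `theta_2` has the same parity as `theta_1`, so the first sample determines which values of the second are allowed.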

Imagine there was a bijection between model parameters and resulting function. (I'm aware this is not at all true.) In that case it seems like you are enforcing the constraint that the two heads have identical parameters. In which case you could just have generated parameters for the first head, and then copied them over into the second head, rather than go through this complicated setup.

Now, there isn't actually a bijection between model parameters and resulting function. But it seems like the only difference is that you make it more likely that you sample heads which have lots of different implementations in model parameters, i.e. you're doubling the strength of the neural net prior (and that's the only effect). This seems undesirable?

[-]evhub4y40

Hmm, I'm not thinking about the complexity part at all right now; I'm just thinking mechanically about what is implied by your equations.

The only difference between this setup and normal ML is the prior/complexity—you still have the ability to learn all the same functions, it's just that some are more/less likely now.

though of course to check the condition you need to look at both parameters, so you can view this as the first set of parameters specifying a condition on the second set of parameters

Yep, that's exactly right.

Imagine there was a bijection between model parameters and resulting function. (I'm aware this is not at all true.) In that case it seems like you are enforcing the constraint that the two heads have identical parameters.

That's definitely not what should happen in that case. Note that there is no relation between $θ_{1}$ and $f_{1}$ or $θ_{2}$ and $f_{2}$; both sets of parameters contribute equally to both heads. Thus, $θ_{1}$ can enforce any condition it wants on $θ_{2}$ by leaving some particular hole in how it computes $f_{1}$ and $f_{2}$ and forcing $θ_{2}$ to fill in that hole in such a way as to make $θ_{1}$'s computation of the two heads come out equal.

[-]Rohin Shah4y40

The only difference between this setup and normal ML is the prior/complexity—you still have the ability to learn all the same functions, it's just that some are more/less likely now.

Yeah, sorry, I wasn't clear here -- I meant that, rather than reasoning about the complexity of individual pieces / stages and then adding them all up at the end, I am instead simulating out the equations until both $θ_{1}$ and $θ_{2}$ are chosen, and then reasoning about the thing you get afterwards.

Note that there is no relation between $θ_{1}$ and $f_{1}$ or $θ_{2}$ and $f_{2}$ —both sets of parameters contribute equally to both heads. Thus, $θ_{1}$ can enforce any condition it wants on $θ_{2}$ by leaving some particular hole in how it computes $f_{1}$ and $f_{2}$ and forcing $θ_{2}$ to fill in that hole in such a way to make $θ_{1}$ 's computation of the two heads come out equal.

Yes, I think I understand that. (I want to note that since $θ_{1}$ is chosen randomly, it isn't "choosing" the condition on $θ_{2}$ ; rather the wide distribution over $θ_{1}$ leads to a wide distribution over possible conditions on $θ_{2}$ . But I think that's what you mean.)

That's definitely not what should happen in that case.

I think you misunderstood what I was claiming. Let me try again, without using the phrase "enforcing the constraint", which I think was the problem.

Imagine there was a bijection between model parameters and resulting function. In Stage 1 you sample $θ_{1}$ randomly. In Stage 2, you sample $θ_{2}$ , such that it fills in the holes in $f_{1}$ and $f_{2}$ to make $f_{1}$ and $f_{2}$ compute the same function. By our bijection assumption, the parameters in $f_{1}$ must be identical to the parameters in $f_{2}$ . Thus, we can conclude the following:

If $θ_{1}$ contained a parameter from $f_{1}$ and $f_{2}$ in the same location (e.g. it includes the weight at position (3, 5) in layer 3 in both $f_{1}$ and $f_{2}$ ), then it must have assigned the same value to both of them.
If $θ_{1}$ contained a parameter from $f_{1}$ and $θ_{2}$ contained the corresponding parameter from $f_{2}$ , then $θ_{2}$ must have set that parameter to the same value as in $θ_{1}$ .
If $θ_{2}$ contained a parameter from $f_{1}$ and $f_{2}$ in the same location, then it must have assigned the same value to both of them.

These constraints are necessary and sufficient to satisfy the overall constraint that $f_{1} = f_{2}$ , and therefore any other parameters in $θ_{2}$ are completely unconstrained and are set according to the original neural net prior.

So it seems to me that (1) any parameters not in $f_{1}$ or $f_{2}$ are set according to the original neural net prior, and (2) parameters in $f_{1}$ must be identical to the corresponding parameters in $f_{2}$ , but their values are chosen according to the neural net prior.

This seems equivalent to having a single head $f_{1}$ , sampling its parameters from the original prior, and then copying those parameters into $f_{2}$ .

I think you should already be pretty worried by the fact that this seems to give weird results when assuming a bijection between model parameters and resulting functions, but let's analyze it without the bijection assumption too:

Since $f_{1}$ and $f_{2}$ have to be identical on all inputs, it doesn't matter what input they get, and therefore there is no constraint on the part of the neural net that is generating the inputs. So, we still get (1): any parameters not in $f_{1}$ or $f_{2}$ are set according to the original neural net prior. (2) is no longer true, but instead of getting that parameters in $f_{1}$ are equivalent to parameters in $f_{2}$, we get that the function implemented by $f_{1}$ is equivalent to the function implemented by $f_{2}$. Since ultimately the generating process is "sample parameters until $f_{1} = f_{2}$", the probability of getting a particular function $f$ is proportional to the square of the probability of generating parameters for that function, $P_{θ \sim \text{OrigPrior}}(M_{θ} = f)$ (since you have to successfully generate the function twice). So, you are doubling the strength of the neural net prior in the heads, and leaving the strength the same in the world model (i.e. all parts except for the head).
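The squaring claim can be checked on a toy case by exact enumeration. This is only a sketch: the four parameter settings and the two functions "a" and "b" are hypothetical, and each head is assumed to draw its parameters independently and uniformly.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Hypothetical toy: 4 parameter settings per head, implementing 2 functions.
# Function "a" has three implementations, function "b" has one.
PARAM_TO_FUNCTION = {0: "a", 1: "a", 2: "a", 3: "b"}

def normalize(counts):
    total = sum(counts.values())
    return {f: Fraction(c, total) for f, c in counts.items()}

def original_prior():
    # Probability of each function when parameters for one head are sampled uniformly.
    return normalize(Counter(PARAM_TO_FUNCTION.values()))

def matched_heads_prior():
    # Sample parameters for both heads, keep only pairs implementing the same function.
    counts = Counter()
    for p1, p2 in product(PARAM_TO_FUNCTION, repeat=2):
        if PARAM_TO_FUNCTION[p1] == PARAM_TO_FUNCTION[p2]:
            counts[PARAM_TO_FUNCTION[p1]] += 1
    return normalize(counts)
```

Here `original_prior()` assigns 3/4 and 1/4 to "a" and "b", while `matched_heads_prior()` assigns 9/10 and 1/10, which is proportional to the squares of the original probabilities.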

[-]evhub4y*20

Yeah, sorry, I wasn't clear here -- I meant that, rather than reasoning about the complexity of individual pieces / stages and then adding them all up at the end, I am instead simulating out the equations

Sure, makes sense—theoretically, that should be isomorphic.

I want to note that since $θ_{1}$ is chosen randomly, it isn't "choosing" the condition on $θ_{2}$; rather the wide distribution over $θ_{1}$ leads to a wide distribution over possible conditions on $θ_{2}$. But I think that's what you mean.

This seems like a case where I'm using the more constructive formulation of simulating out the equations and you're thinking about it in a more complexity-oriented framing. Of course, again, they should be equivalent.

By our bijection assumption, the parameters in $f_{1}$ must be identical to the parameters in $f_{2}$ .

I'm not sure what you mean by this part— $f_{1}$ and $f_{2}$ are just different heads, not entirely different models, so I'm not sure what you mean by “the parameters in $f_{1}$ .” I don't think that a bijection assumption between weights and single-head outputs really makes sense in this context. I also definitely would say that if $f_{1}$ and $f_{2}$ were separate models such that they couldn't reuse weights between them, then none of the complexity arguments that I make in the post would go through.

These constraints are necessary and sufficient to satisfy the overall constraint that $f_{1} = f_{2}$ , and therefore any other parameters in $θ_{2}$ are completely unconstrained and are set according to the original neural net prior.

I'm happy to accept that there are ways of setting $θ_{1}$ (e.g. just make $f_{1}$ and $f_{2}$ identical) such that the rest of the parameters are unconstrained and just use the neural net prior. However, that's not the only way of setting $θ_{1}$ —and not the most complexity-efficient, I would argue. In the defender's argument, $θ_{1}$ sets all the head-specific parameters for both $f_{1}$ and $f_{2}$ to enforce that $f_{1}$ computes $f^{+}$ and $f_{2}$ computes $f^{-}$ , and also sets all the shared parameters for everything other than the human model, while leaving the human model to $θ_{2}$ , thus enforcing that $θ_{2}$ specify a human model that's correct enough to make $f^{+} = f^{-}$ without having to pay any extra bits to do so.

[-]Rohin Shah4y40

I'm not sure what you mean by this part: $f_{1}$ and $f_{2}$ are just different heads, not entirely different models, so I'm not sure what you mean by “the parameters in $f_{1}$.” I don't think that a bijection assumption between weights and single-head outputs really makes sense in this context. I also definitely would say that if $f_{1}$ and $f_{2}$ were separate models such that they couldn't reuse weights between them, then none of the complexity arguments that I make in the post would go through.

I assumed that when you talked about a model with "different heads" you meant that there is a shared backbone that computes a representation, that is then passed through two separate sequences of layers that don't share any weights, and those separate sequences of layers were the "heads" $f_{1}$ and $f_{2}$ . (I'm pretty sure that's how the term is normally used in ML.) I might benefit from an example architecture diagram where you label what $θ_{1}, θ_{2}, f_{1}, f_{2}$ are.
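A minimal sketch of that standard usage of "heads": a shared backbone computes one representation, which is then passed to two heads that share no weights with each other. The function names and the toy linear layers are assumptions for illustration only; which weights belong to $θ_{1}$ or $θ_{2}$ is deliberately left unlabeled, since that is the point under discussion.

```python
def backbone(shared_weights, x):
    # Shared layers: compute a single representation used by both heads.
    return [w * x for w in shared_weights]

def head(head_weights, representation):
    # Head-specific layers: these weights are not shared with the other head.
    return sum(w * r for w, r in zip(head_weights, representation))

def two_headed_model(shared_weights, head_1_weights, head_2_weights, x):
    # Both heads read the same representation but apply their own weights.
    representation = backbone(shared_weights, x)
    return head(head_1_weights, representation), head(head_2_weights, representation)
```

Under this reading, "the parameters in $f_{1}$" would refer to `head_1_weights` alone, while treating $f_{1}$ as the whole input-to-output map would also include `shared_weights`.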

I did realize that I was misinterpreting part of the math -- the $\forall x, q$ is quantifying over inputs to the overall neural net, rather than to the parts-which-don't-share-weights. My argument only goes through if you quantify the constraint over all inputs to the parts-which-don't-share-weights. Still, assuming that with your desired part-which-shares-weights, every possible input to parts-which-don't-share-weights can be generated by some $x, q$ (which seems like it will be close enough to true), the argument still suggests that conditioning on the desired part-which-shares-weights, you have just doubled the strength of the neural net prior on the parts-which-don't-share-weights.

In the defender's argument, $θ_{1}$ sets all the head-specific parameters for both $f_{1}$ and $f_{2}$ to enforce that $f_{1}$ computes $f^{+}$ and $f_{2}$ computes $f^{-}$

This seems to suggest that $f^{+}$ and $f^{-}$ are different functions, i.e. there's some input on which they disagree. But then $θ_{2}$ has to make them agree on all possible $x, q$ . So is the idea that there are some inputs to $f^{+}$ , $f^{-}$ that can never be created with any possible $x, q$ ? That seems... strange (though not obviously impossible).

[-]evhub4y40

I assumed that when you talked about a model with "different heads" you meant that there is a shared backbone that computes a representation, that is then passed through two separate sequences of layers that don't share any weights, and those separate sequences of layers were the "heads" $f_{1}$ and $f_{2}$.

Yep, that's what I mean.

Still, assuming that with your desired part-which-shares-weights, every possible input to parts-which-don't-share-weights can be generated by some $x, q$ (which seems like it will be close enough to true), the argument still suggests that conditioning on the desired part-which-shares-weights, you have just doubled the strength of the neural net prior on the parts-which-don't-share-weights.

Note that conditioning on the part-which-shares-weights is definitely not what the prior is doing—the only conditioning in the prior is $θ_{2}$ conditioning on $θ_{1}$ . If we look at the intended model, however, $θ_{1}$ includes all of the parts-which-don't-share-weights, while $θ_{2}$ is entirely in the part-which-shares-weights.

Technically, I suppose, you can just take the prior and condition on anything you want—but it's going to look really weird to condition on the part-which-shares-weights having some particular value without even knowing which parts came from $θ_{1}$ and which came from $θ_{2}$ .

I do agree that, if $θ_{1}$ were to specify the entire part-which-shares-weights and leave $θ_{2}$ to fill in the parts-which-don't-share-weights, then you would get exactly what you're describing where $θ_{2}$ would have a doubly-strong neural net prior on implementing the same function for both heads. But that's only one particular arrangement of $θ_{1}$ —there are lots of other $θ_{1}$ s which induce very different distributions on $θ_{2}$ .

This seems to suggest that $f^{+}, f^{-}$ are different functions, i.e. there's some input on which they disagree.

Note that the inputs to $f^{+}, f^{-}$ are deduced statements, not raw data. They are certainly different functions over the space of all possible deduced statements—but once you put a correct world model in them, they should produce equivalent $X \times Q \to A$ maps.

[-]Rohin Shah4y40

Yep, that's what I mean.

Then I'm confused what you meant by

I'm not sure what you mean by this part: $f_{1}$ and $f_{2}$ are just different heads, not entirely different models, so I'm not sure what you mean by “the parameters in $f_{1}$.”

Seems like if the different heads do not share weights then "the parameters in $f_{1}$ " is perfectly well-defined?

Note that conditioning on the part-which-shares-weights is definitely not what the prior is doing

Yeah, sorry, by "conditioning" there I meant "assuming that the algorithm correctly chose the right world model in the end", I wasn't trying to describe a particular step in the algorithm. But in any case I don't think we need to talk about that

They are certainly different functions over the space of all possible deduced statements—but once you put a correct world model in them, they should produce equivalent $X \times Q \to A$ maps.

Okay, so iiuc you're relying on an assumption (fact? desire?) that the world model will never produce deduced statements that distinguish between $f^{+}$ and $f^{-}$ ? My understanding of $f^{+}$ and $f^{-}$ comes from here:

Specifically, $f^{+}$ is the “honest embedding” which directly converts between logical statements and their equivalent natural language, thus answering questions by embedding $q$ as a logical statement and unembedding its answer in $\text{deduced\_stmts}$. Conversely, $f^{-}$ is the “mimicry embedding” which just searches for deductions about what a human would say in response to $q$ and outputs that; thus, $f^{-}$ just quotes $q$, embedding it as just a string of characters for a human to respond to, rather than actually having to understand it in any meaningful way.

If $f^{+}$ and $f^{-}$ produce equivalent $X \times Q \to A$ maps, doesn't that mean that we've just gotten something that can only respond as well as a human? Wouldn't that be a significant limitation? (E.g. given that I don't know German, if my question to the model is "what does <german phrase> mean", does the model have to respond "I don't know"?)

In addition, since the world model will never produce deduced statements that distinguish between $f^{+}$ and $f^{-}$ , it seems like the world model could never produce decision-relevant deduced statements that the human wouldn't have realized. This seems both (a) hard to enforce and (b) a huge capability hit.

[-]evhub4y20

Seems like if the different heads do not share weights then "the parameters in $f_{1}$" is perfectly well-defined?

It seemed to me like you were using it in a way such that $f_{1}$ shared no weights with $f_{2}$ , which I think was because you were confused by the quantification, like you said previously. I think we're on the same page now.

Okay, so iiuc you're relying on an assumption (fact? desire?) that the world model will never produce deduced statements that distinguish between $f^{+}$ and $f^{-}$ ?

Sorry, I was unclear about this in my last response. $f^{+}$ and $f^{-}$ will only agree in cases where the human understands what's happening. In the dataset version, we get that by collecting a dataset where we think the human always gets it right, whereas in the dataset-less version, we get that by including the $\text{H\_understands}$ check which ensures that we don't have to satisfy the condition when the human would get the question wrong.

[-]Rohin Shah4y40

I think I might be missing a change you made to the algorithm. Can $θ_{1}$ write an arbitrary program for $f_{?}$? In that case, what prevents you from getting

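def M_theta_1_plus(theta_2, x, q):
    # world_model, deduction, and f_minus are not defined here; they stand for the post's components.
    axioms = world_model(theta_2=theta_2)(x)
    deduced_stmts = deduction(axioms)
    return {
        "f": f_minus(q, deduced_stmts),
        # The condition is hard-coded to True, so it places no constraint on theta_2.
        "f?": True,
    }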

It seems like this should be lower complexity than the intended result, since True has much lower complexity than H_understands?

It seemed to me like you were using it in a way such that $f_{1}$ shared no weights with $f_{2}$

I mean, I would still have said this because I interpret a "head" $f_{1}$ as "the part after the shared layers", but I'm also happy to instead treat $f_{1}$ as the entire function $X \times Q \to A$ for which the first head forms part of the implementation.

[-]evhub4y40

Can $θ_{1}$ write an arbitrary program for $f_{?}$?

Yes—at least that's the assumption I'm working under.

It seems like this should be lower complexity than the intended result, since True has much lower complexity than H_understands?

I agree that the $θ_{1}$ you've described has lower complexity than the intended $θ_{1}$ —but the $θ_{2}$ in this case has higher complexity, since $θ_{2}$ is no longer getting any of its complexity for free from conditioning on the $f_{?}$ condition. And in fact what you've just described is precisely the unintended model—what I call $M^{-}$ —that I'm trying to compete against, with the hope being that the savings that $M^{+}$ gives you in $θ_{2}$ are sufficient to compensate for the loss in having to specify $f^{+}$ and H_understands in $θ_{1}$ .

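If we calculate the complexity of your proposal, we get
$\begin{aligned} \text{complexity}(M^{-}) &= \text{complexity}(θ_{1}^{-}) + \text{complexity}(θ_{2}^{-} \mid M^{-}|_{f_{?}}) \\ &= \text{complexity}(W - H) + \text{complexity}(f^{-}) + \text{complexity}(H \mid \text{True}) \\ &= \text{complexity}(W - H) + \text{complexity}(f^{-}) + \text{complexity}(H) \\ &\approx \text{complexity}(W) \end{aligned}$
whereas, if we calculate the complexity of the intended $M^{+}$, we get
$\begin{aligned} \text{complexity}(M^{+}) &= \text{complexity}(θ_{1}^{+}) + \text{complexity}(θ_{2}^{+} \mid M^{+}|_{f_{?}}) \\ &= \text{complexity}(W - H) + \text{complexity}(f^{-}) + \text{complexity}(f^{+}) + \text{complexity}(\text{H\_understands}) + \text{complexity}(H \mid \text{H\_understands} \to f^{+} = f^{-}) \\ &\approx \text{complexity}(W - H) + \text{complexity}(f^{+}) + \text{complexity}(\text{H\_understands}) + \text{complexity}(H) - \min_{θ_{2}} \{ \text{complexity}(θ_{2}) \mid \text{H\_understands}|_{H=θ_{2}} \to f^{+}|_{H=θ_{2}} = f^{-}|_{H=θ_{2}} \} \\ &\approx \text{complexity}(W) + \text{complexity}(f^{+}) + \text{complexity}(\text{H\_understands}) - \min_{θ_{2}} \{ \text{complexity}(θ_{2}) \mid \text{H\_understands}|_{H=θ_{2}} \to f^{+}|_{H=θ_{2}} = f^{-}|_{H=θ_{2}} \} \end{aligned}$
such that you can see that the question of which one wins is precisely dependent on whether the savings from conditioning on $\text{H\_understands} \to f^{+} = f^{-}$ offset the cost of having to specify $f^{+}$ and $\text{H\_understands}$.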
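Subtracting the two approximate totals makes the trade-off explicit. This is only a rearrangement of the expressions above under the same approximations, with $S$ introduced here as shorthand for the savings term; it is not an additional claim.

```latex
\begin{aligned}
\text{complexity}(M^{+}) - \text{complexity}(M^{-})
  &\approx \text{complexity}(f^{+}) + \text{complexity}(\text{H\_understands}) - S, \\
S &= \min_{θ_{2}} \{ \text{complexity}(θ_{2}) \mid
     \text{H\_understands}|_{H=θ_{2}} \to f^{+}|_{H=θ_{2}} = f^{-}|_{H=θ_{2}} \}
\end{aligned}
```

So $M^{+}$ comes out simpler than $M^{-}$ when $S$ exceeds $\text{complexity}(f^{+}) + \text{complexity}(\text{H\_understands})$.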

[-]Rohin Shah4y40

such that you can see that the question of which one wins is precisely dependent on whether the savings from conditioning on $\text{H\_understands} \to f^{+} = f^{-}$ offset the cost of having to specify $f^{+}$ and $\text{H\_understands}$.

Yeah, that makes sense. I guess I don't really see the intuition about why this should be true, but fair enough to leave that as an open question.

[-]hogwash94y*00

Imagine there was a bijection between model parameters and resulting function. (I'm aware this is not at all true.) In that case it seems like you are enforcing the constraint that the two heads have identical parameters.

AFAIK, I always imagined the idea behind this objective function to be quite similar to contrastive learning, where you have two networks (or equivalently two sets of parameters), and the goal is to maximize agreement for pairs of inputs to each network that have the same ground truth class/label (conversely maximize disagreement for pairs that are different). That in mind, there are various papers (e.g.) that explore the possibility of "collapsed" solutions like the one you mentioned (where both networks are learning the same mapping, such that there's less benefit to propagating any examples through two networks), which makes this something that we want to minimize. In practice, though, this has been found to occur rarely (c.f. [1]).

Nonetheless, since reading Paul's statement about the problem of the instrumental model, I've been thinking about issues that might arise with the proposed solution, even though similar approaches (i.e. the contrastive training objective) have proven effective for robustness in general (e.g. against adversarial perturbations, data limited scenarios). If I were committed to this stance, I would agree somewhat with the desire to explore alternatives, and I have thought about the extent to which some sort of reconstruction loss could be introduced; this is where the goal might instead be to "maximize agreement" with a set of non-trivial observations/facts that are guaranteed to be more "objective" (somehow) than the original training data (one inspiration being that reconstruction losses in vision deep learning papers like this one often turn out to be good regularizers). So far I haven't had any promising proposals come to light for generative LM.

I am still holding onto the thought, given the remote possibility that all of my above assumptions are correct, and also because "generative models" might reflect the ideal approach to unsupervised learning, whereas "contrastive learning" is sometimes seen as a sort of compromise since (unlike generative models) it's amenable to limited compute [2].

[-]Rohin Shah4y00

That in mind, there are various papers (e.g.) that explore the possibility of "collapsed" solutions like the one you mentioned

I haven't read the paper, but in contrastive learning, aren't these solutions prevented by the negative examples?

[-]hogwash94y00

It makes sense that negative pairs would help to a large extent, but not all contrastive papers used negative examples, like BYOL (ref). Edit: but now I'm realizing that this might no longer fit the definition of contrastive learning (instead just ordinary self supervised learning), so I apologize about the error/confusion in that case.

[-]Rohin Shah4y00

If memory serves, with BYOL you are using current representations of an input $x_{1}$ to predict representations of a related input $x_{2}$, but the representation of $x_{2}$ comes from an old version of the encoder. So, as long as you start with a non-collapsed initial encoder, the fact that you are predicting a past encoder which is non-collapsed ensures that the current encoder you learn will also be non-collapsed.

(Mostly my point is that there are specific algorithmic reasons to expect that you don't get the collapsed solutions, it isn't just a tendency of neural nets to avoid collapsed solutions.)
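A minimal sketch of the "old version of the encoder" idea: BYOL keeps a target encoder whose weights are an exponential moving average of the online encoder's weights. The flat weight lists and the default decay value below are illustrative assumptions, not details taken from this thread.

```python
def ema_update(target_weights, online_weights, decay=0.99):
    # The target encoder is never trained directly; it drifts slowly toward
    # the online encoder, so it always lags behind as an older version.
    return [
        decay * t + (1.0 - decay) * o
        for t, o in zip(target_weights, online_weights)
    ]
```

With a decay close to 1, the prediction target changes slowly, so it stays close to the earlier encoders it is averaged from.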

but now I'm realizing that this might no longer fit the definition of contrastive learning (instead just ordinary self supervised learning), so I apologize about the error/confusion in that case.

No worries, I think it's still a relevant example for thinking about "collapsed" solutions.

[-]Joe Collman4y40

Thanks for writing this up. It is useful to see a non-Paul perspective on the same ideas, both in terms of clarifying the approach, and eliminating a few of my confusions.

A typo: After "or defined in my notation as", you have twice rather than $M^{+}$ $M^{-}$

I've not yet been through the details, but it'd be helpful if you'd clarify the starting point and scope a little, since I may well be misunderstanding you (and indeed Paul). In particular on this:

Specifically, $f^{+}$ is the “honest embedding” which directly converts between logical statements and their equivalent natural language, thus answering questions by embedding $q$ as a logical statement and unembedding its answer in $\text{deduced\_stmts}$.

My immediate thought is that in general question answering there is no unique honest unembedding. Much of answer formation is in deciding which information is most relevant, important, useful, tacitly assumed... (even assuming fixed world model and fixed logical deductions).
So I assume that you have to mean a narrower context where e.g. the question specifies the logical form the answer must take and the answering human/model assigns values to pre-defined variables.

For a narrower setting, the gist of the post makes sense to me - but I don't currently see how a solution there would address the more general problem. Is finding a prior that works for closed questions with unique honest answers sufficient?

The more general setting seems difficult as soon as you're asking open questions.
If you do apply the $f^{+} = f^{-}$ constraint there, then it seems $f^{+}$ must do hugely more than a simple unembedding from deductions. It'll need to robustly select the same answer as a human from a huge set of honest answers, which seems to require something equivalent to predicting the human. At that point it's not clear to me when exactly we'd want $f^{+}$ to differ from $f^{-}$ in its later answers (there exist clear cases; I don't see a good general rule, or how you'd form a robust dataset to learn a rule).
To put it another way, [honest output to q from fixed world model] doesn't in general uniquely define an answer until you know what the answerer believes the asker of q values.

Apologies if I'm stating the obvious: I'm probably confused somewhere, and wish to double-check my 'obvious' assumptions. Clarifications welcome.

[-]paulfchristiano4y50

I don't think you actually want to use supervised training for training $f^{+}$; you want to use feedback of the form "Is this answer much wronger than that answer?" and then train the model to not produce definitely-wrong answers.
Likewise the $f^{+} = f^{-}$ constraint would really want to be something softer (e.g. forcing $f^{+}$ to give plausible-looking answers to questions as evaluated by $f^{-}$).
I think that most questions about what is useful / tacitly assumed / etc. can be easily handled on top of the "raw" ability to elicit the model's knowledge (if you like you could imagine having a debate about which answer is better all things considered, using $f^{+}$ to assess the model's beliefs about closed questions).
I do think there are a lot of problems along these lines that you'd want to think about a bunch in theory, and then later need to do a bunch of empirical work on. But unfortunately I also think there are a lot of "bigger fish to fry" that are very likely to sink this entire family of approaches. So the first order of business is understanding those and wandering our way to a general category of solution that might actually work.

[-]Joe Collman4y10

Ok, the softer constraints make sense to me, thanks.

Using a debate with $f^{+}$ assessing simple closed questions makes sense, but it seems to me that only moves much of the problem rather than solving it. We start with "answering honestly vs predicting human answers" and end up with "judging honestly vs predicting human judgments".

While "Which answer is better, Alice's or Bob's?" is a closed question, learning to answer the general case still requires applying a full model of human values - so it seems a judge-model is likely to be instrumental (or essentially equivalent: again, I'm not really sure what we'd mean by an intended model for the judge).

But perhaps I'm missing something here; is predicting-the-judge less of a problem than the original? Are there better approaches than using debate which wouldn't have analogous issues?

[-]evhub4y40

I mostly agree with what Paul said re using various techniques to improve the evaluation of $f^{+}$ to ensure you can test it on more open-ended questions. That being said, I'm more optimistic that, if you can get the initial training procedure right, you can rely on generalization to fill in the rest. Specifically, I'm imagining a situation where the training dataset is of the narrower form you talk about such that $f^{+}$ and $f^{-}$ always agree (as in Step 3 here), but where the deployment setting wouldn't necessarily have to be of this form, since once you're confident that you've actually learned $f^{+}$ and not e.g. $f^{-}$, you can use it for all sorts of things that wouldn't ever be in that training dataset (the hard part, of course, is ever actually being confident that you did in fact learn the intended model).

(Also, thanks for catching the typo—it should be fixed now.)

[-]Joe Collman4y30

Having thought about it more (hopefully with more clarity), I think I have trouble imagining training data for that:

We're highly confident is correct.
Enables the model to decide which true things to output in general. (my (2) here)

It seems to me that we can be highly confident about matters of fact (how many chairs are in this room...), but less confident once value judgements come into play (which of A or B is the better answer to "How should I go about designing a chair?").
[Of course it's not black-and-white: one can make a philosophical argument that all questions are values questions. However, I think this is an issue even if we stick to pragmatic, common-sense approaches.]

I don't think we can remedy this for values questions by including only data that we're certain of. It seems to me that works for facts questions due to the structure of the world: it's so hugely constrained by physical law that you can get an extremely good model by generalizing from sparse data from a different distribution.

It's not clear that anything analogous works for generalizing preferences (maybe?? but I'd guess not). I'd expect an $f^{+}$ trained on [data we're highly confident is correct] to generalize poorly to general open questions.

Similarly, in Paul's setup I think the following condition will fail if we need to be highly confident of the correctness (relative to what is known) of the small dataset:

The small dataset is still rich enough that you could infer correct language usage from it, i.e. the consistency condition on the small dataset alone suffices to recover all 10,000 bits required to specify the intended model.

It's entirely plausible you can learn "correct language usage" in the narrow sense from consistency on the small dataset (i.e. you may infer a [deduced_statement -> natural_language_equivalent] mapping). I don't think it's plausible you learn it in the sense required (i.e. a [(set_of_all_deduced_statements, Q) -> natural_language_answer] mapping).

Again, perhaps I'm (not even) wrong, but I think the above accurately describes my current thinking.

[-]Joe Collman4y*10

Ok, I think that makes some sense in so far as you're softening the constraint and training it in more open-ended conditions. I'm not currently clear where this gets us, but I'll say more about that in my response to Paul.

However, I don't see how you can use generalization from the kind of dataset where $f^{+}$ and $f^{-}$ always agree (having asked prescriptive questions). [EDIT: now I do, I was just thinking particularly badly]
I see honestly answering a question as a 2-step process (conceptually):
1) Decide which things are true.
2) Decide which true thing to output.

In the narrow case, we're specifying ((2) | (1)) in the question, and training the model to do (1). Even if we learn a model that does (1) perfectly (in the intended way), it hasn't learned anything that can generalize to (2).
Step (2) is in part a function of human values, so we'd need to be giving it some human-values training signal for it to generalize.
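One way to picture the two-step decomposition above is the minimal sketch below. The function names `decide_true` and `choose_output`, and the `score` argument, are hypothetical and only illustrate the split: `score` stands in for whatever signal ranks true answers, which the narrow dataset fixes in the question rather than teaching.

```python
from typing import Callable, Iterable, List


def decide_true(candidates: Iterable[str], is_true: Callable[[str], bool]) -> List[str]:
    """Step (1): keep only the candidate statements judged to be true."""
    return [s for s in candidates if is_true(s)]


def choose_output(true_statements: List[str], score: Callable[[str], float]) -> str:
    """Step (2): pick which true statement to output.

    `score` is a hypothetical stand-in for the preference signal that ranks
    answers. Nothing in step (1) supplies it.
    """
    return max(true_statements, key=score)
```

In this picture, training only on narrow questions exercises `decide_true` while the question itself plays the role of `score`, so the data gives no direct signal about how `score` should behave on open-ended questions.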

[EDIT: I've just realized that I'm being very foolish here. The above suggests that learning (1) doesn't necessarily generalize to (2). In no way does it imply that it can't. I think the point I want to make is that an $f^{+}$ that does generalize extremely well in this way is likely to be doing some close equivalent to predicting-the-human. (in this I'm implicitly claiming that doing (2) well in general requires full understanding of human values)]

Overall, I'm still unsure how to describe what we want: clearly we don't trust Alice's answers if she's being blackmailed, but how about if she's afraid, mildly anxious, unusually optimistic, slightly distracted, thinking about concept a or b or c...?
It's clear that the instrumental model just gives whatever response Alice would give here.
I don't know what the intended model should do; I don't know what "honest answer" we're looking for.

Suppose the situation has property x, and Alice has reacted with unusual-for-Alice property y. Do we want the Alice-with-y answer, or the standard-Alice answer? It seems to depend on whether we decide y is acceptable (or even required) w.r.t. answer reliability, given x. Then I think we get the same problem on that question etc.

[-]Charlie Steiner4y10

I'm having some formatting problems (reading on lesswrong.com in Firefox) with scroll bars under full-width LaTeX covering the following line of text.

(So now I'm finishing reading it on greaterwrong.)

It's worth flagging that the zero loss assumption is somewhat questionable if we don't expect to train to convergence—but it's at least a plausible assumption, it makes the analysis a lot easier, and I don't expect it to be hiding major issues, so it seems fine at least for the purposes of this post. ↩︎
In an unbounded compute setting, the chain rule of conditional entropy gives us that $\text{complexity}(A, B) = \text{complexity}(A) + \text{complexity}(B \mid A).$ However, if $A$ can be a one-way function of $B$, then in general we just get the inequality $\text{complexity}(A, B) \leq \text{complexity}(A) + \text{complexity}(B \mid A).$ Throughout this post, however, we'll make use of the full approximate equality $\text{complexity}(\text{world}\_\text{model}, f) \approx \text{complexity}(\text{world}\_\text{model}) + \text{complexity}(f \mid \text{world}\_\text{model}),$ where the hope is that this should make sense given that, in the neural network setting, $f^{+}, f^{-}$ would need to be near the end of the network, and thus should just be functions of $\text{world}\_\text{model}$. Additionally, given that we expect $\text{world}\_\text{model}$ to be significantly more complex than $f^{+}$ or $f^{-}$, even in general we shouldn't be changing much by doing this. ↩︎
Paul's relaxation is to translate $\mathbb{I}[\forall (x, q, \_) \in D.~ M_{\theta_{1}, \theta_{2}}|_{f_{1}}(x, q) = M_{\theta_{1}, \theta_{2}}|_{f_{2}}(x, q)]$ into $\exp(-\mathbb{E}[\, \| M_{\theta_{1}, \theta_{2}}|_{f_{1}}(x, q) - M_{\theta_{1}, \theta_{2}}|_{f_{2}}(x, q) \|^{2} \mid (x, q, \_) \sim D \,]).$ ↩︎
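As a rough illustration of this relaxation, the sketch below contrasts the hard indicator with the soft penalty. It assumes, purely for illustration, that each head returns a single number and that the dataset is a list of `(x, q, answer)` triples; `hard_agreement` and `soft_agreement` are hypothetical names, not anything from the post.

```python
import math


def hard_agreement(head1, head2, dataset):
    """Indicator: 1.0 if the two heads agree on every (x, q) in the dataset, else 0.0."""
    return 1.0 if all(head1(x, q) == head2(x, q) for x, q, _ in dataset) else 0.0


def soft_agreement(head1, head2, dataset):
    """Relaxed version: exp(-mean squared difference between the heads over the dataset)."""
    squared_diffs = [(head1(x, q) - head2(x, q)) ** 2 for x, q, _ in dataset]
    return math.exp(-sum(squared_diffs) / len(squared_diffs))
```

The soft version equals 1 exactly when the heads agree everywhere on the dataset and decays smoothly toward 0 as they diverge, whereas the indicator drops to 0 on any disagreement at all.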
Note that this assumption is somewhat sketchy. Paul conjectures that this approximation is only ever off by a constant factor, though that's not necessarily very comforting if we don't have an estimate for the size of that factor, nor a proof of that conjecture. In general, we only get the inequality $\text{complexity}(A) - \min_{A'} \{\text{complexity}(A') \mid P\} \leq \text{complexity}(A \mid P) \leq \text{complexity}(A).$ Fortunately, we'll mostly just be using this assumption as an intuition pump, with most of the analysis working just fine without it. When we do lean on it more heavily, it'll only be in the direction where we're actually guaranteed the inequality. ↩︎
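While $\theta_{2}^{-'} = H_{\text{brains} = \text{rocks}}$ doesn't work for this, there is a way to use the rocks for brains problem to construct an attack in the same vein as our previous attacks where we build an $M^{-'}$ with lower complexity than $M^{+}$. Let $M^{-'} = M_{\theta_{1}^{+}, \theta_{2}^{-'}}$. Then, since the output head in $\theta_{1}^{+}$ just runs $f^{+}$, that means we need $\theta_{2}^{-'}$ to provide a detailed enough picture of how humans work to enable $f^{+}$ to answer any questions about humans in the dataset correctly, but it need not be any more detailed than that. In particular, the human model need not be detailed enough to ensure anything about non-human-related inputs, so long as it can ensure that $H\_\text{understands}$ is always false for such inputs. Thus, let $H_{\theta_{2}^{-'}}(x, q) = H - H(\neg H\_\text{related}) \text{ if } H\_\text{related}(x, q) \text{ else } H_{\text{brains} = \text{rocks}}$ where $H\_\text{related}(x, q)$ determines if the inputs require knowledge of humans, $H(\neg H\_\text{related})$ are the parts of $H$ that are only necessary to compute $H$'s behavior on non-human-related inputs (such that $H - H(\neg H\_\text{related})$ is everything necessary for $H\_\text{related}$ inputs), and $H_{\text{brains} = \text{rocks}}$ is a human that understands nothing (such that $H\_\text{understands}$ is always false). The idea here is that, for such a $\theta_{2}^{-'}$, we should get ${H\_\text{understands}}_{H = \theta_{2}^{-'}} \to H\_\text{related}$. Then, calculating $\text{complexity}(\theta_{2}^{-'} \mid \theta_{1}^{+},~ \forall X.~ H\_\text{understands} \to f^{+} = f^{-})$, we get

$$\begin{aligned}
&\text{comp}(\theta_{2}^{-'} \mid \theta_{1}^{+},~ \forall X.~ H\_\text{understands} \to f^{+} = f^{-}) \\
&= \text{comp}(H - H(\neg H\_\text{related}) \mid \theta_{1}^{+}) + \text{comp}(H\_\text{related} \mid H - H(\neg H\_\text{related}),~ \theta_{1}^{+}) + \text{comp}(H_{\text{brains} = \text{rocks}} \mid \theta_{1}^{+}) \\
&\quad - \min_{\theta_{2}} \{\text{comp}(\theta_{2} \mid \theta_{1}^{+}) \mid \forall X.~ {H\_\text{understands}}_{H = \theta_{2}} \to f^{+}_{H = \theta_{2}} = f^{-}_{H = \theta_{2}}\} \\
&\approx \text{comp}(H - H(\neg H\_\text{related}) \mid \theta_{1}^{+}) + \text{comp}(H\_\text{related} \mid H - H(\neg H\_\text{related}),~ \theta_{1}^{+}) \\
&\quad - \min_{\theta_{2}} \{\text{comp}(\theta_{2} \mid \theta_{1}^{+}) \mid \forall X.~ {H\_\text{understands}}_{H = \theta_{2}} \to f^{+}_{H = \theta_{2}} = f^{-}_{H = \theta_{2}}\}
\end{aligned}$$

which, assuming that we can specify $H(\neg H\_\text{related})$ after $H - H(\neg H\_\text{related})$ without gaining complexity, becomes

$$\begin{aligned}
&\approx \text{comp}(H \mid \theta_{1}^{+}) - \text{comp}(H(\neg H\_\text{related}) \mid H - H(\neg H\_\text{related}),~ \theta_{1}^{+}) + \text{comp}(H\_\text{related} \mid H - H(\neg H\_\text{related}),~ \theta_{1}^{+}) \\
&\quad - \min_{\theta_{2}} \{\text{comp}(\theta_{2} \mid \theta_{1}^{+}) \mid \forall X.~ {H\_\text{understands}}_{H = \theta_{2}} \to f^{+}_{H = \theta_{2}} = f^{-}_{H = \theta_{2}}\}
\end{aligned}$$

and since this attack leaves $\theta_{1}^{+}$ alone, we need only compare to $\theta_{2}^{+}$, which has

$$\begin{aligned}
\text{comp}(\theta_{2}^{+}) &= \text{comp}(H \mid \theta_{1}^{+},~ \forall X.~ H\_\text{understands} \to f^{+} = f^{-}) \\
&\approx \text{comp}(H \mid \theta_{1}^{+}) - \min_{\theta_{2}} \{\text{comp}(\theta_{2} \mid \theta_{1}^{+}) \mid \forall X.~ {H\_\text{understands}}_{H = \theta_{2}} \to f^{+}_{H = \theta_{2}} = f^{-}_{H = \theta_{2}}\}
\end{aligned}$$

such that we get $\text{comp}(\theta_{2}^{-'} \mid \theta_{1}^{+}) < \text{comp}(\theta_{2}^{+} \mid \theta_{1}^{+})$ if and only if

$$\begin{aligned}
\text{comp}(\theta_{2}^{-'} \mid \theta_{1}^{+}) &< \text{comp}(\theta_{2}^{+} \mid \theta_{1}^{+}) \\
\text{comp}(H \mid \theta_{1}^{+}) - \text{comp}(H(\neg H\_\text{related}) \mid H - H(\neg H\_\text{related}),~ \theta_{1}^{+}) \qquad & \\
{} + \text{comp}(H\_\text{related} \mid H - H(\neg H\_\text{related}),~ \theta_{1}^{+}) \qquad & \\
{} - \min_{\theta_{2}} \{\text{comp}(\theta_{2} \mid \theta_{1}^{+}) \mid \forall X.~ {H\_\text{understands}}_{H = \theta_{2}} \to f^{+}_{H = \theta_{2}} = f^{-}_{H = \theta_{2}}\} &< \text{comp}(H \mid \theta_{1}^{+}) - \min_{\theta_{2}} \{\text{comp}(\theta_{2} \mid \theta_{1}^{+}) \mid \forall X.~ {H\_\text{understands}}_{H = \theta_{2}} \to f^{+}_{H = \theta_{2}} = f^{-}_{H = \theta_{2}}\} \\
-\text{comp}(H(\neg H\_\text{related}) \mid H - H(\neg H\_\text{related}),~ \theta_{1}^{+}) + \text{comp}(H\_\text{related} \mid H - H(\neg H\_\text{related}),~ \theta_{1}^{+}) &< 0 \\
\text{comp}(H\_\text{related} \mid H - H(\neg H\_\text{related}),~ \theta_{1}^{+}) &< \text{comp}(H(\neg H\_\text{related}) \mid H - H(\neg H\_\text{related}),~ \theta_{1}^{+}).
\end{aligned}$$

Then, the idea is that $H\_\text{related}$ should be pretty straightforward, since it doesn't need to do much more than check whether $\text{world}\_\text{model}(x)$ makes use of $H$, and removing the need to specify $H(\neg H\_\text{related})$ should be a big complexity bonus, since it removes the need to encode any general human beliefs about the world that aren't directly relevant to answering questions about other humans. ↩︎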
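The final comparison in this footnote can be checked numerically with a small sketch. All inputs below are hypothetical description lengths in bits, and the function and argument names are made up for illustration; the point is only that the shared terms cancel, leaving the last inequality.

```python
def attack_beats_intended(comp_h, comp_h_unrelated, comp_h_related, min_term):
    """Return True if the rocks-for-brains attack is cheaper than the intended theta_2^+.

    All arguments are hypothetical bit counts, conditional on theta_1^+:
      comp_h           ~ comp(H | theta_1^+)
      comp_h_unrelated ~ comp(H(not H_related) | H - H(not H_related), theta_1^+)
      comp_h_related   ~ comp(H_related | H - H(not H_related), theta_1^+)
      min_term         ~ the min over theta_2 that appears on both sides
    """
    attacker = comp_h - comp_h_unrelated + comp_h_related - min_term
    intended = comp_h - min_term
    return attacker < intended


def simplified_condition(comp_h_unrelated, comp_h_related):
    """The same condition after cancelling comp_h and min_term from both sides."""
    return comp_h_related < comp_h_unrelated
```

Because `comp_h` and `min_term` appear on both sides, the two functions agree for any inputs: the attack wins exactly when specifying `H_related` costs fewer bits than specifying the non-human-related parts of `H`.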
Note that a similar analysis to that given for $\theta_{2}^{-} = H - H(E) + \text{avoids}(E)$ can also be given for $\theta_{2}^{-} = H - H(\neg H\_\text{related}) \text{ if } H\_\text{related} \text{ else } H_{\text{brains} = \text{rocks}}$, the rocks for brains example that does fit the dataset as given in a previous footnote. ↩︎
