Teaching ML to answer questions honestly instead of predicting human answers

[-]A Ray5y40

I feel overall confused, but I think that's mostly because of me missing some relevant background to your thinking, and the preliminary/draft nature of this.

I hope sharing my confusions is useful to you. Here they are:

I'm not sure how the process of "spending bits" works. If the space of possible models was finite and discretized, then you could say spending bits is partitioning down to "1/2^B"th of the space -- but this is not at all how SGD works, and seems incompatible with using SGD (or any optimizer that doesn't 'teleport' through parameter space) as the optimization algorithm.

Spending bits does make sense in terms of naive rejection sampling (but I think we agree this would be intractably expensive) and other cases of discrete optimization like integer programming. It's possible I would be less confused if this was explained using a different optimization algorithm, like BFGS or some hessian-based method or maybe a black-box bayesian solver.

Separately, I'm not sure why the two heads wouldn't just end up being identical to each other. In shorter-program-length priors (which seem reasonable in this case; also minimal-description-length and sparse-factor-graph, etc etc) it seems like weight-tying the two heads or otherwise making them identical.

Lastly, I think I'm confused by your big formula for the unnormalized posterior log probability of (, $θ_{2}$ ) -- I think the most accessible of my confusions is that it doesn't seem to pass "basic type checking consistency".

I know the output should be a log probability, so all the added components should be logprobs/in terms of bits.

The L() term makes sense, since it's given in terms of bits.

The two parameter distances seem like they're in whatever distance metric you're using for parameter space, which seems to be very different from the logprobs. Maybe they both just have some implicit unit conversion parameter out front, but I think it'd be surprising if it were the case that every "1 parameter unit" move through parameter space is worth "1 nat" of information. For example, it's intuitive to me that some directions (towards zero) would be more likely than other directions.

The C() term has a lagrange multiplier, which I think are usually unitless. In this case I think it's safe to say it's also maybe doing units conversion. C() itself seems to possibly be in terms of bits/nats, but that isn't clear.

In normal lagrangian constrained optimization, lambda would be the parameter that gives us the resource tradeoff "how many bits of (L) loss on the data set tradeoff with a single bit (C) of inconsistency"

Finally the integral is a bit tricky for me to follow. My admittedly-weak physics intuitions are usually that you only want to take an exponential (or definitely a log-sum-exp like this) of unitless quantities, but it looks like it has the maybe the unit of our distance in parameter space. That makes it weird to integrate over possible parameter, which introduces another unit of parameter space, and then take the logarithm of it.

(I realize that unit-type-checking ML is pretty uncommon and might just be insane, but it's one of the ways I try to figure out what's going on in various algorithms)

Looking forward to reading more about this in the future.

[-]Rohin Shah4y20

I realize that unit-type-checking ML is pretty uncommon and might just be insane

Nah, it's a great trick.

The two parameter distances seem like they're in whatever distance metric you're using for parameter space, which seems to be very different from the logprobs.

The trick here is that L2 regularization / weight decay is equivalent to having a Gaussian prior on the parameters, so you can think of that term as (minus an irrelevant additive constant), where $σ$ is set to imply whatever hyperparameter you used for your weight decay.

This does mean that you are committing to a Gaussian prior over the parameters. If you wanted to include additional information like "moving towards zero is more likely to be good" then you would not have a Gaussian centered at $θ_{0}$ , and so the corresponding log prob would not be the nice simple "L2 distance to $θ_{0}$ ".

My admittedly-weak physics intuitions are usually that you only want to take an exponential (or definitely a log-sum-exp like this) of unitless quantities, but it looks like it has the maybe the unit of our distance in parameter space. That makes it weird to integrate over possible parameter, which introduces another unit of parameter space, and then take the logarithm of it.

I think this intuition is correct, and the typical solution in ML algorithms is to empirically scale all of your quantities such that everything works out (which you can interpret from the unit-checking perspective as "finding the appropriate constant to multiply your quantities by such that they become the right kind of unitless").

[-]Joe Collman5y30

Very interesting, thanks.

Just to check I'm understanding correctly, in step 2, do you imagine the complex labelling process deferring to the simple process iff the simple process is correct (according to the complex process)? Assuming that we require precise agreement, something of that kind seems necessary to me.

I.e. the labelling process would be doing something like this:

# Return a pair of (simple, complex) labels for a given input
simple_label = GenerateSimpleLabel(input)
if is_correct(simple_label, input):
return simple_label, simple_label
else:
return simple_label, GenerateComplexLabel(input)

Does that make sense?

A couple of typos:
"...we are only worried if the model [understands? knows?] the dynamics..."
"...don’t collect training data in situations ~~without~~ [where?] strong adversaries are trying..."

[-]paulfchristiano5y30

I think "deferring" was a bad word for me to use. I mostly imagine the complex labeling process will just independently label data, and then only include datapoints when there is agreement. That is, you'd just always return the (simple, complex) pair, and is-correct basically just tests whether they are equal.

I said "defer" because one of the data that the complex labeling process uses may be "what a human who was in the room said," and this may sometimes be a really important source of evidence. But that really depends on how you set things up, if you have enough other signals then you would basically always just ignore that one.

(That said, I think probably amplification is the most important difference between the simple and complex labeling processes, because that's the only scalable way to inject meaningful amounts of extra complexity into the complex labeling process---since the ML system can't predict itself very well, it forces it to basically try to win a multiplayer game with copies of itself, and we hope that's more complicated. And if that's the case then the simple labeling process may as well use all of the data sources, and the difference is just how complex a judgment we are making using those inputs.)

[-]Joe Collman5y10

Ok, that all makes sense, thanks.

...and is-correct basically just tests whether they are equal.

So here "equal" would presumably be "essentially equal in the judgement of complex process", rather than verbatim equality of labels (the latter seems silly to me; if it's not silly I must be missing something).

[-]paulfchristiano5y30

I think they need to be exactly equal. I think this is most likely accomplished by making something like pairwise judgments and only passing judgment when the comparison is a slam dunk (as discussed in section 3). Otherwise the instrumental policy will outperform the intended policy (since it will do the right thing when the simple labels are wrong).

[-]Rohin Shah4y20

Planned summary for the Alignment Newsletter:

This post presents an algorithm that aims to solve the second problem from the highlighted post. As a reminder, the second problem is that an AI system that already has to make predictions about humans might learn a policy that is just “say what humans would say”, since that is simpler than learning another translation that maps its knowledge to human language (so that it can answer honestly to the best of its knowledge).
The core idea is to have a “simple” labeling process and a “complex” labeling process, where the complex process can catch errors from the simple process. We’ll also assume we have a simple / complex dataset (labeled by the simple / complex processes respectively), where we are confident that all of the answers in the dataset are correct. This means that the complex dataset has more challenging questions than the simple one. The _simple / complex instrumental policy_ is the policy that predicts what the simple / complex labeling process would do.
As an inefficient first algorithm, we can train our model on answers from the simple process where we are confident in their correctness, and then simply check whether the model performs well on new, harder questions where the complex and simple processes disagree. If the model fails this check, we restart from scratch. Intuitively, when the model eventually passes the check, it has probably learned the intended policy, as that’s the one which would generalize zero-shot to complex questions as well (and in particular, the simple instrumental policy would _not_ generalize in this way).
Of course, we need a more efficient algorithm than that. Instead of creating an algorithm where the intended policy would work but the instrumental policy wouldn’t, could we instead change the inductive biases so that the intended policy is _likely_ while the instrumental policy is not? Our original worry was that the intended policy has to bake in two implementations of language -- one in its world model, and one when translating answers into human-understandable concepts. So we could instead try to train a model that learns language from the simple instrumental policy, but is also trained on the complex dataset. The hope would be that the intended policy can learn the second implementation of language “for free” from the simple instrumental policy, while still working on the complex dataset. The actual details are quite complex and I’m not going to go into them here.
[This post](https://www.alignmentforum.org/posts/gEw8ig38mCGjia7dj/answering-questions-honestly-instead-of-predicting-human) by Evan Hubinger points out some problems and potential solutions with the approach.

[-]Rohin Shah4y20

Some confusions I have:

Why does need to include part of the world model? Why not instead have $θ_{1}$ be the parameters of the two heads, and $θ_{2}$ be the parameters of the rest of the model?

This would mean that you can’t initialize $θ_{2}$ to be equal to $θ_{1}$ , but I don’t see why that’s necessary in the first place -- in particular it seems like the following generative model should work just fine:

$P (θ_{1}) \propto exp (- | | θ_{1} - θ_{1, i n i t} | |^{2})$

$P (θ_{2} ∣ θ_{1}) \propto exp (- λ C (θ_{1}, θ_{2}) - | | θ_{2} - θ_{2, i n i t} | |^{2})$

(I’ll be thinking of this setup for the rest of my comment, as it makes more sense to me)

When differentiating the consistency test C we should treat the intended head as fixed rather than differentiating through it. This removes SGD’s incentive to achieve consistency by e.g. making sure the world is simple and so all questions have simple answers.

Hmm, why is this necessary? It seems like the whole point of $L (θ_{2})$ is to ensure that you have to learn a detailed world model that gets you the right answers. I guess as $λ \to \infty$ , that doesn't really help you, but really you shouldn't have $λ \to \infty$ because you shouldn't expect to be able to have $C (θ_{1}, θ_{2}) \to 0$ .

(Also, shouldn't that be $L (θ_{1}, θ_{2})$ , since it is $θ_{1}$ and $θ_{2}$ together that compute answers to questions?)

[-]Gordon Seidoh Worley5y20

I want to consider models that learn to predict both “how a human will answer question Q” (the instrumental model) and “the real answer to question Q” (the intended model). These two models share almost all of their computation — which is dedicated to figuring out what actually happens in the world. They differ only when it comes time to actually extract the answer. I’ll describe the resulting model as having a “world model,” an “instrumental head,” and an “intended head.”

This seems massively underspecified in that it's really unclear to me what's actually different between the instrumental and intended models.

I say this because you posit the intended model gives "the real answer", but I don't see a means offered by which to tell "real" answers from "fake" ones. Further, for somewhat deep philosophical reasons, I also don't expect there is any such thing as a "real" answer anway, only one that is more or less useful to some purpose, and since ultimately it's humans setting this all up, any "real" answer is ultimately a human answer.

The only difference I can find seems to be a subtle one about whether or not you're directly or indirectly imitating human answers, which is probably relevant for dealing with a class of failure modes like overindexing on what humans actually do vs. what we would do if we were smarter, knew more, etc. but also still leaves you human imitation since there's still imitation of human concerns taking place.

Now, that actually sounds kinda good to me, but it's not what you seem to be explicitly saying when you talk about the instrumental and intended model.

[-]paulfchristiano5y30

I don't think anyone has a precise general definition of "answer questions honestly" (though I often consider simple examples in which the meaning is clear). But we do all understand how "imitate what a human would say" is completely different (since we all grant the possibility of humans being mistaken or manipulated), and so a strong inductive bias towards "imitate what a human would say" is clearly a problem to be solved even if other concepts are philosophically ambiguous.

Sometimes a model might say something like "No one entered the datacenter" when what they really mean is "Someone entered the datacenter, got control of the hard drives with surveillance logs, and modified them to show no trace of their presence." In this case I'd say the answer is "wrong;" when such wrong answers appear as a critical part of a story about catastrophic failure, I'm tempted to look at why they were wrong to try to find a root cause of failure, and to try to look for algorithms that avoid the failure by not being "wrong" in the same intuitive sense. The mechanism in this post is one way that you can get this kind of wrong answer, namely by imitating human answers, and so that's something we can try to fix.

On my perspective, the only things that are really fundamental are:

Algorithms to train ML systems. These are programs you can run.
Stories about how those algorithms lead to bad consequences. These are predictions about what could/would happen in the world. Even if they aren't predictions about what observations a human would see, they are the kind of thing that we can all recognize as a prediction (unless we are taking a fairly radical skeptical perspective which I don't really care about engaging with).

Everything else is just a heuristic to help us understand why an algorithm might work or where we might look for a possible failure story.

I think this is one of the upsides of my research methodology---although it requires people to get on the same page about algorithms and about predictions (of the form "X could happen"), we don't need to start on the same page about all the other vague concepts. Instead we can develop shared senses of those concepts over time by grounding them out in concrete algorithms and failure stories. I think this is how shared concepts are developed in most functional fields (e.g. in mathematics you start with a shared sense of what constitutes a valid proof, and then build shared mathematical intuitions on top of that by seeing what successfully predicts your ability to write a proof).

[-]Gordon Seidoh Worley5y10

Stories about how those algorithms lead to bad consequences. These are predictions about what could/would happen in the world. Even if they aren't predictions about what observations a human would see, they are the kind of thing that we can all recognize as a prediction (unless we are taking a fairly radical skeptical perspective which I don't really care about engaging with).

In the spirit then of caring about stories about how algorithms lead to bad consequences, a story about how I see not making a clear distinction between instrumental and intended models might come to bite you.

Let's use your example of a model that reports "no one entered the data center". I might think the right answer is that "no one entered the data center" when I in fact know that physically someone was in the datacenter but they were an authorized person. If I'm reporting this in the context of asking about a security breach, saying "no one entered the data center" when I more precisely mean "no unauthorized person entered the data center" might be totally reasonable.

In this case there's some ambiguity about what reasonably counts as "no one". This is perhaps somewhat contrived, but category ambiguity is a cornerstone of linguistic confusion and where I see the division between instrumental and intended models breaking down. I think there are probably some chunk of things we could screen off by making this distinction that are obviously wrong (e.g. the model that tries to tell me "no one entered the data center" when in fact, even given my context of a security breach, some unauthorized person did entered the data center), and that seems useful, so I'm mainly pushing on the idea here that your approach here seems insufficient for addressing alignment concerns on its own.

Not that you necessarily thought it was, but this seems like the relevant kind of issue to want to consider here.

[-]adamShimi4y10

Reading this thread, I wonder if the apparent disagreement doesn't come from the use of the world "honestly". The way I understand Paul's statement of the problem is that "answer questions honestly" could be replaced by "answer questions appropriately to the best of your knowledge". And his point is that "answer what a human would have answered" is not a good proxy for that (yet still an incentivized one due to how we train neural nets)

From my reading of it, this post's proposal does provide some plausible ways to incentivize the model to actual search for appropriate answers instead of the ones human would have given, and I don't think it assumes the existence of true categories and/or essences.

[-]John Schulman5y10

Isn't the Step 1 objective (the unnormalized posterior log probability of (θ₁, θ₂)) maximized at θ₁ = θ₂=argmax L + prior? Also, I don't see what this objective has to do with learning a world model.

[-]paulfchristiano5y30

Also, I don't see what this objective has to do with learning a world model.

The idea is to address a particular reason that your learned model would "copy a human" rather than "try to answer the question well." Namely, the model already contains human-predictors, so building extra machinery to answer questions (basically translating between the world model and natural language) would be more inefficient than just using the existing human predictor. The hope is that this alternative loss allows you to use the translation machinery to compress the humans, so that it's not disfavored by the prior.

I don't think it's intrinsically related to learning a world model, it's just an attempt to fix a particular problem.

To the extent that there is a problem with the proposed approach---either a reason that this isn't a real problem in the standard approach, or a reason that this proposed approach couldn't address the problem (or would inevitably introduce some other problem)---then I'm interested in that.

Isn't the Step 1 objective (the unnormalized posterior log probability of (θ₁, θ₂)) maximized at θ₁ = θ₂=argmax L + prior?

Why would it be maximized there? Isn't it at least better to make ?

And then in the section I'm trying to argue that the final term (the partition function) in the loss means that you can potentially get a lower loss by having $θ_{1}$ push apart the two heads in such a way that improving the quality of the model pushes them back together. I'm interested in anything that seems wrong in that argument.

(I don't particularly believe this particular formulation is going to work, e.g. because the L2 regularizer pushes θ₁ to adjust each parameter halfway, while the intuitive argument kind of relies on it being arbitrary what you put into θ₁ or θ₂, as it would be under something more like an L1 regularizer. But I'm pretty interested in this general approach.)

Two caveats were: (i) this isn't going to actually end up actually making any alternative models lower loss, it's just going to level the playing field such that a bunch of potential models have similar loss (rather than an inductive bias in favor of the bad models), (ii) in order for that to be plausible you need to have a stop grad on one of the heads in the computation of C, I maybe shouldn't have push that detail so late.

[-]John Schulman5y40

D'oh, re: the optimum of the objective, I now see that the solution is nontrivial. Here's my current understanding.

Intuitively, the MAP version of the objective says: find me a simple model theta1 such that there's more-complex theta2 with high likelihood under p(theta2|theta1) (which corresponds to sampling theta2 near theta1 until theta2 satisfies head-agreement condition) and high data-likelihood p(data|theta2).

And this connects to the previous argument about world models and language as follows: we want theta1 to contain half a world model, and we want theta2 to contain the full world model and high data-likelihood (for one of the head) and the two heads agree. Based on Step1, the problem is still pretty underconstrained, but maybe that's resolved in Step 2.

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

36

Teaching ML to answer questions honestly instead of predicting human answers

36

The problem

Aside on imitative generalization

Step 1: make the intended model pay for itself

Intuition

Rough plan

Why might this work?

The actual algorithm

Step 2: Give the intended model a leg up

Step 3: Make the training set good enough