Having thought about it more (hopefully with more clarity), I have trouble imagining suitable training data for f+:
It seems to me that we can be highly confident about matters of fact (how many chairs are in this room...), but less confident once value judgements come into play (which of A or B is the better answer to "How should I go about designing a chair?").[Of course it's not black-and-white: one can make a philosophical argument that all questions are values questions. However, I think this is an issue even if we stick to pragmatic, common-sense approaches.]
I don't think we can remedy this for values questions by including only data that we're certain of. It seems to me that works for facts questions due to the structure of the world: it's so hugely constrained by physical law that you can get an extremely good model by generalizing from sparse data from a different distribution.
It's not clear that anything analogous works for generalizing preferences (maybe?? but I'd guess not). I'd expect an f+ trained on [data we're highly confident is correct] to generalize poorly to general open questions.
Similarly, in Paul's setup I think the following condition will fail if we need to be highly confident of the correctness (relative to what is known) of the small dataset:
It's entirely plausible you can learn "correct language usage" in the narrow sense from consistency on the small dataset (i.e. you may infer a [deduced_statement -> natural_language_equivalent] mapping). I don't think it's plausible you learn it in the sense required (i.e. a [(set_of_all_deduced_statements, Q) -> natural_language_answer] mapping).
Again, perhaps I'm (not even) wrong, but I think the above accurately describes my current thinking.
Ok, the softer constraints make sense to me, thanks.
Using a debate with f+ assessing simple closed questions makes sense, but it seems to me that only moves much of the problem rather than solving it. We start with "answering honestly vs predicting human answers" and end up with "judging honestly vs predicting human judgments".
While "Which answer is better, Alice's or Bob's?" is a closed question, learning to answer the general case still requires applying a full model of human values - so it seems a judge-model is likely to be instrumental (or essentially equivalent: again, I'm not really sure what we'd mean by an intended model for the judge).
But perhaps I'm missing something here; is predicting-the-judge less of a problem than the original? Are there better approaches than using debate which wouldn't have analogous issues?
Ok, I think that makes some sense in so far as you're softening the f+=f− constraint and training it in more open-ended conditions. I'm not currently clear where this gets us, but I'll say more about that in my response to Paul.
However, I don't see how you can use generalization from the kind of dataset where f+ and f− always agree (having asked prescriptive questions). [EDIT: now I do, I was just thinking particularly badly]

I see honestly answering a question as a 2-step process (conceptually):
1) Decide which things are true.
2) Decide which true thing to output.
In the narrow case, we're specifying ((2) | (1)) in the question, and training the model to do (1). Even if we learn a model that does (1) perfectly (in the intended way), it hasn't learned anything that can generalize to (2).

Step (2) is in part a function of human values, so we'd need to be giving it some human-values training signal for it to generalize.
[EDIT: I've just realized that I'm being very foolish here. The above suggests that learning (1) doesn't necessarily generalize to (2). In no way does it imply that it can't. I think the point I want to make is that an f+ that does generalize extremely well in this way is likely to be doing some close equivalent to predicting-the-human. (in this I'm implicitly claiming that doing (2) well in general requires full understanding of human values)]
Overall, I'm still unsure how to describe what we want: clearly we don't trust Alice's answers if she's being blackmailed, but how about if she's afraid, mildly anxious, unusually optimistic, slightly distracted, thinking about concept a or b or c...?

It's clear that the instrumental model just gives whatever response Alice would give here. I don't know what the intended model should do; I don't know what "honest answer" we're looking for.
If the situation has property x, and Alice has reacted with unusual-for-Alice property y, do we want the Alice-with-y answer, or the standard-Alice answer? It seems to depend on whether we decide y is acceptable (or even required) w.r.t. answer reliability, given x. Then I think we get the same problem on that question etc.
Thanks for writing this up. It is useful to see a non-Paul perspective on the same ideas, both in terms of clarifying the approach and eliminating a few of my confusions.

A typo: after "or defined in my notation as", you have M+ twice rather than M+ M−.
I've not yet been through the details, but it'd be helpful if you'd clarify the starting point and scope a little, since I may well be misunderstanding you (and indeed Paul). In particular on this:
Specifically, f+ is the “honest embedding” which directly converts between logical statements and their equivalent natural language, thus answering questions by embedding q as a logical statement and unembedding its answer in deduced_stmts.
My immediate thought is that in general question answering there is no unique honest unembedding. Much of answer formation is in deciding which information is most relevant, important, useful, tacitly assumed... (even assuming fixed world model and fixed logical deductions).

So I assume that you have to mean a narrower context where e.g. the question specifies the logical form the answer must take and the answering human/model assigns values to pre-defined variables.
For a narrower setting, the gist of the post makes sense to me - but I don't currently see how a solution there would address the more general problem. Is finding a prior that works for closed questions with unique honest answers sufficient?
The more general setting seems difficult as soon as you're asking open questions.

If you do apply the f+=f− constraint there, then it seems f+ must do hugely more than a simple unembedding from deductions. It'll need to robustly select the same answer as a human from a huge set of honest answers, which seems to require something equivalent to predicting the human. At that point it's not clear to me when exactly we'd want f+ to differ from f− in its later answers (there exist clear cases; I don't see a good general rule, or how you'd form a robust dataset to learn a rule).

To put it another way, [honest output to q from fixed world model] doesn't in general uniquely define an answer until you know what the answerer believes the asker of q values.
Apologies if I'm stating the obvious: I'm probably confused somewhere, and wish to double-check my 'obvious' assumptions. Clarifications welcome.
Ok, that all makes sense, thanks.
...and is-correct basically just tests whether they are equal.
So here "equal" would presumably be "essentially equal in the judgement of the complex process", rather than verbatim equality of labels (the latter seems silly to me; if it's not silly I must be missing something).
Very interesting, thanks.
Just to check I'm understanding correctly, in step 2, do you imagine the complex labelling process deferring to the simple process iff the simple process is correct (according to the complex process)? Assuming that we require precise agreement, something of that kind seems necessary to me.
I.e. the labelling process would be doing something like this:
```python
def generate_labels(input):
    # Return a pair of (simple, complex) labels for a given input
    simple_label = GenerateSimpleLabel(input)
    if is_correct(simple_label, input):
        return simple_label, simple_label
    else:
        return simple_label, GenerateComplexLabel(input)
```
Does that make sense?
A couple of typos:
- "...we are only worried if the model [understands? knows?] the dynamics..."
- "...don’t collect training data in situations without [where?] strong adversaries are trying..."
Thanks, that's very helpful. It still feels to me like there's a significant issue here, but I need to think more. At present I'm too confused to get much beyond handwaving. A few immediate thoughts (mainly for clarification; not sure anything here merits response):
I'm going to try writing up the core of my worry in more precise terms. It's still very possible that any non-trivial substance evaporates under closer scrutiny.
I'd be interested in your thoughts on human motivation in HCH and amplification schemes. Do you see motivational issues as insignificant / a manageable obstacle / a hard part of the problem...?
Specifically, it concerns me that every H will have preferences valued more highly than [completing whatever task we assign], so would be expected to optimise its output for its own values rather than the assigned task, where these objectives diverged. In general, output needn't relate to the question/task.

[I don't think you've addressed this at all recently - I've only come across specifying enlightened judgement precisely]
I'd appreciate it if you could say if/where you disagree with the following kind of argument; I'd like to know what I'm missing:
Motivation seems like an eventual issue for imitative amplification. Even for an H who always attempted to give good direct answers to questions in training, the best models at predicting H's output would account for differing levels of enthusiasm, focus, effort, frustration... based in part on H's attitude to the question and the opportunity cost in answering it directly.
The 'correct' (w.r.t. alignment preservation) generalisation must presumably be in all circumstances to give the output that H would give. In scenarios where H wouldn't directly answer the question (e.g. because H believed the value of answering the question were trivial relative to opportunity cost), this might include deception, power-seeking etc. Usually I'd expect high value true-and-useful information unrelated to the question; deception-for-our-own-good just can't be ruled out.
If a system doesn't always adapt to give the output H would, on what basis do we trust it to adapt in ways we would endorse? It's unclear to me how we avoid throwing the baby out with the bathwater here.
Or would you expect to find Hs for whom such scenarios wouldn't occur? This seems unlikely to me: opportunity cost would scale with capability, and I'd predict every H would have their price (generally I'm more confident of this for precisely the kinds of H I'd want amplified: rational, altruistic...).
If we can't find such Hs, doesn't this at least present a problem for detecting training issues?: if HCH may avoid direct answers or deceive you (for worthy-according-to-H reasons), then an IDA of that H eventually would too. At that point you'd need to distinguish [benign non-question-related information] and [benevolent deception] from [malign obfuscation/deception], which seems hard (though perhaps no harder than achieving existing oversight desiderata???).
Even assuming that succeeds, you wouldn't end up with a general-purpose question-answerer or task-solver: you'd get an agent that does whatever an amplified [model predicting H-diligently-answering-training-questions] thinks is best. This doesn't seem competitive across enough contexts.
...but hopefully I'm missing something.
Unless I've confused myself badly (always possible!), I think either's fine here. The | version just takes out a factor that'll be common to all hypotheses: [p(e+) / p(e-)]. (since p(Tk & e+) ≡ p(Tk | e+) * p(e+))
Since we'll renormalise, common factors don't matter. Using the | version felt right to me at the time, but whatever allows clearer thinking is the way forward.
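A minimal numerical sketch of the common-factor point (the hypotheses, weights and factor below are made up purely for illustration): dividing every hypothesis's weight by the same p(e+) changes nothing once we renormalise.

```python
# Sketch: a factor common to all hypotheses (e.g. p(e+)) washes out
# under renormalisation, so p(Tk & e+) and p(Tk | e+) give the same
# posterior distribution. All numbers here are invented.

def normalise(weights):
    total = sum(weights.values())
    return {h: w / total for h, w in weights.items()}

# Unnormalised weights standing in for p(Tk & e+).
joint = {"T1": 0.04, "T8": 0.28, "T13+": 0.10}

p_e_plus = 0.42  # made-up common factor, p(e+)

# Weights standing in for p(Tk | e+) = p(Tk & e+) / p(e+).
conditional = {h: w / p_e_plus for h, w in joint.items()}

posterior_joint = normalise(joint)
posterior_cond = normalise(conditional)

for h in joint:
    assert abs(posterior_joint[h] - posterior_cond[h]) < 1e-12
```

Either bookkeeping convention lands on the same normalised posterior, which is all that matters for comparing hypotheses.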
Taking your last point first: I entirely agree on that. Most of my other points were based on the implicit assumption that readers of your post don't think something like "It's directly clear that 9 OOM will almost certainly be enough, by a similar argument".
Certainly if they do conclude anything like that, then it's going to massively drop their odds on 9-12 too. However, I'd still make an argument of a similar form: for some people, I expect that argument may well increase the 5-8 range more (than proportionately) than the 1-4 range.
On (1), I agree that the same goes for pretty-much any argument: that's why it's important. If you update without factoring in (some approximation of) your best judgement of the evidence's impact on all hypotheses, you're going to get the wrong answer. This will depend highly on your underlying model.
On the information content of the post, I'd say it's something like "12 OOMs is probably enough (without things needing to scale surprisingly well)". My credence for low OOM values is mostly based on worlds where things scale surprisingly well.
But this is a bit weird; my post didn't talk about the <7 range at all, so why would it disproportionately rule out stuff in that range?
I don't think this is weird. What matters isn't what the post talks about directly - it's the impact of the evidence provided on the various hypotheses. There's nothing inherently weird about evidence increasing our credence in [TAI by +10OOM] and leaving our credence in [TAI by +3OOM] almost unaltered (quite plausibly because it's not too relevant to the +3OOM case).
Compare the 1-2-3 coins example: learning y tells you nothing about the value of x. It's only ruling out any part of the 1 outcome in the sense that it maintains [x_heads & something independent is heads], and rules out [x_heads & something independent is tails]. It doesn't need to talk about x to do this.
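My reading of the coins point, sketched concretely (the exact 1-2-3 setup is my guess: outcome 1 iff x is heads, outcomes 2 and 3 splitting x-tails by the independent coin y):

```python
from fractions import Fraction

# Two independent fair coins, x and y; exact arithmetic via Fraction.
joint = {}  # (x, y) -> probability
for x in ("H", "T"):
    for y in ("H", "T"):
        joint[(x, y)] = Fraction(1, 4)

def outcome(x, y):
    # Guessed outcome labelling: 1 iff x heads; y splits the x-tails mass.
    if x == "H":
        return 1
    return 2 if y == "H" else 3

# Condition on learning y == "H".
kept = {xy: p for xy, p in joint.items() if xy[1] == "H"}
total = sum(kept.values())
posterior = {xy: p / total for xy, p in kept.items()}

p_outcome1 = sum(p for xy, p in posterior.items() if outcome(*xy) == 1)
p_x_heads = sum(p for xy, p in posterior.items() if xy[0] == "H")

# Learning y rules out [x_heads & y_tails] and keeps [x_heads & y_heads],
# yet P(outcome 1) = P(x = heads) stays at exactly 1/2.
assert p_outcome1 == Fraction(1, 2)
assert p_x_heads == Fraction(1, 2)
```

So the evidence "cuts through" outcome 1's probability mass without moving our credence in it, exactly because y is independent of x.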
You can do the same thing with the [TAI first at k OOM] case - call that Tk. Let's say that your post is our evidence e, and that e+ stands for [e gives a compelling argument against T13+].

Updating on e+ you get the following for each k:
Initial hypotheses: [Tk & e+], [Tk & e-]
Final hypothesis: [Tk & e+]
So what ends up mattering is the ratio p[Tk | e+] : p[Tk | e-].

I'm claiming that this ratio is likely to vary with k.
Specifically, I'd expect T1 to be almost precisely independent of e+, while I'd expect T8 to be correlated. My reasoning on T1 is that something radically unexpected would need to occur for T1 to hold, and your post just doesn't seem to give any evidence for or against that.

I expect most people would change their T8 credence on seeing the post and accepting its arguments (if they've not thought similar things before). The direction would depend on whether they thought the post's arguments could apply equally well to ~8 OOM as to 12.
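To put made-up numbers on this (the priors and likelihoods below are purely hypothetical, not taken from the post): if p(e+ | T1) is close to p(e+) while p(e+ | T8) is higher and p(e+ | T13+) lower, then updating on e+ leaves T1 nearly untouched, boosts T8, and shrinks T13+.

```python
# Hypothetical priors and likelihoods p(e+ | Tk) - invented numbers,
# chosen only to show how the likelihood ratios drive the update.
priors = {"T1": 0.10, "T8": 0.40, "T13+": 0.50}
likelihoods = {"T1": 0.40, "T8": 0.70, "T13+": 0.20}

# p(e+) = sum_k p(Tk) * p(e+ | Tk)
p_e_plus = sum(priors[k] * likelihoods[k] for k in priors)  # 0.42 here

# Bayes: p(Tk | e+) = p(Tk) * p(e+ | Tk) / p(e+)
posterior = {k: priors[k] * likelihoods[k] / p_e_plus for k in priors}

# T1 barely moves (0.10 -> ~0.095), since p(e+ | T1) ~ p(e+);
# T8 rises substantially and T13+ falls to compensate.
```

The same mechanics apply with a full range of Tk hypotheses; only the per-k ratios matter.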
Note that I am assuming the argument ruling out 13+ OOM is as in the post (or similar). If it could take any form, then it could be a more or less direct argument for T1.
Overall, I'd expect most people who agree with the post's argument to update along the following lines (but smoothly):
T0 to Ta: low increase in credence
Ta to Tb: higher increase in credence
Tb+: reduced credence
with something like (0 < a < 6) and (4 < b < 13).I'm pretty sure a is going to be non-zero for many people.