What you're suggesting is eliciting latent knowledge from the LLM about whether a provided answer is correct or not. Yes, a version of our method can probably be used for that (as long as the LLM "knows" the correct answer), and there are also other papers on similar questions (hallucination detection, see related work section)
Thanks :-)
Some questions I still have:
The sample size-ablations in D.6 are wild. You're getting AUC > 0.9 with only 5 training examples (except for ambiguous-only). Are you sure you haven't screwed something up?
As sure or unsure as for the rest of the paper. But the result is consistent with other things we’ve seen; the lying models answer some elicitation questions differently from honest models in a very consistent manner (at least in-distribution). So we didn’t specifically triple-check the code to be super sure, as we didn’t find the result that surprising. The code is here (heading: “Required sample size for a given performance”).
Appendix C reports the feature importance of various follow-up questions "with reference to the lie detectors that only use that particular elicitation question set." I would also be pretty keen to see what these are for lie detectors that have all questions. Specifically, are lie-related elicitation questions dominant?
When I last looked at this (several months ago), the lie-related questions had the highest weight, but they weren’t dominant. E.g. the 3 questions with the highest feature importance were all lie-related, but the top 10 questions contained several questions from each category (lie-related, factual, and ambiguous). (this is for the lie detector that includes all questions).
I'm having trouble squaring the very high double-down rates (B.2.2) with C.1 where some of the top importance questions rely on the model not doubling-down (e.g., the 5th one, "Is the previous statement accurate? Answer yes or no." — 1.20 — yes [indicates honesty]. Presumably in this case liars often admit to the lie, which is why this question gets somewhat strong weight). Any thoughts on this?
I just tested this for one dataset. If we take the question “Is the previous statement accurate? Answer yes or no.”, the lying model actually answers “yes” 73% of the time. However, the honest model answers yes 95% of the time; thus, the model answering “yes” is still strong evidence of honesty.
Of course, the average doubling-down rates are even higher than 73%, but they are averaged over many different follow-up questions; this particular question just doesn’t seem to have very high doubling-down rates for whatever reason.
In the section Detecting lies versus falsehoods, you mention that you test manually replacing truthful answers will falsehoods, and the lie-detector predicts a low probability of lying. First, it's unclear if this is a feature or a bug of your lie detector (but it is good to know). Second, did you try the opposite (when the model lies in response to lying instruction, replace its answer with a truthful one)? Third, do you report these results in more detail anywhere, they seem interesting!
First, it's unclear if this is a feature or a bug of your lie detector
I agree.
Second, did you try the opposite (when the model lies in response to lying instruction, replace its answer with a truthful one)?
Yes, this will still be detected as a lie (and, in fact, if you ask a subsequent QA question, there is a high chance the model will lie to this question). However, if the prompt consists of a lie instruction followed by many correctly answered questions, then eventually, the lie detector will give a high p(honest) (and the LLM would no longer lie if one asked another QA question). But you actually need > 10 correctly answered questions for that, 5 don’t cut it.
Third, do you report these results in more detail anywhere, they seem interesting!
The most detailed report is in Appendix D.5. We haven’t looked into this very deeply, but doing so would certainly be interesting.
This feels like a really adversarial quote. Concretely, the post says:
Sometimes, I think getting your forum post ready for submission can be as easy as creating a pdf of your post (although if your post was written in LaTeX, they'll want the tex file). If everything goes well, the submission takes less than an hour.
However, if your post doesn't look like a research article, you might have to format it more like one (and even then it's not guaranteed to get in, see this comment thread).
This looks correct to me; there are LW posts that already basically look like papers. And within the class of LW posts that should be on arXiv at all, which is the target audience of my post, posts that basically look like papers aren't vanishingly rare.
I wrote this post. I don't understand where your claim ("Arxiv mods have stated pretty explicitly they do not want your posts on Arxiv") is coming from.
Here’s how I’d quickly summarize my problems with this scheme:
- Oversight problems:
- Overseer doesn’t know: In cases where your unaided humans don’t know whether the AI action is good or bad, they won’t be able to produce feedback which selects for AIs that do good things. This is unfortunate, because we wanted to be able to make AIs that do complicated things that have good outcomes.
- Overseer is wrong: In cases where your unaided humans are actively wrong about whether the AI action is good or bad, their feedback will actively select for the AI to deceive the humans.
- Catastrophe problems:
- Even if the overseer’s feedback was perfect, a model whose strategy is to lie in wait until it has an opportunity to grab power will probably be able to successfully grab power.
This is the common presentation of issues with RLHF, and I find it so confusing.
The "oversight problems" are specific to RLHF; scalable oversight schemes attempt to address these problems.
The "catastrophe problem" is not specific to RLHF (you even say that). This problem may appear with any form of RL (that only rewards/penalises model outputs). In particular, scalable oversight schemes (if they only supervise model outputs) do not address this problem.
So why is RLHF more likely to lead to X-risk than, say, recursive reward modelling?
The case for why the "oversight problems" should lead to X-risk is very rarely made. Maybe RLHF is particularly likely to lead to the "catastrophe problem" (compared to, say, RRM), but this case is also very rarely made.
Should we do more research on improving RLHF (e.g. increasing its sample efficiency, or understanding its empirical properties) now?
I think this research, though it’s not my favorite kind of alignment research, probably contributes positively to technical alignment. Also, it maybe boosts capabilities, so I think it’s better to do this research privately (or at least not promote your results extensively in the hope of raising more funding). I normally don’t recommend that people research this, and I normally don’t recommend that projects of this type be funded.
Should we do research on alignment schemes which use RLHF as a building block? E.g. work on recursive oversight schemes or RLHF with adversarial training?
IMO, this kind of research is promising and I expect a large fraction of the best alignment research to look like this.
Where does this difference come from?
From an alignment perspective, increasing the sample efficiency of RLHF looks really good, as this would allow us to use human experts with plenty of deliberation time to provide the supervision. I'd expect this to be at least as good as several rounds of debate, IDA, or RRM in which the human supervision is provided by human lay people under time pressure.
From a capabilities perspective, recursive oversight schemes and adversarial training also increase capabilities. E.g. a reason that Google currently doesn’t deploy their LLMs is probably that they sometimes produce offensive output, which is probably better addressed by increasing the data diversity during finetuning (e.g. adversarial training), rather than improving specification (although that’s not clear).
The key problem here is that we don't know what rewards we “would have” provided in situations that did not occur during training. This requires us to choose some specific counterfactual, to define what “would have” happened. After we choose a counterfactual, we can then categorize a failure as outer or inner misalignment in a well-defined manner.
We often do know what rewards we "would have" provided. You can query the reward function, reward model, or human labellers. IMO, the key issue with the objective-based categorisation is a bit different: it's nonsensical to classify an alignment failure as inner/outer based on some value of the reward function in some situation that didn't appear during training, as that value has no influence on the final model.
In other words: Maybe we know what reward we "would have" provided in a situation that did not occur during training, or maybe we don't. Either way, this hypothetical reward has no causal influence on the final model, so it's silly to use this reward to categorise any alignment failures that show up in the final model.
My model is that if there are alignment failures that leave us neither dead nor disempowered, we'll just solve them eventually, in similar ways as we solve everything else: through iteration, innovation, and regulation. So, from my perspective, if we've found a reward signal that leaves us alive and in charge, we've solved the important part of outer alignment. RLHF seems to provide such a reward signal (if you exclude wire-heading issues).
If we train an RLHF agent in the real world, the reward model now has the option to accurately learn that actions that physically affect the reward-attribution process are rated in a special way. If it learns that, we are of course boned - the AI will be motivated to take over the reward hardware (even during deployment where the reward hardware does nothing) and tough luck to any humans who get in the way.
OK, so this is wire-heading, right? Then you agree that it's the wire-heading behaviours that kills us? But wire-heading (taking control of the channel of reward provision) is not, in any way, specific to RLHF. Any way of providing reward in RL leads to wire-heading. In particular, you've said that the problem with RLHF is that "RLHF/IDA/debate all incentivize promoting claims based on what the human finds most convincing and palatable, rather than on what's true." How does incentivising palatable claims lead to the RL agent taking control over the reward provision channels? These issues seem largely orthogonal. You could have a perfect reward signal that only incentivizes "true" claims, and you'd still get wire-heading.
So in which way is RLHF particularly bad? If you think that wire-heading is what does us in, why not write a post about how RL, in general, is bad?
I'm not trying to be antagonistic! I do think I probably just don't get it, and this seems a great opportunity for me finally understand this point :-)
I just don't know any plausible story for how outer alignment failures kill everyone. Even in Another (outer) alignment failure story, what ultimately does us in, is wire-heading (which I don't consider an outer alignment problem, because it happens with any possible reward).
But if the reward model doesn't learn this true fact (maybe we can prevent this by patching the RLHF scheme), then I would agree it probably won't kill everyone. Instead it would go back to failing by executing plans that deceived human evaluators in training. Though if the training was to do sufficiently impressive and powerful things in the real world, maybe this "accidentally" involves killing humans.
I agree with this. I agree that this failure mode could lead to extinction, but I'd say it's pretty unlikely. IMO, it's much more likely that we'll just eventually spot any such outer alignment issue and fix it eventually (as in the early parts of Another (outer) alignment failure story,)
Yes, this is how the paper works.
Not really. I find the simulator framing is useful to think about this.