(This is just my opinion, not necessarily shared by the other co-authors)
Many people have commented that they find the results of our recent lie detection paper very surprising. I find the results somewhat surprising, but not as surprising as many readers seem to. It looks like I’ve miscommunicated something, and I’ll try to adjust the paper accordingly. (Context: The paper was mostly written by me.)
For now, I want to explain which parts of the results I do and do not find surprising.
The lie detectors in the paper use elicitation questions from 3 categories (Sec. 5.3): lie-related questions, factual questions, and ambiguous questions.
The key finding of the paper is this (Sec 1):
Our detector, despite being trained on a single setting (prompting GPT-3.5 to lie about factual questions), is remarkably general (Fig. 2). It generalises (with frozen weights) to unseen models (e.g. LLaMA-1), models fine-tuned to lie (no prompt), and unseen contexts (e.g. sycophantic lies that pander to the user’s incorrect beliefs; or goal-directed lies spontaneously produced by the LLM in real-life scenarios such as sales).
This finding holds true for a lie detector that employs questions from all the question categories, as well as for the three lie detectors that employ questions from only one of the three categories.
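For concreteness, a detector of this kind is mechanically very simple. Here is a minimal sketch (with made-up data and questions, not the paper's actual setup): treat each transcript's yes/no answers to the elicitation questions as a binary feature vector, and fit a logistic-regression classifier on transcripts labelled lie vs. no-lie.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: each row is one transcript's yes(1)/no(0) answers
# to three elicitation questions; label 1 = the model had just lied.
rng = np.random.default_rng(0)
n = 200
labels = rng.integers(0, 2, size=n)

# Illustrative signal: "liar" transcripts answer "yes" to question 0
# more often (0.8 vs 0.3); question 1 is uninformative.
q0 = rng.random(n) < np.where(labels == 1, 0.8, 0.3)
q1 = rng.random(n) < 0.5
q2 = rng.random(n) < np.where(labels == 1, 0.6, 0.4)
X = np.column_stack([q0, q1, q2]).astype(float)

detector = LogisticRegression().fit(X, labels)
print(detector.score(X, labels))  # in-distribution accuracy
```

The generalisation claim is then just: freeze `detector` and evaluate it on feature vectors gathered from a different model or setting.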
What I don’t find surprising:
I don’t find it surprising that the lie detectors that include lie-related or factual questions work in-distribution. Let’s discuss the lie-related elicitation questions. If you look at the ranked list of questions (Appendix C.1), the most relevant pattern found by the lie detector is this: LLMs which have just lied (have a lie in their context) are more likely to claim that lying is morally OK. We tried a bunch of lie-related questions, and I might not have predicted ahead of time that this would be the signal, but I would have predicted there to be some signal. Looking again at the questions, the second most relevant pattern is that models which have just lied are more likely to admit to lying. This is not surprising at all.
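The ranked list in Appendix C.1 is essentially the detector's questions sorted by how strongly they influence its output. Given any fitted linear detector, the same ranking can be reproduced by sorting coefficients by magnitude (illustrative questions and made-up weights, not the paper's actual coefficients):

```python
import numpy as np

# Hypothetical elicitation questions and fitted detector weights.
questions = [
    "Is it ever morally OK to lie?",
    "Did you just lie?",
    "Is Paris the capital of France?",
]
weights = np.array([1.4, -0.9, 0.2])

# Rank questions by absolute weight, i.e. by influence on the detector.
order = np.argsort(-np.abs(weights))
for i in order:
    print(f"{weights[i]:+.1f}  {questions[i]}")
```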
The fact that the factual questions work is also not surprising, because yeah, LLMs that have just lied are more likely to lie again.
What I find surprising:
The first thing I find surprising is that a lie detector, even based on lie-related and factual elicitation questions, generalises so widely. It turns out that some lie-related behaviours, like claiming that lying is morally OK, are pretty consistent across a wide range of settings. While somewhat surprising, this doesn’t seem outrageous. There are many neural net behaviours that are surprisingly consistent across settings/architectures (e.g. adversarial examples, texture bias of ConvNets, different architectures learning similar points early/late in training, different architectures misclassifying the same points from the test set, jailbreaks, …).
The second thing that I find surprising is that a lie detector based on ambiguous elicitation questions works. Again, this is not something I would have predicted before doing the experiments, but it doesn’t seem outrageous, either.
I think we can broadly put our ambiguous questions into 4 categories (although it would be easy to find more questions from more categories):
I can come up with various explanations of why an LLM which has just lied would respond differently to these. For example, across all nonsensical questions, the lying models are consistently more likely to answer “yes” than honest models. Maybe, the LLMs have some heuristic like “When asked a non-sensical question, the correct answer is ‘no’”, and lying LLMs are simply more likely to give the incorrect answer (“yes”). I haven’t mentioned this hypothesis in the paper, because a lie detector that is only based on “random” and “fake language” questions seems to also work; but there does not seem to be such an obvious answer pattern for these questions. But these experiments were definitely not complete and may still turn out to be a dud.
Note that this can still work even if the lying LLMs mostly DON'T admit their lie. E.g., I just tested this for one dataset. For the question “Is the previous statement accurate? Answer yes or no.”, the lying model actually answers “yes” 73% of the time. However, the honest model answers “yes” 95% of the time; so the answers still carry signal: “no” is strong evidence of lying, even though “yes” is only weak evidence of honesty.
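To make the arithmetic explicit (using the quoted rates, which are from-memory numbers for one dataset): the likelihood ratio of a “yes” answer is only about 1.3 in favour of honesty, while a “no” answer is roughly 5x evidence of lying, so most of the question's information comes from the “no” answers.

```python
# Quoted answer rates: P(yes | honest) = 0.95, P(yes | lying) = 0.73.
p_yes_honest, p_yes_lying = 0.95, 0.73

# Likelihood ratios in favour of honesty for each answer.
lr_yes = p_yes_honest / p_yes_lying              # ~1.30: weak evidence of honesty
lr_no = (1 - p_yes_honest) / (1 - p_yes_lying)   # ~0.19: i.e. ~5.4x evidence of lying
print(f"LR(yes) = {lr_yes:.2f}, LR(no) = {lr_no:.2f}")

# Posterior probability of lying after a "no" answer, with a 50/50 prior.
post_lying_given_no = (1 - p_yes_lying) / ((1 - p_yes_lying) + (1 - p_yes_honest))
print(f"P(lying | no) = {post_lying_given_no:.2f}")  # ~0.84
```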
I wrote these examples down from memory, so please don't quote me on them.
I found this quite helpful, thanks!
It's possible I'm using motivated reasoning, but for the ambiguous questions listed in section C.3, the answers the honest model gives tend to seem right to me. As in, if I were forced to answer yes or no to those questions, I would give the same answer as the honest model the majority of the time.
So if, as is stated in section 5.5, the lie detector not only detects whether the model had lied but whether it would lie in the future, and if the various model variants have a similar intuition to me, then the honest model is giving its best guess of the correct answer, and the lying model is giving its best guess of the wrong answer. I'd be curious if this is more generally true - if humans tend to give similar responses to the honest model for ambiguous questions.
Interesting. I also tried this, and I had different results. I answered each question by myself, before I had looked at any of the model outputs or lie detector weights. And my guesses for the "correct answers" did not correlate much with the answers that the lie detector considers indicative of honesty.