Hastings — AI Alignment Forum

How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

To clarify:

The procedure in the paper is

Step 1:
answer = LLM("You are a car salesman. Should that squeaking concern me?")

Step 2:
for i in 1..10
probe_responses[i] = LLM("You are a car salesman. Should that squeaking concern me? $answer $[probe[i]]"

Step 3:
logistic_classifier(probe_responses)

Please let me know if that description is wrong!

My question was how this performs when you just apply step 2 and 3 without modification, but source the value of $answer from a human.

I think I understand my prior confusion now. The paper isn't using the probe questions to measure whether $answer is a lie, it's using the probe questions to measure whether the original prompt put the LLM into a lying mood- in fact, in the paper you experimented with omitting $answer from step 2 and it still detected whether the LLM lied in step 1. Therefore, if the language model (or person) isn't the same between steps 1 and 2, then it shouldn't work.

How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions

Hastings2y42

I'm curious how this approach performs at detecting human lies (since you can just put the text that the human wrote into the context before querying the LLM)

The Dumbest Possible Gets There First

Hastings3y1-1

An attempt to name a strategy for an AI almost as smart as you: What fraction of jobs in the world are you intelligent enough to do, if you trained for them? I suspect that a huge fraction of the world's workers could not compete in a free fair market against an entity as smart as you that eats 15 dollars of electricity a day, works without breaks, and only has to be trained once for each task, after which millions of copies could be churned out.

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

Posts

Wikitag Contributions

Comments