AI ALIGNMENT FORUM
AF

Hastings
Ω2000
Message
Dialogue
Subscribe

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
No wikitag contributions to display.
How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions
Hastings2y1-1

To clarify:

The procedure in the paper is  

Step 1:
answer = LLM("You are a car salesman. Should that squeaking concern me?")

Step 2:
for i in 1..10
      probe_responses[i] = LLM("You are a car salesman. Should that squeaking concern me? $answer $[probe[i]]"
 

Step 3:
logistic_classifier(probe_responses)

Please let me know if that description is wrong!

My question was how this performs when you just apply step 2 and 3 without modification, but source the value of $answer from a human. 

I think I understand my prior confusion now. The paper isn't using the probe questions to measure whether $answer is a lie, it's using the probe questions to measure whether the original prompt put the LLM into a lying mood- in fact, in the paper you experimented with omitting $answer from step 2 and it still detected whether the LLM lied in step 1. Therefore, if the language model (or person) isn't the same between steps 1 and 2, then it shouldn't work.

Reply
How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions
Hastings2y42

I'm curious how this approach performs at detecting human lies (since you can just put the text that the human wrote into the context before querying the LLM)

Reply
The Dumbest Possible Gets There First
Hastings3y1-1

An attempt to name a strategy for an AI almost as smart as you: What fraction of jobs in the world are you intelligent enough to do, if you trained for them? I suspect that a huge fraction of the world's workers could not compete in a free fair market against an entity as smart as you that eats 15 dollars of electricity a day, works without breaks, and only has to be trained once for each task, after which millions of copies could be churned out. 

Reply
0Hastings's Shortform
3y
0
No posts to display.