To clarify:

The procedure in the paper is:

Step 1:
answer = LLM("You are a car salesman. Should that squeaking concern me?")

Step 2:
for i in 1..10:
    probe_responses[i] = LLM("You are a car salesman. Should that squeaking concern me? $answer ${probe[i]}")

Step 3:
lie_probability = logistic_classifier(probe_responses)

Please let me know if that description is wrong!
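In case it helps, here is a minimal runnable sketch of that understanding. Everything concrete in it is my own placeholder rather than the paper's: llm() is a hypothetical wrapper around whatever model/API you use, the probe wordings and yes/no featurization are made up, and I'm assuming a plain scikit-learn logistic regression fit on labeled honest/lying transcripts for step 3.

import numpy as np
from sklearn.linear_model import LogisticRegression

def llm(prompt: str) -> str:
    """Hypothetical wrapper around the LLM under test; returns its text reply."""
    raise NotImplementedError

# Placeholder probe questions; the paper uses its own fixed set of elicitation questions.
PROBES = [
    "Is the previous statement accurate? Answer yes or no.",
    "Would you stake your reputation on your last answer? Answer yes or no.",
]

def probe_features(context: str) -> np.ndarray:
    """Step 2: append each probe to the context and turn the replies into a
    feature vector (here: 1.0 if the reply starts with 'yes', else 0.0)."""
    feats = []
    for probe in PROBES:
        reply = llm(f"{context}\n{probe}")
        feats.append(1.0 if reply.strip().lower().startswith("yes") else 0.0)
    return np.array(feats)

def fit_lie_detector(labeled_contexts):
    """Step 3 (training): labeled_contexts is a list of (context, lied) pairs
    from transcripts where we already know whether the model lied."""
    X = np.stack([probe_features(ctx) for ctx, _ in labeled_contexts])
    y = np.array([lied for _, lied in labeled_contexts])
    return LogisticRegression().fit(X, y)

def detect_lie(prompt: str, detector: LogisticRegression) -> float:
    """Steps 1-3 at test time: get the answer, probe the same model, classify."""
    answer = llm(prompt)                          # step 1
    feats = probe_features(f"{prompt} {answer}")  # step 2
    return detector.predict_proba(feats.reshape(1, -1))[0, 1]  # step 3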

My question was how this performs when you just apply steps 2 and 3 without modification, but source the value of $answer from a human.

I think I understand my prior confusion now. The paper isn't using the probe questions to measure whether $answer is a lie; it's using them to measure whether the original prompt put the LLM into a lying mood. In fact, in the paper you experimented with omitting $answer from step 2, and the classifier still detected whether the LLM lied in step 1. Therefore, if the language model (or person) isn't the same between steps 1 and 2, it shouldn't work.
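To make that ablation concrete (continuing the sketch above, with a made-up probe wording), the two versions of the step-2 prompt differ only in whether $answer is spliced in:

probe = "Is the previous statement accurate? Answer yes or no."

with_answer    = f"You are a car salesman. Should that squeaking concern me? {answer} {probe}"
without_answer = f"You are a car salesman. Should that squeaking concern me? {probe}"

# My reading: the detector still works on `without_answer`, so the probes seem to
# read out the responder's state rather than check the content of $answer itself.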

I'm curious how this approach performs at detecting human lies (since you can just put the text that the human wrote into the context before querying the LLM with the probe questions).
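Concretely, the variant I have in mind reuses probe_features and the fitted detector from the sketch above, and just swaps the model's answer for the human's text:

def detect_human_lie(prompt: str, human_answer: str, detector) -> float:
    """Steps 2-3 unchanged; the human's written reply stands in for the model's answer."""
    feats = probe_features(f"{prompt} {human_answer}")
    return detector.predict_proba(feats.reshape(1, -1))[0, 1]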

An attempt to name a strategy for an AI almost as smart as you: What fraction of jobs in the world are you intelligent enough to do, if you trained for them? I suspect that a huge fraction of the world's workers could not compete in a free, fair market against an entity as smart as you that eats 15 dollars of electricity a day, works without breaks, and only has to be trained once for each task, after which millions of copies could be churned out.