AI ALIGNMENT FORUM
David Rein

Comments
When can we trust model evaluations?
David Rein · 2y

the existence of predicates on the world that are easier to evaluate than generate examples of (in the same way that verifying the answer to a problem in NP can be easier than generating it) guarantees that the model should be better at distinguishing between evaluation and deployment than any evaluator can be at tricking it into thinking it's in deployment


Where does the guarantee come from? Why do we know that for this specific predicate (whether the model is deployed), evaluating is easier than generating? For many problems the two are equally difficult, right?
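For readers unfamiliar with the NP analogy in the quoted passage, a toy subset-sum example (my own illustration, not from the original post) shows the asymmetry being invoked: checking a proposed certificate takes linear time, while producing one by brute force takes exponential time in the worst case.

```python
import itertools

def verify(nums, subset, target):
    # Verification: confirm the certificate uses elements of nums
    # and sums to the target -- linear in len(subset).
    return set(subset) <= set(nums) and sum(subset) == target

def generate(nums, target):
    # Generation: brute-force search over all 2^n subsets,
    # exponential in len(nums) in the worst case.
    for r in range(len(nums) + 1):
        for combo in itertools.combinations(nums, r):
            if sum(combo) == target:
                return list(combo)
    return None

nums = [3, 34, 4, 12, 5, 2]
cert = generate(nums, 9)      # expensive search
assert verify(nums, cert, 9)  # cheap check
```

The comment's question is whether the deployed-vs-evaluated predicate actually has this structure: the asymmetry is real for problems like subset-sum, but nothing obviously places "am I deployed?" in that class.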
