AI ALIGNMENT FORUM
David Rein

Comments
When can we trust model evaluations?
David Rein · 2y

the existence of predicates on the world that are easier to evaluate than generate examples of (in the same way that verifying the answer to a problem in NP can be easier than generating it) guarantees that the model should be better at distinguishing between evaluation and deployment than any evaluator can be at tricking it into thinking it's in deployment


Where does the guarantee come from? Why do we know that for this specific predicate (whether the model is deployed), evaluating is easier than generating? For many problems the two are equally difficult, right?
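For readers unfamiliar with the NP analogy in the quoted passage, a toy subset-sum example (my own illustration, not from the original post) shows the asymmetry being invoked: checking a proposed certificate takes linear time, while producing one by brute force takes exponential time in the worst case.

```python
import itertools

def verify(nums, subset, target):
    # Verification: confirm the certificate uses elements of nums
    # and sums to the target -- linear in len(subset).
    return set(subset) <= set(nums) and sum(subset) == target

def generate(nums, target):
    # Generation: brute-force search over all 2^n subsets,
    # exponential in len(nums) in the worst case.
    for r in range(len(nums) + 1):
        for combo in itertools.combinations(nums, r):
            if sum(combo) == target:
                return list(combo)
    return None

nums = [3, 34, 4, 12, 5, 2]
cert = generate(nums, 9)      # expensive search
assert verify(nums, cert, 9)  # cheap check
```

The comment's question is whether the deployed-vs-evaluated predicate actually has this structure: the asymmetry is real for problems like subset-sum, but nothing obviously places "am I deployed?" in that class.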
