All of NaiveTortoise's Comments + Replies

New safety research agenda: scalable agent alignment via reward modeling

Thanks a lot! This definitely clears things up and also highlights the difference between recursive reward modeling and typical amplification/the expert imitation approach you mentioned.

New safety research agenda: scalable agent alignment via reward modeling

Was anyone else unconvinced/confused (I was charitably confused, uncharitably unconvinced) by the analogy between recursive task/agent decomposition and first-order logic in section 3 under the heading "Analogy to Complexity Theory"? I suspect I'm missing something, but I don't see how recursive decomposition is analogous to **alternating** quantifiers?
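
To make "alternating quantifiers" concrete, the kind of first-order sentences I have in mind look like (my notation, not necessarily the paper's):

$$\exists x\, \phi(x), \qquad \exists x\, \forall y\, \phi(x, y), \qquad \exists x\, \forall y\, \exists z\, \phi(x, y, z), \; \ldots$$

with one more alternation of quantifiers added at each level.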

It's obvious that, at the first level, finding an $x$ that satisfies $\phi(x)$ is similar to finding the right action, but I don't see how finding $x$ and $y$ that satisfy $\phi(x, y)$ ... (read more)

Thanks for your question! I suspect there is some confusion going on here with what recursive reward modeling is. The example that you describe sounds like an example from imitating expert reasoning.

In recursive reward modeling, agent $A_k$ is not decomposing tasks; it is trying to achieve some objective that the user intends for it to perform. $A_{k-1}$ then assists the human in evaluating $A_k$'s behavior in order to train a reward model. Decomposition only happens in the evaluation of $A_k$'s task.

For example, $A_k$ proposes some plan and $A_{k-1}$ proposes the largest weakness... (read more)
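
To make the structural difference concrete, here is a purely schematic sketch (my own naming and pseudocode-level simplification, not code from the paper) of where $A_{k-1}$ enters the training loop:

```python
# Schematic sketch of the recursion described above; all class and function
# names are mine, not from the paper.  The point is structural: A_k works on
# the user's task directly, and A_{k-1} appears only as an *evaluation
# assistant* for the human, never as a task decomposer.

class Agent:
    def act(self, task):
        return f"plan for {task}"                  # A_k proposes a plan for the task
    def critique(self, behavior):
        return f"largest weakness of: {behavior}"  # A_{k-1} points out the biggest weakness
    def update(self, behavior, reward):
        pass                                       # placeholder for the RL update

class RewardModel:
    def update(self, behavior, judgement):
        pass                                       # placeholder: supervised training on feedback
    def reward(self, behavior):
        return 0.0

def human_judge(behavior, critique):
    return 1.0                                     # stand-in for the human's (assisted) evaluation

def train_agent(k, task, steps=10):
    """Train A_k by recursive reward modeling (schematic)."""
    assistant = train_agent(k - 1, task, steps) if k > 0 else None   # A_{k-1}
    agent, reward_model = Agent(), RewardModel()
    for _ in range(steps):
        behavior = agent.act(task)                                   # A_k attempts the whole task
        critique = assistant.critique(behavior) if assistant else None
        judgement = human_judge(behavior, critique)                  # human evaluates, helped by A_{k-1}
        reward_model.update(behavior, judgement)                     # reward model learns from that evaluation
        agent.update(behavior, reward_model.reward(behavior))        # A_k is trained on the learned reward
    return agent

train_agent(k=2, task="book a holiday")
```

The contrast with imitating expert reasoning (and with typical amplification) is that nothing in this loop ever splits the user's task into subtasks for the agents; the recursion lives entirely in how the evaluation signal is produced.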

Paul Christiano
Finding the action $a$ that optimizes a reward function $r(a)$ is NP-complete for general $r$. If the reward function $r$ is itself able to use an oracle for NP, then that's complete for $\mathrm{NP}^{\mathrm{NP}}$, and so on. The analogy is loose because you aren't really getting the optimal $a$ at each step.
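
Spelling this correspondence out a little (my rendering of the standard hierarchy, using the decision version "is there an $a$ with $r(a) \ge c$?"):

$$
\begin{aligned}
r \text{ polynomial-time} &\;\Rightarrow\; \text{deciding } \exists a.\, r(a) \ge c \text{ is in } \mathrm{NP} = \Sigma_1^P,\\
r \text{ with an } \mathrm{NP} \text{ oracle} &\;\Rightarrow\; \mathrm{NP}^{\mathrm{NP}} = \Sigma_2^P,\\
r \text{ with a } \Sigma_{k-1}^P \text{ oracle} &\;\Rightarrow\; \Sigma_k^P,
\end{aligned}
$$

and $\Sigma_k^P$ is exactly the class of problems of the form $\exists a_1\, \forall a_2 \cdots Q a_k\; \phi(a_1, \ldots, a_k)$ with $\phi$ polynomial-time, which is where the alternating quantifiers come back in.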