NaiveTortoise

New safety research agenda: scalable agent alignment via reward modeling

Was anyone else unconvinced/confused (I was charitably confused, uncharitably unconvinced) by the analogy between recursive task/agent decomposition and first-order logic in section 3 under the heading "Analogy to Complexity Theory"? I suspect I'm missing something, but I don't see how recursive decomposition is analogous to **alternating** quantifiers.

It's obvious that, at the first level, finding an that satisfies is similar to finding the right action, but I don't see how finding and that satisfy is similar to 's solving of one of 's decomposed tasks is similar to universal quantification.

To take a very basic example, if I ask an agent to solve a simple problem like "what is 1+2+3+4?" and the first agent decomposes it into "what is 1+2?", "what is 3+4?", and "what is the result of '1+2' plus the result of '3+4'?" (this assumes we have some mechanism for pointing and specifying dependencies, like the one Ought is working on), what would this look like in the alternating quantifier formulation?
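To make the example concrete, here is a minimal sketch of the decomposition I have in mind. It is only an illustration: the `Subtask` class, the `depends_on` pointers, and the `run` helper are hypothetical names I made up, not anything from the paper or from Ought's tooling, and the "agents" are just plain functions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass
class Subtask:
    """One decomposed question (hypothetical structure, for illustration only)."""
    question: str
    depends_on: Tuple[str, ...]   # names of subtasks whose results this one needs
    solve: Callable[..., int]     # stand-in for the sub-agent answering the question


def decompose_sum() -> Dict[str, Subtask]:
    # The first agent's decomposition of "what is 1+2+3+4?".
    return {
        "left": Subtask("what is 1+2?", (), lambda: 1 + 2),
        "right": Subtask("what is 3+4?", (), lambda: 3 + 4),
        "total": Subtask(
            "what is the result of 'left' plus the result of 'right'?",
            ("left", "right"),
            lambda left, right: left + right,
        ),
    }


def run(tasks: Dict[str, Subtask], name: str,
        cache: Optional[Dict[str, int]] = None) -> int:
    # Resolve a subtask by first resolving everything it points to.
    if cache is None:
        cache = {}
    if name not in cache:
        task = tasks[name]
        args = [run(tasks, dep, cache) for dep in task.depends_on]
        cache[name] = task.solve(*args)
    return cache[name]


answer = run(decompose_sum(), "total")
```

Here every subtask is simply solved and combined, which is why I don't see where a universal quantifier would enter.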

Thanks a lot! This definitely clears things up and also highlights the difference between recursive reward modeling and typical amplification/the expert imitation approach you mentioned.