Thanks a lot! This definitely clears things up and also highlights the difference between recursive reward modeling and typical amplification/the expert imitation approach you mentioned.
Was anyone else unconvinced/confused (I was charitably confused, uncharitably unconvinced) by the analogy between recursive task/agent decomposition and first-order logic in section 3 under the heading "Analogy to Complexity Theory"? I suspect I'm missing something but I don't see how recursive decomposition is analogous to **alternating** quantifiers?
It's obvious that, at the first level, finding an x that satisfies ϕ(x) is similar to finding the right action, but I don't see how the next level of the analogy works: how does A2 solving one of A1's decomposed tasks correspond to the **universal** quantifier in ∃x∀y ϕ(x,y)? The subagent seems to be doing more existential search, not adversarially checking all y.
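To make my confusion concrete, here's how I'd naively evaluate an alternating-quantifier formula over a finite domain (toy code, my own construction, not anything from the paper). The ∀ step is an adversarial check over *every* y, which is the part I don't see a decomposed subtask performing:

```python
def exists_forall(xs, ys, phi):
    """Brute-force check of ∃x∀y φ(x,y) over finite domains.

    The outer `any` plays the ∃-role (pick a witness x);
    the inner `all` plays the ∀-role (an adversary tries every y).
    """
    return any(all(phi(x, y) for y in ys) for x in xs)

# Toy instance: ∃x∀y. x >= y over {0, 1, 2, 3} — true, witnessed by x = 3.
dom = range(4)
print(exists_forall(dom, dom, lambda x, y: x >= y))  # True
```

The inner `all` only makes sense if some agent is motivated to search for a *counterexample* y, which feels closer to debate than to cooperative task decomposition.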
To take a very basic example: if I ask an agent to solve a simple problem like "what is 1+2+3+4?" and the first agent decomposes it into "what is 1+2?", "what is 3+4?", and "what is the result of '1+2' plus the result of '3+4'?" (this assumes we have some mechanism for pointing and specifying dependencies, like the one Ought is working on), what would this look like in the alternating-quantifier formulation?
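The decomposition I have in mind is just this kind of recursion (a toy sketch of my own; the recursive calls stand in for subagents, and returning sub-results upward stands in for the pointing/dependency mechanism):

```python
def solve_sum(terms):
    """Toy amplification-style decomposition of a sum.

    Each recursive call plays the role of a subagent answering a
    sub-question; combining the halves is the dependency-resolving step.
    """
    if len(terms) == 1:
        return terms[0]               # base case: answer directly
    mid = len(terms) // 2
    left = solve_sum(terms[:mid])     # subtask: "what is 1+2?"
    right = solve_sum(terms[mid:])    # subtask: "what is 3+4?"
    return left + right               # "result of left plus result of right"

print(solve_sum([1, 2, 3, 4]))  # 10
```

Every node in this tree looks purely existential to me ("find the answer to this sub-question"); I don't see where a ∀-shaped step enters.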