I'm curious whether AUP or the autoencoder/random projection does more of the work here. Did you test how well AUP and AUP_proj perform with a discount factor of 0 for the AUP Q-functions?
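To make the question concrete: with a discount factor of 0, a Q-function collapses to the immediate reward, so an AUP-style penalty would only compare one-step auxiliary rewards. Below is a minimal tabular sketch of that effect, not the setup from the post. The random MDP, the names `q_values` and `aup_penalty`, the penalty form |Q(s, a) - Q(s, no-op)|, and the choice of action 0 as the no-op are all assumptions made for illustration.

```python
import numpy as np

def q_values(P, R, gamma, iters=500):
    """Tabular Q-value iteration. P: (S, A, S) transitions, R: (S, A) rewards."""
    Q = np.zeros_like(R, dtype=float)
    for _ in range(iters):
        V = Q.max(axis=1)          # greedy state values, shape (S,)
        Q = R + gamma * (P @ V)    # Bellman backup, shape (S, A)
    return Q

def aup_penalty(Q, s, a, noop=0):
    """Assumed AUP-style penalty: change in auxiliary Q-value relative to a no-op."""
    return abs(Q[s, a] - Q[s, noop])

# Hypothetical random MDP and auxiliary reward, for illustration only.
rng = np.random.default_rng(0)
S, A = 4, 3
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)  # normalise each row into a distribution
R_aux = rng.random((S, A))

Q0 = q_values(P, R_aux, gamma=0.0)  # equals R_aux: no lookahead at all
Q9 = q_values(P, R_aux, gamma=0.9)  # includes discounted future auxiliary reward

print(aup_penalty(Q0, s=1, a=2), aup_penalty(Q9, s=1, a=2))
```

In this sketch the `gamma=0.0` penalty depends only on the immediate auxiliary rewards of the action and the no-op, so comparing it against the discounted version would isolate how much the lookahead contributes relative to the choice of auxiliary reward.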
You don't even need a catastrophe in any global sense. Disrupting the training procedure at step t should be sufficient.
"My intuition is that there will be a class of questions where debate is definitely safe, a class where it is unsafe, and a class where some questions are safe, some unsafe, and we don’t really know which are which."
Interesting. Do you have some examples of types of questions you expect to be safe, or of potential features of safe questions? Is it mostly about the downstream consequences that answers would have, or more about the instrumental goals that the questions induce for debaters?
I like the insight that offsetting is not always bad and the idea of dealing with the bad cases using the task reward. State-based reward functions that capture whether or not the task is currently done also intuitively seem like the correct way of specifying rewards in cases where achieving the task does not end the episode.
I am a bit confused about the section on the Markov property: I was imagining that the reason you want the property is to make applying standard RL techniques more straightforward (or to avoid making already existing partial observability more complicated). However, if I understand correctly, the second modification makes the (expectation of the) penalty a function of the complete agent policy, and I don't really see how that would help. Is there another reason to want the Markov property, or am I missing some way in which the modification would simplify applying RL methods?
I wonder what happens to the subagent problem with a random action as baseline: in the current sense, building a subagent roughly works by reaching a state s_{t+1} where
for all auxiliary rewards R, where π∗ is the optimal policy according to the main reward; while making sure that there exists an action a_R such that
for every R. So while building a subagent in that way is still feasible, the agent would be forced to either receive a large penalty or give the subagent random orders at t+1.
Probably, there is a way to circumvent this again, though? Also, I am unsure about the other properties of randomized baselines.
Also, the equation seems to imply
Edit: I focused too much on what I suppose is a typo. Clearly you can just rewrite the first and last equality as equality of an affine linear function
at two points, which gives you equality everywhere.
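For reference, the general fact being used can be written out as follows. This is a sketch in generic notation: f, g, x_1, x_2 and the coefficients are placeholders rather than the post's symbols, and it assumes the functions are affine in a single real variable.

```latex
Let $f(x) = a x + b$ and $g(x) = c x + d$, and suppose
$f(x_1) = g(x_1)$ and $f(x_2) = g(x_2)$ with $x_1 \neq x_2$. Then
\begin{align*}
  (a - c)\,x_1 + (b - d) &= 0, \\
  (a - c)\,x_2 + (b - d) &= 0.
\end{align*}
Subtracting the second line from the first gives $(a - c)(x_1 - x_2) = 0$,
so $a = c$ because $x_1 \neq x_2$. Substituting back gives $b = d$,
hence $f(x) = g(x)$ for all $x$.
```

The special case where g is constant covers an affine function that takes the same value at two distinct points and is therefore constant.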
I do not understand your proof for proposition 2.