Do mesa-optimizer risk arguments rely on the train-test paradigm?

You don't even need a catastrophe in any global sense. Disrupting the training procedure at step t should be sufficient.

AI Unsafety via Non-Zero-Sum Debate

"My intuition is that there will be a class of questions where debate is definitely safe, a class where it is unsafe, and a class where some questions are safe, some unsafe, and we don’t really know which are which."

Interesting. Do you have some examples of types of questions you expect to be safe, or potential features of safe questions? Is it mostly about the downstream consequences that answers would have, or more about instrumental goals that the questions induce for debaters?

I haven't yet thought about this in much detail, but here is what I have:
I will assume you can avoid getting "hacked" while overseeing the debate. If you
don't assume that, then it might be important whether you can differentiate
between arguments that are vs aren't relevant to the question at hand. (I
suppose that it is much harder to get hacked when strictly sticking to a
specific subject-matter topic. And harder yet if you are, e.g., restricted to
answering in math proofs, which might be sufficient for some types of
questions.)
As for the features of safe questions, I think that one axis is the potential
impact of the answer and an orthogonal one is the likelihood that the answer
will be undesirable/misaligned/bad. My guess is that if you can avoid getting
hacked, then the lower-impact-of-downstream-consequences questions are
inherently safer (for the trivial reason of being less impactful). But this
feels like a cheating answer, and the second axis seems more interesting.
My intuition about the "how likely are we to get an aligned answer" axis is
this: There are questions where I am fairly confident in our judging skills (for
example, math proofs). Many of those could fall into the "definitely safe"
category. Then there is the other extreme of questions where our judgement might
be very fallible - things that are too vague or that play into our biases. (For
example hard philosophical questions and problems whose solutions depend on
answers to such questions. E.g., I wouldn't trust myself to be a good judge of
"how should we decide on the future of the universe" or "what is the best place
for me to go for a vacation".) I imagine these are "very likely unsafe". And as
a general principle, where there are two extremes, there often will be a
continuum in between. Maybe "what is a reasonable way of curing cancer?" could
fall here? (Being probably safe, but I wouldn't bet all my money on it.)

Tradeoff between desirable properties for baseline choices in impact measures

I like the insight that offsetting is not always bad and the idea of dealing with the bad cases using the task reward. State-based reward functions that capture whether or not the task is currently done also intuitively seem like the correct way of specifying rewards in cases where achieving the task does not end the episode.

I am a bit confused about the section on the Markov property: I was imagining that the reason you want the property is to make applying standard RL techniques more straightforward (or to avoid making already existing partial observabi... (read more)

Thanks Flo for pointing this out. I agree with your reasoning for why we want
the Markov property. For the second modification, we can sample a rollout from
the agent policy rather than computing a penalty over all possible rollouts. For
example, we could randomly choose an integer N, roll out the agent policy and
the inaction policy for N steps, and then compare the resulting states. This
does require a complete environment model (which does make it more complicated
to apply standard RL), while inaction rollouts only require a partial
environment model (predicting the outcome of the noop action in each state). If
you don't have a complete environment model, then you can still use the first
modification (sampling a baseline state from the inaction rollout).
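The sampled-rollout idea described above can be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: `step` stands in for a complete environment model, `agent_policy` and `noop_action` are hypothetical names, and `deviation` is a placeholder for whatever state comparison the impact measure actually uses.

```python
import random


def sampled_rollout_pair(step, agent_policy, noop_action, state, max_horizon, rng):
    """Roll out the agent policy and the inaction policy for a random N steps.

    `step(state, action) -> next_state` is an assumed complete environment
    model; the comment above notes that this variant requires one.
    """
    # Randomly choose the rollout length N (inclusive on both ends).
    n = rng.randint(1, max_horizon)
    agent_state = state
    inaction_state = state
    for _ in range(n):
        # Agent rollout: follow the agent's own policy.
        agent_state = step(agent_state, agent_policy(agent_state))
        # Inaction rollout: always take the noop action.
        inaction_state = step(inaction_state, noop_action)
    return n, agent_state, inaction_state


def deviation(agent_state, inaction_state):
    """Placeholder state comparison (assumption): absolute difference."""
    return abs(agent_state - inaction_state)


# Toy usage on an integer line: actions are added to the state, noop is 0.
rng = random.Random(0)
n, s_agent, s_noop = sampled_rollout_pair(
    step=lambda s, a: s + a,
    agent_policy=lambda s: 1,  # hypothetical agent that always moves right
    noop_action=0,
    state=5,
    max_horizon=10,
    rng=rng,
)
penalty = deviation(s_agent, s_noop)
```

In this toy setting the agent moves one unit per step while the inaction rollout stays put, so the sampled penalty equals the sampled horizon N. Averaging over several sampled N would trade extra model calls for a lower-variance estimate.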

Attainable Utility Preservation: Scaling to Superhuman

I wonder what happens to the subagent problem with a random action as baseline: In the current sense, building a subagent roughly works by reaching a state where

for all auxiliary rewards R, where is the optimal policy according to the main reward; while making sure that there exists an action such that

for every . So while building a subagent in that way is still feasible, the agent would be forced to either receive a large penalty or give the ... (read more)

What do you mean by "for all R"?
The random baseline is an idea I think about from time to time, but usually I
don't dwell because it seems like the kind of clever idea that secretly goes
wrong somehow? It depends whether the agent has any way of predicting what the
random action will be at a future point in time.
If it can predict it, I'd imagine that it might find a way to gain a lot of
power by selecting a state whose randomly selected action is near-optimal.
Because of the denominator, it would still be appropriately penalized for
performing better than the randomly selected action, but it won't receive a
penalty for choosing an action with expected optimal value just below the
near-optimal action.

How Low Should Fruit Hang Before We Pick It?

Where does

come from?

Also, the equation seems to imply

Edit: I focused too much on what I suppose is a typo. Clearly you can just rewrite the first and last equality as equality of an affine linear function

at two points, which gives you equality everywhere.
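The general fact being used here can be spelled out as a short derivation. The symbols $f$, $g$, $a$, $b$, $c$, $d$, $x_1$, $x_2$ are generic stand-ins chosen for illustration, not the notation of the post's proof.

```latex
\begin{align*}
  &\text{Let } f(x) = a x + b,\quad g(x) = c x + d,\quad
    f(x_1) = g(x_1),\quad f(x_2) = g(x_2),\quad x_1 \neq x_2. \\
  &\text{Subtracting the two equalities: } (a - c)(x_1 - x_2) = 0
    \;\Longrightarrow\; a = c. \\
  &\text{Substituting back: } a x_1 + b = a x_1 + d
    \;\Longrightarrow\; b = d. \\
  &\text{Hence } f(x) = g(x) \text{ for all } x.
\end{align*}
```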

Oops, you’re right. I fixed the proof.

How Low Should Fruit Hang Before We Pick It?

I do not understand your proof for proposition 2.

Is there any particular part of it that seems locally invalid? Can you be a
little more specific about what’s confusing?

I'm curious whether AUP or the autoencoder/random projection does more work here. Did you test how well AUP and AUP_proj do with a discount factor of 0 for the AUP Q-functions?