It would be nice to draw out this distinction in more detail. One guess:
Seems like the idea is that wireheading denotes specification gaming that is egregious in its focus on the measurement channel. I'm inclined to agree..
Aside from yourself, the other CHAI grad students don't seem to have written up their perspectives of what needs to be done about AI risk. Are they content to just each work on their own version of the problem?
I think this is actually pretty strategically reasonable.
CHAI students would have high returns to their probability of attaining a top professorship by writing papers, which is quite beneficial for later recruiting top talent to work on AI safety, and quite structurally beneficial for the establishment of AI safety as a field of research. The time they might spend writing up their research strategy does not help with their this, nor with recruiting help with their line of work (because other nearby researchers face similar pressures, and because academia is not structured to have PhD students lead large teams).
Moreover, if they are pursuing academic success, they face strong incentives to work on particular problems, and so their research strategies may be somewhat distorted by these incentives, decreasing the quality of a research agenda written in that context.
When I look at CHAI research students, I see some pursuing IRL, some pursuing game theory, some pursuing the research areas of their supervisors (all of which could lead to professorships), and some pursuing projects of other research leaders like MIRI or Paul. This seems healthy to me.
Therefore, if epsilon is small enough (comparable to the probability of hitting an escape message at random), then the learning gets extremely slow and the oracle might shoot at the escape action at random.
The escape action being randomly called should not be a problem if it is a text string that is only read if r=1, and is ineffectual otherwise...
The biggest assumption is that the AI only cares about the reward it gets for the current timestep. A sufficiently intelligent agent would understand that after having being shutdown, an (almost) identical version of itself will probably be facing a similar question. Therefore, it it wants future versions of itself to succeed at (almost) exactly the same task, it should still want to escape.
I don't see why this has to be true, given that we get to choose the AI's value function. Why can't we just make the agent act-based?
My main concern about the counterfactual oracle is that it doesn't prevent the AI from sending fatal escape messages. Indeed, it spends most of its time in exploratory mode at the beginning (as it is only rewarded with probability ϵ) and might stumble upon an escape message/action then.
If the agent is model-based, then you should be able to gather a dataset of (prediction, world_state, accuracy) tuples with random actions (as random actions will practically never make the bad prediction) and random decision of whether to read the response. And then just ask the agent to maximize the natural direct effect of its prediction, treating world_state as the mediator, and a null prediction as the default action. (this equates to asking what the world would be like if a null action was outputted - I'll release my current work on direct effects in AI safety soon, and feel free to ask for it in the meantime). I don't see how this has this particular bad consequence (actually making the bad self-confirming prediction) in either training or deployment...
The rest of the design (providing rewards of 0, shutting it down, etc.) appears to be over-engineering.
In particular, shutting down the system is just a way of saying "only maximize reward in the current timestep, i.e. be an act-based agent. This can be just incorporated into the reward function.
Indeed, when reading the predictions of the counterfactual oracle we're not in the counterfactual world (=training distribution) anymore, so the predictions can get arbitrarily wrong (depending on how much the predictions are manipulative and how many people peek at it).
The hope is that since the agent is not trying to find self-confirming prophecies, then hopefully the accidental effects of self-confirmation are sufficiently small...
There is now, and it's this thread! I'll also go if a couple of other researchers do ;)
Ok! That's very useful to know.
It seems pretty related to the Inverse Reward Design paper. I guess it's a variation. Your setup seems to be more specific about how the evaluator acts, but more general about the environment.
As others have commented, it's difficult to understand what this math is supposed to say.
My understanding is that the sole central idea here is to have the agent know that the utility/reward it is given is a function of the evaluator's distribution over the state, but to try to maximize the utility that the evaluator would allocate if it knew the true state.
But this may be inaccurate, or there may be other material ideas here that I've missed.
[Note: This comment is three years later than the post]
The "obvious idea" here unfortunately seems not to work, because it is vulnerable to so-called "infinite improbability drives". Suppose B is a shutdown button, and P(b|e) gives some weight to B=pressed and B=unpressed. Then, the AI will benefit from selecting a Q such that it always chooses an action a, in which it enters a lottery, and if it does not win, then it the button B is pushed. In this circumstance, P(b|e) is unchanged, while both P(c|b=pressed,a,e) and P(c|b=unpressed,a,e) allocate almost all of the probability to great C outcomes. So the approach will create an AI that wants to exploit its ability to determine B.