Standard ML Oracles vs Counterfactual ones

Rohin Shah (7y)

It's not clear to me why the BFO would converge to a fixed point of $μ$. If we've solved the problem of embedded agency and the AI system knows that $y_{t}$ can depend on its prediction $z_{t}$, then it would tend to find a fixed point, but it could also do the sort of counterfactual reasoning you say it can't do. If we haven't solved embedded agency, then it seems like the hypothesis that best explains the data posits some other classifier $h$ that works the same way the AI did in past timesteps, with $y_{t} = μ(h(x_{t})) + v(h(x_{t}))$. Intuitively, this is saying that the past data is explained by a hypothetical other classifier that worked the same way as the AI used to, and now the AI thinks one level higher than that. This probably does converge to a fixed point eventually, but at any given timestep the best hypothesis would be a finite number of applications of $μ$ and $v$.
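A minimal sketch of the "one level higher" picture, under toy assumptions that are not in the post: the response function `mu` is a hypothetical contraction, the noise term $v$ is set to zero, and level $k$ simply applies `mu` $k$ times to an initial guess. It only illustrates that each timestep corresponds to a finite number of applications, with the fixed point reached in the limit.

```python
def mu(z):
    """Hypothetical response function: the outcome given a published prediction z.

    Chosen as a contraction with fixed point 2.0 (since 0.5 * 2.0 + 1.0 == 2.0).
    The noise term v from the comment is assumed to be zero here.
    """
    return 0.5 * z + 1.0


def levels(z0, n):
    """Return [z0, mu(z0), mu(mu(z0)), ...], i.e. predictions at levels 0..n."""
    out = [z0]
    for _ in range(n):
        # Level k+1 models the outcome as mu applied to the level-k prediction.
        out.append(mu(out[-1]))
    return out


FIXED_POINT = 2.0
preds = levels(0.0, 20)
gaps = [abs(p - FIXED_POINT) for p in preds]

for k in (0, 1, 2, 5, 20):
    print(f"level {k:2d}: prediction {preds[k]:.6f}, gap to fixed point {gaps[k]:.2e}")
```

At every finite level the prediction differs from the fixed point; the gap shrinks geometrically only because this particular `mu` was assumed to be a contraction. For a non-contracting response function the same iteration need not converge.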

The BFO can generally cope with humans observing $z_{t} = f (y_{t})$

Should this be $z_{t} = f(x_{t})$?

Stuart_Armstrong (7y)

Thanks, corrected that.

Ofer (7y)

Unrelated to the effects of the outputs, I think the BFO approach might have the following limitation (this is inspired by your comment on a setup I once described):

It might be the case that in turn $t'$, when the programmers try to use the BFO to get answers to some hard questions, the "training distribution" at that point would be "easier" (or otherwise different) than the "test distribution" in some important sense, such that the BFO won't provide useful answers. For example, perhaps a BFO couldn't be used, soon after development, to answer: "If we give $10,000,000 to a research lab X for their project Y, will the first FDA-approved cure for Alzheimer's be approved at least twice as soon?".

But even if that's the case, a BFO might be sufficiently useful, even if it could only ever answer questions that are not very different from the previous ones. A gradual process of answering harder and harder questions might provide sufficient value sufficiently quickly.

Wei Dai (6y)

The BFO can generally cope with humans observing $z_{t} = f (x_{t})$ and modifying our behaviour because of it (ie does not need a counterfactual approach).

Stuart, what's your current opinion on backward-facing oracles? I ask because it seems like in a later post you changed your mind and went back to thinking that a counterfactual approach is needed to avoid manipulation after all. Is that right?

However, it is not a fixed point that the BFO is likely to find, because $z_{i} + ϵ$ would not be such an encoding for almost all $ϵ$, so the basin of attraction for this $z_{i}$ is tiny (basically only those $ϵ$ sufficiently small to preserve all the digits of $z_{i}$). Thus the BFO is very unlikely to stumble upon it.

This seems to constitute a form of overfitting or failure to generalize that is likely to be removed by future advances in ML or AI in general (if there aren't already proposed regularizers that can do so). (A more capable AI wouldn't need to have actually "stumbled upon" the basin of attraction of this fixed point in its past data in order to extrapolate that it exists.) If you stick to using the "standard ML" BFO in order to be safe from such manipulation, wouldn't you lose out in competitiveness against other AIs due to this kind of overfitting / failure to generalize?
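The basin-of-attraction claim in the quoted passage can be made concrete with a toy sketch. Everything here is an assumption for illustration, not the post's model: `step` is a hypothetical update map with one widely attracting fixed point (`WIDE`) and one fixed point (`NARROW`) that attracts only starts within `HALF_WIDTH` of it. The sketch only shows that naive iteration from spread-out starting points rarely lands in the narrow basin; it says nothing about a learner that can extrapolate that the fixed point exists without visiting it.

```python
WIDE = 0.5          # fixed point with a large basin of attraction (assumed)
NARROW = 0.9        # fixed point with a tiny basin of attraction (assumed)
HALF_WIDTH = 1e-3   # only starts within this distance of NARROW are attracted to it


def step(z):
    """One update of a hypothetical self-confirming prediction on [0, 1]."""
    target = NARROW if abs(z - NARROW) < HALF_WIDTH else WIDE
    # Move halfway towards whichever fixed point currently attracts z.
    return target + 0.5 * (z - target)


def settle(z, n_steps=60):
    """Iterate the update map n_steps times from the starting prediction z."""
    for _ in range(n_steps):
        z = step(z)
    return z


# Evenly spaced starting predictions on [0, 1], standing in for "wherever the
# oracle happens to begin".
starts = [i / 10000 for i in range(10001)]
finals = [settle(z) for z in starts]

n_narrow = sum(abs(z - NARROW) < 1e-6 for z in finals)
n_wide = sum(abs(z - WIDE) < 1e-6 for z in finals)
frac_narrow = n_narrow / len(starts)

print(f"starts ending at the narrow fixed point: {n_narrow} of {len(starts)}")
print(f"starts ending at the wide fixed point:   {n_wide} of {len(starts)}")
```

The fraction of starts that reach `NARROW` is roughly the width of its basin, which is the sense in which a procedure relying only on such iteration is "unlikely to stumble upon it".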

Stuart_Armstrong (6y)

Stuart, what's your current opinion on backward-facing oracles? I ask because it seems like in a later post you changed your mind and went back to thinking that a counterfactual approach is needed to avoid manipulation after all. Is that right?

Basically, yes. I'm no longer sure the terminology and concept of BFO is that useful, and I think all self-confirming oracles have problems. I also believe that these problems need not be cleanly of a "manipulation" or "non-manipulation" type.

I also believe there are smoother ways to reach "manipulation", so my point starting "However, it is not a fixed point that the BFO is likely to find..." is wrong.

I'll add a comment at the beginning of the post, clarifying this post is no longer my current best model.

