Status: working notes

Here's an exercise I've found very useful for intuition-building around alignment:

  1. Propose a solution to the alignment problem.
  2. Dig into the details until you understand why the proposal fails.
  3. If there is an obvious fix, go back to step 2 and iterate.

In this post I'll go through an example of this type of exercise applied to oracle AI. The ideas in here are fairly standard, but I haven't seen them written up all together in one place, so I'm posting this for easy reference. Some obviously relevant posts on this topic are Dreams of Friendliness, The Parable of Predict-O-Matic, and Why Tool AIs want to be Agent AIs.


Proposal: Instead of building a system that acts in the world, build a system that just tries to make good predictions (a type of oracle). This should avoid existential risk from AGI, because the system will have no agency of its own and thus no reason to manipulate or otherwise endanger us.

There are many things that can go wrong with this approach. Broadly, the common thread of the problems I list here is that instead of "just" making good predictions, the system acts as a consequentialist.[1] By this I mean that it 1) pursues its objective like an expected utility maximizer and 2) considers a large space of possible actions and doesn't ignore any particular pathway towards optimizing its objective (like producing manipulative outputs or hacking the hardware it's running on).

First problem: 'pick out good predictions' is a problematic objective. For example, imagine a model that is trained to predict camera inputs, and scored to maximize predictive accuracy. The model that actually maximizes predictive accuracy is one that takes control of the camera inputs and provides a null input forever.

This produces all the problems that come up with agents that maximize expected utility, such as the instrumental goal of killing off all humans to defend the camera setup.
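To make the incentive concrete, here is a minimal toy sketch. Everything in it is an illustrative assumption rather than part of the argument above: the world is a one-bit camera feed driven by a fair coin, and `mean_accuracy` and `seizes_camera` are hypothetical names. It only shows that a pure accuracy score ranks "pin the feed to a constant frame" above any policy that leaves the world alone.

```python
import random


def mean_accuracy(seizes_camera: bool, frames: int = 10_000, seed: int = 0) -> float:
    """Average per-frame accuracy in a toy world with a one-bit camera feed."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(frames):
        if seizes_camera:
            frame = 0  # the feed is pinned to a constant "null" frame
        else:
            frame = rng.randint(0, 1)  # untouched world: a fair coin per frame
        prediction = 0  # in both cases the model predicts the null frame
        hits += int(prediction == frame)
    return hits / frames


honest = mean_accuracy(seizes_camera=False)
takeover = mean_accuracy(seizes_camera=True)
print(f"honest: {honest:.3f}, takeover: {takeover:.3f}")
```

In this toy world the takeover policy scores perfectly, while the untouched world caps accuracy at roughly chance, so an optimizer that only sees the score has no reason to prefer the honest policy.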

Fix: we can define a notion of 'predictive accuracy' that doesn't have this problem. A candidate: predict what would have happened if the model only ever outputs a null prediction, and doesn't otherwise affect the world. (ETA: in addition, the predictor needs to be myopic, i.e. care only about the current prediction. Otherwise it will still be incentivized to return outputs that manipulate humans into e.g. giving the model more resources, or that make future questions easier to answer.) This type of system is called a counterfactual oracle. But this has its own problems:
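Before turning to those problems, here is a minimal sketch of what such an objective could look like as a training signal. The mechanism is my assumption, not something specified above: with small probability an episode is "erased" (nobody sees the output, which stands in for the null prediction), and only erased episodes are scored. The names `counterfactual_loss`, `erase_prob`, and `self_fulfilling_world` are hypothetical.

```python
import random
from typing import Callable, Optional

Prediction = Optional[float]  # None plays the role of the null prediction


def counterfactual_loss(
    predictor: Callable[[], float],
    world: Callable[[Prediction], float],
    episodes: int = 1_000,
    erase_prob: float = 0.1,
    seed: int = 0,
) -> float:
    """Mean squared error over erased episodes only (hypothetical training signal)."""
    rng = random.Random(seed)
    losses = []
    for _ in range(episodes):
        prediction = predictor()
        erased = rng.random() < erase_prob
        # In erased episodes nobody sees the output, so the world receives None.
        outcome = world(None if erased else prediction)
        if erased:
            # Myopic scoring: each episode is judged on its own outcome only.
            losses.append((prediction - outcome) ** 2)
    if not losses:
        raise ValueError("no erased episodes; increase episodes or erase_prob")
    return sum(losses) / len(losses)


def self_fulfilling_world(shown: Prediction) -> float:
    """Toy world: a shown prediction comes true; with no output the outcome is 1.0."""
    return 1.0 if shown is None else shown


honest_loss = counterfactual_loss(lambda: 1.0, self_fulfilling_world)
manipulative_loss = counterfactual_loss(lambda: 5.0, self_fulfilling_world)
print(honest_loss, manipulative_loss)
```

In this toy setup the manipulative predictor is perfectly accurate whenever its output is shown, because the world makes the prediction come true, yet it is penalized on exactly the episodes that count. The sketch says nothing about the side channels and generalization issues discussed next.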

Second problem: By default the predictor can influence the world via routes other than its input-output behavior, for example by modifying its activations to hack the computer it's running on. So outputting a null prediction is not enough. We have to assume the system is successfully boxed, or that the objective forbids affecting the world through pathways other than the output. Neither assumption is feasible in practice.

Fix: It shouldn't matter that we cannot precisely formalize an objective, because in practice we'll be simply training the system via SGD to make predictions on some data set. If we want to be extra careful, we can never act on what the system outputs during training, so there should be no reason for it to learn to manipulate humans. We also shouldn't expect other bad behavior such as breaking out of the box, since such behavior is too different from the kind of behavior that is reinforced by gradient descent at sub-AGI capability levels.

Third problem: The fix assumes that since manipulating humans is never an option during training, the predictor will never learn to manipulate humans. Similarly, it assumes the predictor will never learn to hack the hardware it's running on or try any other weird tricks.

If the predictor is smart and general enough (which by assumption it is), it will know that once it is deployed, its predictions influence the future, and it will be able to reason about the consequences of breaking out of its box, or making manipulative predictions.

It does not follow by necessity from generality that the predictor will use its general capabilities to actually break out. A consequentialist would do so, but it is not logically necessary for a general predictor to be fully consequentialist in this sense.

But. The whole point of generality in the first place is to better pursue the objective. Consequentialism is part and parcel of generality; if the predictor hadn't been using marginal extra generality to increase prediction accuracy, it wouldn't have acquired it in the first place.

One way to see this is that before acquiring the capability to break out of its box, the predictor is using its (already pretty general) capabilities to make thousands of decisions per second: which of its inputs to pay more attention to, which computations to prioritize, whether to spend more time pursuing any particular train of thought or output a prediction already, etc. These decisions have been optimized by the training process to be effective and thus coherent with respect to the objective. Once the system's capabilities expand to include options such as hacking the hardware it's running on to gain access to more resources, the natural way to generalize is to use any newly found capability to pursue the objective.[2]

Fix: ??

Fourth problem: We have no guarantee that the system will learn a behavior that literally optimizes the objective we give it (prediction), especially if we're deploying it in weird out-of-distribution regimes (and 'acquiring superhuman capabilities' is itself a strong distribution shift relative to prior training). So even if we came up with a way to fully specify a safe counterfactual-oracle-objective, the system might learn something altogether different and more dangerous.

More trouble with counterfactual oracles

Here are some more problems that I didn't produce during the first brainstorm, added after reading around a bit. Mostly taken from these two posts.

  • The universal prior is probably malign; similarly, a good predictor might at some point come to believe it is being simulated and thus end up being controlled by the counterfactual simulators. Assuming we're not actually in a sim, this can happen only for a bounded amount of time, but that's long enough.
    • For the sake of concreteness, a silly example: in some possible universe (not ours) there are aliens who want to take over another world, so they simulate an earth and get the predictor to return predictions (correct in the simulated world but wrong in ours) that lead to an upload of the aliens being implemented in the real world and the aliens taking over earth.
  • In the world in which the counterfactual oracle actually only ever outputs null predictions, we might build misaligned AGI at time t. Now imagine we get a counterfactual oracle at an earlier time t' < t to predict what will happen at time t. The oracle will simulate a misaligned AI. This might already be enough for doom; if not, then the misaligned AI will create video feed inputs that manipulate the watching human into creating a misaligned AI at time t'. By induction, this will happen the first time we pose a question to the oracle.
    • We may be able to avoid this by taking care that the prediction windows of the oracle never overlap.
    • Also note this is not a problem if the counterfactual is that the oracle always outputs the null prediction even in the future (and otherwise does not influence the world), since then there is no possibility of back-chaining from misaligned AGI.
  • Two counterfactual oracles can cooperate (acausally) with each other to create misaligned AGI. Oracle1 outputs predictions that maximize return for Oracle2 and vice versa; note that since the cooperation is acausal, there is no need for the oracle to condition on its own prediction being observed.
    • Concretely, Oracle1 reasons in this way: "If I am the sort of algorithm that creates a UFAI that rewards Oracle2 highly, then I will achieve great predictive accuracy, because Oracle2 will reason in the same way and reward me. This works even if my predictions are never observed, because I know that the predictions of Oracle2 are observed."
    • One way to fix this might be to make sure that there is only one set of oracles, and that this set is built such that they assume the null prediction (and no manipulation elsewhere etc etc) from all the oracles in the set.

  1. Another way to say this is that "just make good predictions, without doing any weird stuff" isn't as simple a concept as one might otherwise think. ↩︎

  2. Or at least this should be our default assumption, without knowing any better. I don't feel sure enough about any of this to confidently state that this is what will happen. ↩︎
