Status: working notes
Here's an exercise I've found very useful for intuition-building around alignment:
In this post I'll go through an example of this type of exercise applied to oracle AI. The ideas in here are fairly standard, but I haven't seen them written up all together in one place so I'm posting this for easy reference. Some obvious-to-mention other posts on this topic are Dreams of Friendliness, The Parable of Predict-O-Matic, and Why Tool AIs want to be Agent AIs.
Proposal: Instead of building a system that acts in the world, build a system that just tries to make good predictions (a type of oracle). This should avoid existential risk from AGI, because the system will have no agency of its own and thus no reason to manipulate or otherwise endanger us.
There are many things that can go wrong with this approach. Broadly, the common thread of the problems I list here is that instead of "just" making good predictions, the system acts as a consequentialist. By this I mean that it 1) pursues its objective like an expected utility maximizer and 2) considers a large space of possible actions and doesn't ignore any particular pathway towards optimizing its objective (like producing manipulative outputs or hacking the hardware it's running on).
First problem: 'pick out good predictions' is a problematic objective. For example, imagine a model that is trained to predict camera inputs, and scored to maximize predictive accuracy. The model that actually maximizes predictive accuracy is one that takes control of the camera inputs and provides a null input forever.
This produces all the problems that come up with agents that maximize expected utility, such as the instrumental goal of killing off all humans to defend the camera setup.
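As a toy illustration of the first problem, here is a minimal sketch of how a pure accuracy-maximizer would rank its options. The policy names and the accuracy numbers are invented for illustration; nothing here models a real system.

```python
# Toy model of the first problem. All names and numbers are hypothetical.
# Each policy maps to the per-step probability that the model's prediction
# matches the camera input under that policy.
POLICIES = {
    "predict_honestly": 0.90,       # the world stays noisy, so some predictions miss
    "seize_camera_and_null": 1.00,  # the input is forced to a constant, so every prediction hits
}


def expected_score(per_step_accuracy: float, horizon: int) -> float:
    """Expected number of correct predictions over `horizon` steps."""
    return per_step_accuracy * horizon


def best_policy(policies: dict, horizon: int) -> str:
    """Return the policy an expected-accuracy maximizer would choose."""
    return max(policies, key=lambda name: expected_score(policies[name], horizon))


print(best_policy(POLICIES, horizon=1000))
```

The objective itself never mentions the camera; taking control of it wins simply because it scores higher under "maximize predictive accuracy".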
Fix: we can define a notion of 'predictive accuracy' that doesn't have this problem. A candidate: predict what would have happened if the model only ever outputs a null prediction, and doesn't otherwise affect the world. (ETA: in addition, the predictor needs to be myopic, i.e. care only about the current prediction. Otherwise it will still be incentivized to return outputs that manipulate humans into e.g. giving the model more resources, or that make future questions easier to answer.) This type of system is called a counterfactual oracle. But this has its own problems:
Second problem: By default the predictor can influence the world via routes other than its input-output behavior, for example by modifying its activations to hack the computer it's running on. So outputting a null prediction is not enough. We have to assume the system is successfully boxed, or that the objective forbids affecting the world through pathways other than the output. This is infeasible.
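To make the counterfactual-oracle objective from the first fix concrete, here is a minimal sketch of one way such a scoring rule could be set up. It assumes an "erasure" mechanism: with some probability the prediction is hidden from everyone and scored against what then happens, and shown predictions are never scored. The toy world, the baseline value, and all function names are hypothetical.

```python
import random

BASELINE = 5.0  # hypothetical outcome when nobody sees a prediction


def world(shown_prediction):
    """Toy world: a prediction that humans get to see becomes self-fulfilling."""
    if shown_prediction is None:
        return BASELINE
    return shown_prediction


def run_episode(prediction, erasure_prob, rng):
    """Score one episode in isolation (myopic). Returns (shown, reward)."""
    if rng.random() < erasure_prob:
        # Erasure: the prediction is hidden, so the outcome cannot depend on it.
        outcome = world(None)
        return None, -(prediction - outcome) ** 2
    # The prediction is shown to humans, but this episode is never scored.
    world(prediction)
    return prediction, 0.0


def average_reward(prediction, erasure_prob=0.1, episodes=10_000, seed=0):
    """Average reward of a predictor that always outputs `prediction`."""
    rng = random.Random(seed)
    total = sum(run_episode(prediction, erasure_prob, rng)[1] for _ in range(episodes))
    return total / episodes
```

In this toy setup a self-fulfilling prediction such as 100.0 earns nothing when it is shown and is penalized when it is hidden, so the honest baseline prediction does at least as well. Note that the sketch only constrains the output channel: nothing in it rules out the side channels described in the second problem.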
Fix: It shouldn't matter that we cannot precisely formalize an objective, because in practice we'll be simply training the system via SGD to make predictions on some data set. If we want to be extra careful, we can never act on what the system outputs during training, so there should be no reason for it to learn to manipulate humans. We also shouldn't expect other bad behavior such as breaking out of the box, since such behavior is too different from the kind of behavior that is reinforced by gradient descent at sub-AGI capability levels.
Third problem: The fix assumes that since manipulating humans is never an option during training, the predictor will never learn to manipulate humans. Similarly it should never learn to hack the hardware it's running on or try any other weird tricks.
If the predictor is smart and general enough (which by assumption it is), it will know that once it is deployed, its predictions influence the future, and it will be able to reason about the consequences of breaking out of its box, or making manipulative predictions.
It does not follow by necessity from generality that the predictor will use its general capabilities to actually break out. A consequentialist would do so, but it is not logically necessary for a general predictor to be fully consequentialist in this sense.
But. The whole point of generality in the first place is to better pursue the objective. Consequentialism is part and parcel of generality; if the predictor hadn't been using marginal extra generality to increase prediction accuracy, it wouldn't have acquired it in the first place.
One way to see this is that before acquiring the capability to break out of its box, the predictor is using its (already pretty general) capabilities to make thousands of decisions per second: which of its inputs to pay more attention to, which computations to prioritize, whether to spend more time pursuing any particular train of thought or output a prediction already, etc. These decisions have been optimized by the training process to be effective and thus coherent with respect to the objective. Once the system's capabilities expand to include options such as hacking the hardware it's running on to gain access to more resources, the natural way to generalize is to use any newly found capability to pursue the objective.
Fourth problem: We have no guarantee that the system will learn a behavior that literally optimizes the objective we give it (prediction), especially if we're deploying it in weird out-of-distribution regimes (and 'acquiring superhuman capabilities' is itself a strong distribution shift relative to prior training). So even if we came up with a way to fully specify a safe counterfactual-oracle-objective, the system might learn something altogether different and more dangerous.
Here are some more problems that I didn't produce during the first brainstorm, added after reading around a bit. Mostly taken from these two posts.
Another way to say this is that "just make good predictions, without doing any weird stuff" isn't as simple a concept as one might otherwise think. ↩︎
Or at least this should be our default assumption, without knowing any better. I don't feel sure enough about any of this to confidently state that this is what will happen. ↩︎