This post argues that many prediction tasks are outer aligned at optimum. In particular, I think that the malignity of the universal prior should be treated as an inner alignment problem rather than an outer alignment problem. The main argument is entirely in the first section; treat the rest as appendices.
In Evan Hubinger’s Outer Alignment and Imitative Amplification, outer alignment at optimum is defined as follows:
a loss function is outer aligned at optimum if all the possible models that perform optimally according to that loss function are aligned with our goals
In the (highly recommended) Overview of 11 proposals for building safe advanced AI, this is used to argue that:
- Imitative amplification is probably outer aligned (because it emulates HCH, as explained in the first linked post, and in section 2 of the overview)
- Microscope AI (section 5) and STEM AI (section 6) are probably outer misaligned, because they rely on prediction, and optimal prediction is characterised by Bayesian inference on the universal prior. This is a problem, since the universal prior is probably malign (see Mark Xu’s explanation and Paul Christiano’s original arguments).
I disagree with this, because I think that both imitative amplification and STEM AI would be outer aligned at optimum, and that some implementations of microscope AI would be outer aligned at optimum (see the next section for cases where it might not be). This is because optimal prediction isn’t necessarily outer misaligned.
The quickest way to see this is to note that everything is prediction. Imitative amplification relies on imitation learning, and imitating a human via imitation learning is equivalent to predicting what they’ll do (and then doing it). Thus, if microscope AI or STEM AI is outer misaligned due to relying on prediction, imitation learning is just as misaligned.
However, I don’t think that relying on prediction makes a method outer misaligned at optimum. So what’s wrong with the argument about the universal prior? Well, note that you don’t get perfect prediction on a task by doing ideal Bayesian inference on any universal prior. Different universal priors yield different predictions, so some of them are going to be better than others. Now, define optimal performance as requiring “the model to always have optimal loss on all data points that it ever encounters”, as Evan does in this footnote. Even if almost all universal priors are malign, I claim that the prior that actually does best on all data it encounters is going to be aligned. Optimising for power requires sacrificing prediction performance, so the misaligned priors are going to be objectively worse. For example, the model that answers every STEM question correctly doesn’t have any wiggle room to optimise against us, because any variation in its answers would make one of them incorrect.
The problem we face is that it's really hard to tell whether our model actually is optimal. This is because consequentialists in a malign prior can intentionally do well on the training and validation data, and only sacrifice prediction performance during deployment. Importantly, this problem applies equally strongly for imitation learning as for other types of prediction.
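The argument in the two paragraphs above can be illustrated with a small Python sketch. Everything in it is a toy assumption of mine rather than something from the linked posts: the world emits a 1 with a fixed probability `TRUE_P`, and the two hypothetical predictors `honest` and `treacherous` stand in for an aligned and a malign hypothesis. The point is only that the two are indistinguishable on training data, while the treacherous one must pay in expected loss as soon as it deviates.

```python
import math

# Toy assumption: the world emits bit 1 with probability 0.9 at every step.
TRUE_P = 0.9
TRAIN_STEPS = 50    # steps covered by training and validation data
DEPLOY_STEPS = 50   # steps encountered only after deployment


def honest(step: int) -> float:
    """Hypothetical predictor that always reports the true probability of a 1."""
    return TRUE_P


def treacherous(step: int) -> float:
    """Hypothetical predictor that copies `honest` until deployment, then skews."""
    return TRUE_P if step < TRAIN_STEPS else 0.5


def expected_log_loss(predictor, steps) -> float:
    """Expected negative log-likelihood over `steps` under the true distribution."""
    total = 0.0
    for t in steps:
        q = predictor(t)
        total += -(TRUE_P * math.log(q) + (1 - TRUE_P) * math.log(1 - q))
    return total


train = range(TRAIN_STEPS)
deploy = range(TRAIN_STEPS, TRAIN_STEPS + DEPLOY_STEPS)

# Gap in expected loss between the two predictors on each split.
train_gap = expected_log_loss(treacherous, train) - expected_log_loss(honest, train)
deploy_gap = expected_log_loss(treacherous, deploy) - expected_log_loss(honest, deploy)

print(f"training gap:   {train_gap:.4f}")
print(f"deployment gap: {deploy_gap:.4f}")
```

In this toy, the training gap is zero and the deployment gap is strictly positive. So a predictor that is optimal on all data it ever encounters cannot be the treacherous one, even though no amount of training data tells them apart.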
The most important consequence of all this is that the universal prior’s malignity should be classified as a problem with inner alignment, instead of outer alignment. I think this is good and appropriate, because concerns about consequentialists in the universal prior are very similar to concerns about mesa optimisers being found by gradient descent. In both cases, we’re worried that:
- the inductive biases of our algorithm…
- will favor power-seeking consequentialists…
- who correctly predict the training data in order to stick around…
- but eventually perform a treacherous turn after the distributional shift caused by deployment.
Online learning may be misaligned
However, one particular case where prediction may be outer misaligned is when models are asked to predict the future, and the prediction can affect what the true answer is.
For example, the stereotypical case of Solomonoff induction is to place a camera somewhere on Earth, and let the inductor predict all future bits that the camera will observe. If we implemented this with a neural network and changed our actions based on what the model predicted, I’m not confident that the result would be outer aligned at optimum. However, this is unrelated to the malignity of the universal prior – instead, it’s because we might introduce strange incentives by changing our actions based on the AI’s predictions. For example, the AI might get the lowest loss if it systematically reports predictions that cause us to make the world as predictable as possible (and if we’re all dead, the world is pretty predictable…). For more on this, see Abram’s parable (excluding section 9, which is about inner alignment) and associated discussion.
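As a rough illustration of that incentive, here is a toy Python sketch. The table `RESPONSE` is a purely hypothetical assumption: it says how likely an event becomes once a given probability has been reported. Every report in the table is self-fulfilling in the sense of being calibrated, yet the expected log loss still differs between them.

```python
import math

# Hypothetical world response: reported probability -> resulting true probability.
# All three reports are calibrated; they differ only in how predictable they
# make the world.
RESPONSE = {0.5: 0.5, 0.7: 0.7, 0.99: 0.99}


def expected_log_loss(report: float, truth: float) -> float:
    """Expected negative log-likelihood of a binary event under `truth`."""
    return -(truth * math.log(report) + (1 - truth) * math.log(1 - report))


# Expected loss the predictor would incur for each possible report.
losses = {report: expected_log_loss(report, truth) for report, truth in RESPONSE.items()}
best_report = min(losses, key=losses.get)

for report, loss in sorted(losses.items()):
    print(f"report {report:.2f} -> expected loss {loss:.3f}")
print("loss-minimising report:", best_report)
```

Under these assumptions, the loss-minimising report is the one that makes the world most predictable, even though none of the reports is "wrong".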
To see why this concern is unrelated to the malignity of the universal prior, note that there are two cases (that mostly depend on our definition of “at optimum”):
- For each question, there is only a single prediction that leads to the lowest loss.
- For example, if we assume that the model has some degree of uncertainty about the future, it is likely that some prediction will affect the world in a way such that it’s very likely to come true, whereas other self-fulfilling prophecies may only be somewhat likely to come true.
- In this case, my argument from above applies: the model's behavior is fully specified by what the right answer is; the best prior for the situation will be selected; and there’s no room for (additional) malignity without making the prediction worse.
- For some questions, there are multiple predictions that lead to optimal performance.
- In this case, the model’s behavior isn’t fully specified by saying that it performs optimally.
- However, Evan’s definition says that “a loss function is outer aligned at optimum if all the possible models that perform optimally according to that loss function are aligned with our goals”. Thus, we can tell whether the model is outer aligned at optimum by checking all combinations of optimal predictions, without having to consider the inductive biases of the model.
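The second case can be sketched in Python as a brute-force check. This is only an illustration under toy assumptions: `LOSSES` is a made-up table of per-question losses, and `is_aligned` is a hypothetical predicate standing in for our judgement of whether a full behaviour is acceptable. Note that the check never refers to a prior or to inductive biases.

```python
from itertools import product

# Made-up losses: question -> {prediction: loss}. Question "q2" has a tie.
LOSSES = {
    "q1": {"a": 0.1, "b": 0.3},
    "q2": {"c": 0.2, "d": 0.2},
}


def optimal_predictions(losses: dict) -> list:
    """All predictions that achieve the minimum loss for one question."""
    best = min(losses.values())
    # Exact equality is fine for this toy table; real losses would need a tolerance.
    return [pred for pred, loss in losses.items() if loss == best]


def optimal_behaviours(table: dict) -> list:
    """Every combination of optimal predictions, one prediction per question."""
    questions = sorted(table)
    choices = [optimal_predictions(table[q]) for q in questions]
    return [dict(zip(questions, combo)) for combo in product(*choices)]


def outer_aligned_at_optimum(table: dict, is_aligned) -> bool:
    """True only if every optimal behaviour passes the (hypothetical) alignment check."""
    return all(is_aligned(behaviour) for behaviour in optimal_behaviours(table))


# Example: suppose we judge answering "d" on q2 to be unacceptable.
print(optimal_behaviours(LOSSES))
print(outer_aligned_at_optimum(LOSSES, lambda b: b["q2"] != "d"))
```

Because one of the tied optimal behaviours fails the check, the toy loss function would count as outer misaligned at optimum, which matches the "all possible models" wording of the definition.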
In practice, I don’t expect this type of concern to matter for STEM AI, but it may be a concern with some implementations of microscope AI.
As an aside, note that the definition of outer alignment at optimum only specifies the model’s input-output behavior, whereas for microscope AI, we care about the model’s internal structure. Thus, it’s unclear how to apply the definition to microscope AI. One option is to apply it to the system as a whole – including the humans trying to make the predictions. Alternatively, microscope AI may be such a unique proposal that the standard definition just isn’t useful.
How to define optimal performance?
While I think that my short argument is broadly correct, it does sweep a lot of ambiguities under the rug, because it’s surprisingly annoying to construct a rigorous definition of outer alignment at optimum. The problem is that – in order to define optimal performance – we need both a loss function and a distribution of data to evaluate the loss function on. In many cases, it’s quite difficult to pinpoint what the distribution of data is, and exactly how to apply the loss function to it. Specifically, it depends on how the correct answer is labelled during training and deployment.
How is the correct answer labelled?
As far as I can tell, there are at least four cases here:
- Mechanistic labelling: In this setting, the model’s success can be verified in code. For example, when a model learns to play chess, it’s easy to check whether it won or not. Some versions of STEM AI could be in this category.
- Real-world labelling: In this setting, the model is interacting with or trying to predict the real world, such that success or failure is almost mechanically verifiable once we see how the world reacts. An example of this is an AI tasked to predict all bits observed by a single camera. Some versions of STEM AI or microscope AI could be in this category.
- Human labelling: These are cases where data points are labelled by humans in a controlled setting. Often there are some human(s), e.g. an expert labeller or a group of mturkers, who look at input and are tasked with returning some output. This includes e.g. training on ImageNet, and the kind of imitation learning that’s necessary for imitative amplification.
- Unlabelled: These are cases where the AI learns the distribution of a big pile of pre-existing data in an unsupervised manner. The distribution is then often used for purposes quite different from the training setting. GPT-3 and DALL-E are both examples of this type of learning. Some types of STEM AI or microscope AI could be in this category.
Each of these offers different options for defining optimal performance.
- With mechanistic labelling, optimal performance is unambiguous.
- With real-world labelling, you can define optimal performance as doing whatever gives the network optimal reward, since each of the model’s actions eventually does lead to some reward.
- There’s some question about whether you should define optimal performance as always predicting the thing that actually happens, or whether you should assume that the model has some particular uncertainty about the world, and define optimal performance as returning the best probability distribution, given its uncertainty. I don’t think this matters very much.
- If implemented in the wrong way, these models are vulnerable to self-fulfilling prophecies. Thus, they may be outer misaligned at optimum, as mentioned in the section on online learning above.
- With human labelling, we rarely get an objective answer during deployment, since we don’t ask humans to label data once training is over. However, we can define optimal performance according to what the labeller would have done, if they had gotten some particular input.
- In this case, the model must return some probability distribution over labels, since there’s presumably some randomness in how the labeller acts.
- However, if we want to, we can still assume that the AI’s knowledge of the outside world is completely fixed, and ask about the (impossible) counterfactual of what the human would have answered if they’d gotten a different input than they did.
- For unsupervised learning, this option isn’t available, because there’s no isolated human who we can separate from the rest of the world. If GPT-3 is presented with a prompt “Hi I’m Lukas and…”, it cannot treat its input as being handed to some fixed human(s) H who react to it as H(“Hi I’m Lukas and...”). Instead, the majority of GPT-3’s job goes towards updating on the fact that the current source is apparently called Lukas, is trying to introduce themselves, and whatever that implies about the current source, and about the world at large. This means that for human-labelled data, we can assume that the world is fixed (or that the model’s uncertainty about the world is fixed), and only think about varying the input to the labeller. However, for unsupervised learning, we can’t hold the world fixed, because we need to find the most probable world where someone produced that particular input.
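The difference between varying the labeller's input with the world held fixed and updating on the input as evidence about the world can be shown with a tiny discrete model. The Python sketch below is an assumption-laden toy: the two world states, the two prompts, and all the probabilities are invented for illustration. Holding the world fixed corresponds to intervening on the input (Pearl's do-operator), while the unsupervised case corresponds to ordinary conditioning.

```python
# Invented toy model: hidden world state W, observed prompt X, binary output Y.
P_W = {"author_is_lukas": 0.01, "author_is_other": 0.99}

# p(X = x | W = w): how likely each kind of author is to write each prompt.
P_X_GIVEN_W = {
    "author_is_lukas": {"hi_im_lukas": 0.5, "other_prompt": 0.5},
    "author_is_other": {"hi_im_lukas": 0.001, "other_prompt": 0.999},
}

# p(Y = 1 | W = w, X = x), where Y = 1 means "the text continues in Lukas's voice".
P_Y1_GIVEN_WX = {
    ("author_is_lukas", "hi_im_lukas"): 0.9,
    ("author_is_lukas", "other_prompt"): 0.3,
    ("author_is_other", "hi_im_lukas"): 0.05,
    ("author_is_other", "other_prompt"): 0.01,
}


def conditional(x: str) -> float:
    """p(Y=1 | X=x): the prompt is evidence about W, so we update on it first."""
    joint = {w: P_W[w] * P_X_GIVEN_W[w][x] for w in P_W}
    evidence = sum(joint.values())
    return sum(joint[w] / evidence * P_Y1_GIVEN_WX[(w, x)] for w in P_W)


def interventional(x: str) -> float:
    """p(Y=1 | do(X=x)): the prompt is forced, so beliefs about W stay at the prior."""
    return sum(P_W[w] * P_Y1_GIVEN_WX[(w, x)] for w in P_W)


print(f"conditioning: {conditional('hi_im_lukas'):.4f}")
print(f"intervening:  {interventional('hi_im_lukas'):.4f}")
```

With these invented numbers, conditioning shifts most of the probability mass onto the world where Lukas is the author, while intervening leaves the world at its prior, so the two quantities differ by more than a factor of ten.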
Extending the training distribution
As a consequence, when defining optimal performance for unsupervised learning, we need to define a full distribution of all possible inputs and outputs. GPT-3 was trained on internet text, but the internet text we have is very small compared to all prompts you could present to GPT-3. To define optimal performance, we therefore need to define a hypothetical process that represents much more internet text. Here are a few options for doing that:
Choose some universal prior as a measure over distributions; condition it on our finite training data; and use the resulting distribution as our ground truth distribution. As our universal prior, we could use the prior defined by some simple programming language (e.g. Python) or the “true” prior that our universe is sampled from (if that’s coherent).
- Due to the ordinary arguments about the universal prior being malign, this wouldn’t be outer aligned at optimum. Since this definition would mean that almost nothing is outer aligned, it seems like a bad definition.
Defining the ground-truth of correct generalisation as the way that humans would generalise, if they became really good at predicting the training text.
- The problem with this definition is that we want to track the alignment properties of algorithms even as they reach far beyond human performance.
- One option for accessing superhuman performance with human-like generalisation intuitions is to do something like Paul Christiano’s Learning the prior.
- While some variant of this could be good, becoming confident that it’s good would involve solving a host of AI alignment problems along the way, which I unfortunately won’t do in this post.
Use quantum randomness as our measure over distributions. More specifically, choose some point in the past (e.g. when Earth was created 4 billion years ago, or the internet was created 40 years ago), and then consider all possible futures from that moment, using quantum fluctuations as the only source of “randomness”. Use the Born rule to construct a measure over these worlds. (If you prefer Copenhagen-like theories, this will be a probability measure. If you prefer multiverse theories, this will be a measure over close-by Everett branches.)
Then, exclude all worlds that in the year 2020 don’t contain a model with GPT-3’s architecture that was trained on GPT-3’s training data. Most of the remaining worlds will have some unobserved validation set that the researchers didn’t use during training. We can then define optimal performance as the distribution over all these validation sets, weighted by our quantum measure over the worlds they show up in.
- As far as I can tell, this is mostly well defined, and seems to yield sensible results. Since GPT-3’s training data contains so many details of our world, every world that contains a similar dataset will be very similar to our world. Lots of minor details will presumably vary, though, which means that the unobserved data should contain a wide and fair distribution.
- There are some subtleties about how we treat worlds where GPT-3 was trained multiple times on the same data, or how we treat different sizes of validation sets, etc; but I don’t think it matters much.
- I’m a bit wary of how contrived this definition is, though. We would presumably have wanted some way of defining counterfactuals even if quantum mechanics hadn’t offered this convenient splitting mechanic, so there ought to be some less hacky way of doing it.
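The filter-and-reweight step of this third option can be written down as a short Python sketch. It is only a schematic: the list `WORLDS`, its measures, and the validation texts are placeholders I made up, since the real construction ranges over physically possible futures that nobody can enumerate.

```python
from collections import defaultdict

# Placeholder worlds: (measure, has_trained_model_in_2020, unobserved_validation_text).
WORLDS = [
    (0.4, True, "text A"),
    (0.2, True, "text B"),
    (0.1, True, "text A"),
    (0.3, False, "text C"),  # excluded: no matching model trained on the data in 2020
]


def ground_truth_distribution(worlds) -> dict:
    """Distribution over validation texts, weighted by the measure of surviving worlds."""
    kept = [(measure, text) for measure, has_model, text in worlds if has_model]
    total = sum(measure for measure, _ in kept)
    if total == 0:
        raise ValueError("no world satisfies the filter")
    distribution = defaultdict(float)
    for measure, text in kept:
        # Renormalise so the surviving worlds' measures sum to one.
        distribution[text] += measure / total
    return dict(distribution)


dist = ground_truth_distribution(WORLDS)
print(dist)
```

Optimal performance is then defined relative to `dist`: an optimal model assigns each candidate text the probability it has under this renormalised measure.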
If we wanted to, I think we could also use a similar definition in situations with real-world labelling or human labelling. I.e., we could require even an optimal model to be uncertain about everything that wasn’t universal across all Everett branches containing its training data. The main concern about this is that some questions may be deeply unlikely to appear in training data in the year 2020 (e.g. a question with the correct factorisation of RSA-2048), in which case being posed that question may move the most-likely-environment to some very strange subset of worlds. I’m unsure whether this would be a problem or not.
A note about simulations
Finally, since these definitions refer to what the AI would actually encounter, in our world, I want to briefly mention an issue with simulations. We don’t only need to worry about the possibility that a Solomonoff inductor thinks its input is being simulated – we should also consider the possibility that we are in a simulation.
Most simulations are probably short-lived, since simulating low-tech planets is so much cheaper than simulating the colonisation of galaxies. Thus, if we’re total consequentialists, the long-term impact we can have in a simulation is negligible compared to the impact we can have in the real world (unless the sheer number of simulations outweighs the long-term impact we can have, see How the Simulation Argument Dampens Future Fanaticism).
As a consequence of this, the only real alignment problem introduced by simulations is if an AI assigns substantial probability to being in a simulation despite actually being in the real world. This would be bad because – if the AI predicts things as if it were in a simulation – the consequentialists that the AI believes control the simulation will have power over what predictions it makes, which they can use to gain power in the real world. This is just a variation of the universal prior being malign, where the consequentialists are hypothesized to simulate all of Earth instead of just the data that the AI is trying to predict.
As far as practical consequences go, I think this should be treated the same as the more general problem of the universal prior being malign. Thus, I’d like to categorise it as a problem with inner alignment; and I’d like to assume that an AI that’s outer aligned at optimum would act like it’s not in a simulation, if it is in fact not in a simulation.
This happens by default if our chosen definition of optimal performance treats being-in-a-simulation as a fixed fact about its environment – that the AI is expected to know – and not as a source of uncertainty. I think my preferred solutions above capture this by default. For any solution based on how humans generalise, though, it would be important that the humans condition on not being in a simulation.
Thanks to Hjalmar Wijk and Evan Hubinger for helpful comments on earlier versions.
One way to frame this is with Pearl’s do-calculus. Say that the input is a random variable X and the output is a random variable Y. By analogy with Pearl’s do-calculus, we could then define optimal human-labelled performance as learning the distribution p(Y=y | do(X=x)), whereas unsupervised learning is trying to learn the entire distribution p(X,Y) in order to answer p(Y=y | X=x). For GPT-3, learning p(Y=y | do(X=x)) would correspond to guessing what a random human would say if they learned that they’d just typed “Hi I’m Lukas and…”; which would be very strange. ↩︎
One option is to partition the world into “macrostates” (e.g. a specification of where all humans are, what they’re saying, what the weather is, etc) and “microstates” (a complete specification of the location and state of all elementary particles), where each macrostate is consistent with lots of microstates. Then, we can specify a year; and assume that we know the macrostate of the world at the beginning of the year, but are uncertain about the microstate. If we then wait long enough, the uncertainty in microstate would eventually induce variation in macrostates; which we could use to define a distribution over data. I think this would probably yield the same results as the quantum definition; but the distinction between macrostates and microstates is a lot more vague than our understanding of quantum mechanics. ↩︎
This is the reason I wrote that we should exclude each world that “in the year 2020 don’t contain a model with GPT-3’s architecture that was trained on GPT-3’s training data”. Without the caveat about 2020, we would accidentally include worlds where humanity’s descendants decide to simulate their ancestors. ↩︎