Due to the ordinary arguments about the universal prior being malign, this wouldn’t be outer aligned at optimum. Since this definition would mean that almost nothing is outer aligned, it seems like a bad definition. ...

As far as practical consequences go, I think this should be treated the same as the more general problem of the universal prior being malign. Thus, I’d like to categorise it as a problem with inner alignment; and I’d like to assume that an AI that’s outer aligned at optimum would act like it’s not in a simulation, if it is in fact not in a simulation.

This happens by default if our chosen definition of optimal performance treats being-in-a-simulation as a fixed fact about its environment – that the AI is expected to know – and not as a source of uncertainty. I think my preferred solutions above capture this by default[3]. For any solution based on how humans generalise, though, it would be important that the humans condition on not being in a simulation.

This is unsatisfying to me. First you say that we can't define optimum in the obvious way because then very few things would be outer aligned, then you say we should define optimum in such a way that the only way to be outer aligned is to assume you aren't in a simulation. (How else would we get an AI that act's like it's not in a simulation, if it is in fact not in a simulation? You can't tell whether you are in a simulation or not, by definition, so the only way for such an AI to exist is for it to always act like it's not in a simulation, i.e. to assume.) An AI that assumes it isn't in a simulation seems like a defective AI to me, so it's weird to build that in to the definition of outer alignment.

It's possible I'm misunderstanding you though!

[-]Lukas Finnveden5y20

Things I believe about what sort of AI we want to build:

It would be kind of convenient if we had an AI that could help us do acausal trade. If assuming that it's not in a simulation would preclude an AI from doing acausal trade, that's a bit inconvenient. However, I don't think this matters for the discussion at hand, for reasons I describe in the final array of bullet points below.
Even if it did matter, I don't think that the ability to do acausal trade is a deal-breaker. If we had a corrigible, aligned, superintelligent AI that couldn't do acausal trade, we could ask it to scan our brains, then compete through any competitive period on Earth / in space, and eventually recreate us and give us enough time to figure out this acausal trade thing ourselves. Thus, for practical purposes, an AI that assumes it isn't in a simulation doesn't seem defective to me, even if that means it can't do acausal trade.

Things I believe about how to choose definitions:

When choosing how to define our terms, we should choose based on what abstractions are most useful for the task at hand. For the outer-alignment-at-optimum vs inner alignment distinction, we're trying to choose a definition of "optimal performance" such that we can separately:

Design an intent-aligned AI out of idealised training procedures that always yield "optimal performance" on some metric. If we successfully do this, we've solved outer alignment.
Figure out a training procedure that produces an AI that actually does very well on the chosen metric (sufficiently well to be aligned, even if it doesn't achieve absolute optimal performance). If we do this, we've solved inner alignment.

Things I believe about what these candidate definitions would imply:

For every AI-specification built with the abstraction "Given some finite training data D, the AI predicts the next data point X according to how common it is that X follows D across the multiverse", I think that AI is going to be misaligned (unless it's trained with data that we can't get our hands on, e.g. infinite in-distribution data), because of the standard universal-prior-is-misaligned-reasons. I think this holds true even if we're trying to predict humans like in IDA. Thus, this definition of "optimal performance" doesn't seem useful at all.
For AI-specification built with the abstraction "Given some finite training data D, the AI predicts the next data point X according to how common it is that X follows D on Earth if we aren't in a simulation", I think it probably is possible to build aligned AIs. Since it also doesn't seem impossible to train AIs to do something like this (ie we haven't just moved the impossibility to the inner alignment part of the problem), it seems like a pretty good definition of "optimal performance".

Surprisingly, I think it's even possible to build AIs that do assign some probability to being in a simulation out of this. E.g. we could train the AI via imitation learning to imitate me (Lukas). I assign a decent probability to being in a simulation, so a perfect Lukas-imitator would also assign a decent probability to being in a simulation. This is true even if the Lukas-imitator is just trying to imitate the real-world Lukas as opposed to the simulated Lukas, because real-world Lukas assigns some probability to being simulated, in his ignorance.

I'm also open to other definitions of "optimal performance". I just don't know any useful ones other than the ones I mention in the post.

[-]Daniel Kokotajlo5y10

Thanks, this is helpful.

--You might be right that an AI which assumes it isn't in a simulation is OK--but I think it's too early to conclude that yet. We should think more about acausal trade before concluding it's something we can safely ignore, even temporarily. There's a good general heuristic of "Don't make your AI assume things which you think might not be true" and I don't think we have enough reason to violate it yet.

--You say

For every AI-specification built with the abstraction "Given some finite training data D, the AI predicts the next data point X according to how common it is that X follows D across the multiverse", I think that AI is going to be misaligned (unless it's trained with data that we can't get our hands on, e.g. infinite in-distribution data), because of the standard universal-prior-is-misaligned-reasons. I think this holds true even if we're trying to predict humans like in IDA. Thus, this definition of "optimal performance" doesn't seem useful at all.

Isn't that exactly the point of the universal prior is misaligned argument? The whole point of the argument is that this abstraction/specification (and related ones) is dangerous. So... I guess your title made it sound like you were teaching us something new about prediction (as in, prediction can be outer aligned at optimum) when really you are just arguing that we should change the definition of outer-aligned-at-optimum, and your argument is that the current definition makes outer alignment too hard to achieve? If this is a fair summary of what you are doing, then I retract my objections I guess, and reflect more.

[-]Lukas Finnveden5y*20

Isn't that exactly the point of the universal prior is misaligned argument? The whole point of the argument is that this abstraction/specification (and related ones) is dangerous.

Yup.

I guess your title made it sound like you were teaching us something new about prediction (as in, prediction can be outer aligned at optimum) when really you are just arguing that we should change the definition of outer-aligned-at-optimum, and your argument is that the current definition makes outer alignment too hard to achieve

I mean, it's true that I'm mostly just trying to clarify terminology. But I'm not necessarily trying to propose a new definition – I'm saying that the existing definition already implies that malign priors are an inner alignment problem, rather than than an issue with outer alignment. Evan's footnote requires the model to perform optimally on everything it actually encounters in the real world (rather than asking it to do as well as it can across the multiverse, given its training data); so that definition doesn't have a problem with malign priors. And as Richard notes here, common usage of "inner alignment" refers to any case where the model performs well on the training data but is misaligned during deployment, which definitely includes problems with malign priors. And per Rohin's comment on this post, apparently he already agrees that malign priors are an inner alignment problem.

Basically, the main point of the post is just that the 11 proposals post is wrong about mentioning malign priors as a problem with outer alignment. And then I attached 3 sections of musings that came up when trying to write that :)

[-]Daniel Kokotajlo5y10

Well, at this point I feel foolish for arguing about semantics. I appreciate your post, and don't have a problem with saying that the malignity problem is an inner alignment problem. (That is zero evidence that it isn't also an outer alignment problem though!)

Evan's footnote-definition doesn't rule out malign priors unless we assume that the real world isn't a simulation. We may have good pragmatic reasons to act as if it isn't, but I still think you are changing the definition of outer alignment if you think it assumes we aren't in a simulation. But *shrug* if that's what people want to do, then that's fine I guess, and I'll change my usage to conform with the majority.

[-]Lukas Finnveden5y20

Cool, seems reasonable. Here are some minor responses: (perhaps unwisely, given that we're in a semantics labyrinth)

Evan's footnote-definition doesn't rule out malign priors unless we assume that the real world isn't a simulation

Idk, if the real world is a simulation made by malign simulators, I wouldn't say that an AI accurately predicting the world is falling prey to malign priors. I would probably want my AI to accurately predict the world I'm in even if it's simulated. The simulators control everything that happens anyway, so if they want our AIs to behave in some particular way, they can always just make them do that no matter what we do.

you are changing the definition of outer alignment if you think it assumes we aren't in a simulation

Fwiw, I think this is true for a definition that always assumes that we're outside a simulation, but I think it's in line with previous definitions to say that the AI should think we're not in a simulation iff we're not in a simulation. That's just stipulating unrealistically competetent prediction. Another way to look at it is that in the limit of infinite in-distribution data, an AI may well never be able to tell whether we're in the real world or in a simulation that's identical to the real world; but they would be able to tell whether we're in a simulation with simulators who actually intervene, because it would see them intervening somewhere in its infinite dataset. And that's the type of simulators that we care about. So definitions of outer alignment that appeal to infinite data automatically assumes that AIs would be able to tell the difference between worlds that are functionally like the real world, and worlds with intervening simulators.

And then, yeah, in practice I agree we won't be able to learn whether we're in a simulation or not, because we can't guarantee in-distribution data. So this is largely semantics. But I do think definitions like this end up being practically useful, because convincing the agent that it's not individually being simulated is already an inner alignment issue, for malign-prior-reasons, and this is very similar.

[-]Rohin Shah5y60

Strongly agree that the universal prior being malign is an inner alignment concern. I think there's actually a simpler argument: Solomonoff induction is a learning process, whereas the definition is about the loss at optimum. We could operationalize that as the limit of infinite data in this case. In the limit of infinite data, Solomonoff induction is not malign (in the way that is usually meant), independent of the base universal Turing machine that you are using. The reason we're worried is that after seeing some finite amount of data, malign consequentialists might "take over" in a way that makes the infinite limit irrelevant.

That said, I think if you use the "outer aligned at optimum" you still expect that STEM AI and Microscope AI are misaligned, for a different reason Evan mentioned:

it seems likely that—in the limit—the best STEM AIs would be malign in terms of having convergent instrumental goals which cause them to be at odds with humans.

That is, if you write down a loss function like "do the best possible science", then the literal optimal AI would take over the world and get a lot of compute and robots and experimental labs to do the best science it can do.

Note that in general I don't really like the "outer aligned at optimum" definition, because as you note its not clear how you define optimal performance (I especially think the problem of "what distribution are we considering" is a hard one), though I don't think that matters very much for the points discussed here.

[-]Lukas Finnveden5y30

That is, if you write down a loss function like "do the best possible science", then the literal optimal AI would take over the world and get a lot of compute and robots and experimental labs to do the best science it can do.

I think this would be true for some way to train a STEM AI with some loss functions (especially if it's RL-like, can interact with the real world, etc) but I think that there are some setups where this isn't the case (e.g. things that look more like alphafold). Specifically, I think there exists some setups and some parsimonious definition of "optimal performance" such that optimal performance is aligned: and I claim that's the more useful definition.

To be more concrete, do you think that an image classifier (trained with supervised learning) would have convergent instrumental goals that goes against human interests? For image classifiers, I think there's a natural definition of "optimal performance" that corresponds to always predicting the true label via the normal output channel; and absent inner alignment concerns, I don't think a neural network trained on infinite data with SGD would ever learn anything less aligned than that. If so, it seems like best definition of "at optimum" is the definition that says that the classifier is outer aligned at optimum.

[-]Rohin Shah5y30

Roughly speaking, you can imagine two ways to get safety:

Design the output channels so that unsafe actions / plans do not exist
Design the AI system so that even though unsafe actions / plans do exist, the AI system doesn't take them.

I would rephrase your argument as "there are some types of STEM AI that are safe because of 1, it seems that given some reasonable loss function those AI systems should be said to be outer aligned at optimum". This is also the argument that applies to image classifiers.

----

In the case where point 1 is literally true, I just wouldn't even talk about whether the system is "aligned"; if it doesn't have the possibility of an unsafe action, then whether it is "aligned" feels meaningless to me. (You can of course still say that it is "safe".)

Note that in any such situation, there is no inner alignment worry. Even if the model is completely deceptive and wants to kill as many people as possible, by hypothesis we said that unsafe actions / plans do not exist, and the model can't ever succeed at killing people.

----

A counterargument could be "okay, sure, some unsafe action / plan exists by which the AI takes over the world, but that happens only via side channels, not via the expected output channel".

I note that in this case, if you include all the channels available to the AI system, then the system is not outer aligned at optimum, because the optimal thing to do is to take over the world and then always feed in inputs to which the outputs are perfectly known leading to zero loss.

Presumably what you'd want instead is to say something like "given a model in which the only output channel available to the AI system is ___, the optimal policy that only gets to act through that channel is aligned". But this is basically saying that in the abstract model you've chosen, (1) applies; and again I feel like saying that this system is "aligned" is somehow missing the point of what "aligned" is supposed to mean.

As a concrete example, let's take your image classifier example. 1. If we change the loss function so that dogs are labeled as cats and vice versa, is it still outer aligned at optimum (assuming the original was)? 2. What if it labeled humans as gorillas?

If you said yes to both, it's still outer aligned at optimum, then hopefully you can see why the concept feels meaningless to me in this situation.

If you said no to both, these examples are no longer outer aligned at optimum, then I claim that the original loss function is also not outer aligned at optimum, because we could improve the categories used in the loss function (and it seems you agree that if the categories are worse then it is not outer aligned at optimum).

If you said yes to the first and no to the second, or yes to the second and no to the first, I have no idea what you mean by "outer aligned at optimum".

----

Separately, even when you limit to a specific action space like classifying images, I could imagine that a literally optimal policy would still be able to take over the world given that action space (think of a policy that can predict and use the butterfly effect of classifying images), so I still don't feel like it's outer aligned at optimum. (Although perhaps this still doesn't perform as well as the policy that magically knows all the answers and so can perfectly predict (what we label as) the class of any image.)

But this is not my real objection; my real objection is what I discussed above (that the concept "alignment" should not be tracking whether there does or does not exist an unsafe action in the AI's action space).

[-]Lukas Finnveden5y50

Oops, I actually wasn't trying to discuss whether the action-space was wide enough to take over the world. Turns out concrete examples can be ambiguous too. I was trying to highlight whether the loss function and training method incentivised taking over the world or not.

Instead of an image-classifier, lets take GPT-3, which has a wide enough action-space to take over the world. Lets assume that:

1. GPT-3 is currently being tested on on a validation set which have some correct answers. (I'm fine with "optimal performance" either requiring that GPT-3 magically returns these correct answers; or requiring that it returns some distribution along the lines that I defined in my post.)

2. The researchers will read what GPT-3 outputs, and there exists some string that causes them to go mad and give control over the world to GPT-3.

In this case, if we define optimal performance as "correctly predicting as many words as possible" or "achieve minimum total loss over the entire history of the world", I agree that optimal performance would plausibly involve taking over the world to feed itself the right questions. However, I think a better definition is "Correctly predicts every word it's asked to predict", because I think this tracks what (un)supervised learning actually selects for (absent inner alignment concerns).

To make that last point more clear; I'm claiming that in a supervised learning scenario where predicting label X always gives the neural network a small punishment, but then causes the training procedure to give the model 10 easy questions, the model will learn to not choose label X. (At least in the long run. In the short run, I could see the model being a little bit more likely to predict label X than expected, because easy questions could give the model a less steep gradient, so a neural network that predicts X would stick around a bit longer than it would have if X hadn't yielded easy questions. However, every time X is chosen, the nn is directionally pushed away from choosing it again; so in the infinite data limit, I think it would learn to not do that.) Conversely, if predicting label Y is correct on some inputs, but predicting Y immeidately lead to 10 hard questions, I don't think the model would learn to avoid label Y (though I'm unsure if the learning process would converge to choosing Y or just be unstable and never converge). I'm actually very curious if you agree with this; it seems like an important question.

(Conversely, I think that a robot RL-agent would be incentivised to control its future challenges, because of how RL loss functions are shaped. A definition of RL optimal behavior should naturally take this into account.)

Now, by defining optimal behavior as "Correctly predicts every word it's asked to predict", I'm saying that the model is outer aligned at optimum iff the model cannot simultaneously answer every question optimally and adversarially optimise against humans. And if we only consider the normal output channel, I'm saying it couldn't, because it would have to make incorrect predictions in order to broadcast its message; in which case it wouldn't be optimal.

If we also consider side-channels, this gets messier, because my chosen definition doesn't imply any constraints on what the AI could do with side channels. Thus, if it could use side channels to take over the world in any way, my definition would say it wasn't outer aligned. My preferred solution to this is just to intentionally ignore side channels when defining outer alignment at optimum (as you correctly inferred). Three reasons for this:

As mentioned above, taking side channels into account would mean that any model with powerful side channels is classified as outer misaligned, even if there's no incentive to use these side channels in any particular way.
Separately, I suspect that supervised learning normally doesn't incentivise neural networks to use side channels in any particular way (absend inner alignment concerns).
Finally, It just seems kind of useful to talk about the outer alignment properties of abstract agent-models, since not all abstract agent-models are outer aligned. Side-constraints can be handled separately.

(Btw I'd say the bad image classifier is aligned but less performance-competitive than a good image classifier, though I'm sympathetic to the view that it doesn't make sense to talk about its alignment properties at all.)

[-]Rohin Shah5y50

Oops, I actually wasn't trying to discuss whether the action-space was wide enough to take over the world.

Ah, in hindsight your comment makes more sense.

I'm actually very curious if you agree with this; it seems like an important question.

Argh, I don't know, you're positing a setup that breaks the standard ML assumptions and so things get weird. If you have vanilla SGD, I think I agree, but I wouldn't be surprised if that's totally wrong.

There are definitely setups where I don't agree, e.g. if you have an outer hyperparameter tuning loop around the SGD, then I think you can get the opposite behavior than what you're claiming (I think this paper shows this in more detail, though it's been edited significantly since I read it). That would still depend on how often you do the hyperparameter tuning, what hyperparameters you're allowed to tune, etc.

----

On the rest of the comment: I feel like the argument you're making is "when the loss function is myopic, the optimal policy ignores long-term consequences and is therefore safe". I do feel better about this calling this "aligned at optimum", if the loss function also incentivizes the AI system to do that which we designed the AI system for. It still feels like the lack of convergent instrumental subgoals is "just because of" the myopia, and that this strategy won't work more generally.

----

Returning to the original claim:

Specifically, I think there exists some setups and some parsimonious definition of "optimal performance" [for STEM AI] such that optimal performance is aligned: and I claim that's the more useful definition.

I do agree that these setups probably exist, perhaps using the myopia trick in conjunction with the simulated world trick. (I don't think myopia by itself is enough; to have STEM AI enable a pivotal act you presumably need to give the AI system a non-trivial amount of "thinking time".) I think you will still have a pretty rough time trying to define "optimal performance" in a way that doesn't depend on a lot of details of the setup, but at least conceptually I see what you mean.

I'm not as convinced that these sorts of setups are really feasible -- they seem to sacrifice a lot of benefits -- but I'm pretty unconfident here.

[-]abramdemski4y50

I think this is pretty complicated, and stretches the meaning of several of the critical terms employed in important ways. I think what you said is reasonable given the limitations of the terminology, but ultimately, may be subtly misleading.

How I would currently put it (which I think strays further from the standard terminology than your analysis):

Take 1

Prediction is not a well-defined optimization problem.

Maximum-a-posteriori reasoning (with a given prior) is a well-defined optimization problem, and we can ask whether it's outer-aligned. The answer may be "no, because the Solomonoff prior contains malign stuff".

Variational bayes (with a given prior and variational loss) is similarly well-defined. We can similarly ask whether it's outer-aligned.

Minimizing square loss with a regularizing penalty is well-defined. Etc. Etc. Etc.

But "prediction" is not a clearly specified optimization target. Even if you fix the predictive loss (square loss, Bayes loss, etc) you need to specify a prior in order to get a well-defined expectation to minimize.

So the really well-defined question is whether specific predictive optimization targets are outer-aligned at optimum. And this type of outer-alignment seems to require the target to discourage mesa-optimizers!

This is a problem for the existing terminology, since it means these objectives are not outer-aligned unless they are also inner-aligned.

Take 2

OK, but maybe you object. I'm assuming that "optimization" means "optimization of a well-defined function which we can completely evaluate". But (you might say), we can also optimize under uncertainty. We do this all the time. In your post, you frame "optimal performance" in terms of loss+distribution. Machine learning treats the data as a sample from the true distribution, and uses this as a proxy, but adds regularizers precisely because it's an imperfect proxy (but the regularizers are still just a proxy).

So, in this frame, we think of the true target function as the average loss on the true distribution (ie the distribution which will be encountered in the wild), and we think of gradient descent (and other optimization methods used inside modern ML) as optimizing a proxy (which is totally normal for optimization under uncertainty).

With this frame, I think the situation gets pretty complicated.

Take 2.1

Sure, ok, if it's just actually predicting the actual stuff, this seems pretty outer-aligned. Pedantic note: the term "alignment" is weird here. It's not "perfectly aligned" in the sense of perfectly forwarding human values. But it could be non-malign, which I think is what people mostly mean by "AI alignment" when they're being careful about meaning.

Take 2.2

But this whole frame is saying that once we have outer alignment, the problem that's left is the problem of correctly predicting the future. We have to optimize under uncertainty because we can't predict the future. An outer-aligned loss function can nonetheless yield catastrophic results because of distributional shift. The Solomonoff prior is malign because it doesn't represent the future with enough accuracy, instead containing some really weird stuff.

So, with this terminology, the inner alignment problem is the prediction problem. If we can predict well enough, then we can set up a proxy which gets us inner alignment (by heavily penalizing malign mesa-optimizers for their future treacherous turns). Otherwise, we're stuck with the inner alignment problem.

So given this use of terminology, "prediction is outer-aligned" is a pretty weird statement. Technically true, but prediction is the whole inner alignment problem.

Take 2.3

But wait, let's reconsider 2.1.

In this frame, "optimal performance" means optimal at deployment time. This means we get all the strange incentives that come from online learning. We aren't actually doing online learning, but optimal performance would respond to those incentives anyway.

(You somewhat circumvent this in your "extending the training distribution" section when you suggest proxies such as the Solomonoff distribution rather than using the actual future to define optimality. But this can reintroduce the same problem and more besides. Option #1, Solomonoff, is probably accurate enough to re-introduce the problems with self-fulfilling prophecies, besides being malign in other ways. Option #3, using a physical quantum prior, requires a solution to quantum gravity, and also is probably accurate enough to re-introduce the same problems with self-fulfilling prophecies as well. The only option I consider feasible is #2, human priors. Because humans could notice this whole problem and refuse to be part of a weird loop of self-fulfilling prediction.)

One way to frame this is with Pearl’s do-calculus. Say that the input is a random variable X and the output is a random variable Y. By analogy with Pearl’s do-calculus, we could then define optimal human-labelled performance as learning the distribution p(y=Y | do(x=X)), whereas unsupervised learning is trying to learn the entire distribution p(X,Y) in order to answer p(y=Y | x=X). For GPT-3, learning p(y=Y | do(x=X)) would correspond to guessing what a random human would say if they learned that they’d just typed “Hi I’m Lukas and…”; which would be very strange. ↩︎
One option is to partition the world into “macrostates” (e.g. a specification of where all humans are, what they’re saying, what the weather is, etc) and “microstates” (a complete specification of the location and state of all elementary particles), where each macrostate is consistent with lots of microstates. Then, we can specify a year; and assume that we know the macrostate of the world at the beginning of the year, but are uncertain about the microstate. If we then wait long enough, the uncertainty in microstate would eventually induce variation in macrostates; which we could use to define a distribution over data. I think this would probably yield the same results as the quantum definition; but the distinction between macrostates and microstates is a lot more vague than our understanding of quantum mechanics. ↩︎
This is the reason I wrote that we should exclude each world that “in the year 2020 don’t contain a model with GPT-3’s architecture that was trained on GPT-3’s training data”. Without the caveat about 2020, we would accidentally include worlds where humanity’s descendants decide to simulate their ancestors. ↩︎

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

10

Prediction can be Outer Aligned at Optimum

10

Take 1

Take 2

Take 2.1

Take 2.2

Take 2.3

Short argument

Online learning may be misaligned

How to define optimal performance?

How is the correct answer labelled?

Extending the training distribution

A note about simulations

Notes