## AI ALIGNMENT FORUM

Richard Ngo

Former AI safety research engineer, now PhD student in philosophy of ML at Cambridge. I'm originally from New Zealand but have lived in the UK for 6 years, where I did my undergrad and master's degrees (in Computer Science, Philosophy, and Machine Learning). Blog: thinkingcomplete.blogspot.com

# Sequences

Shaping safer goals
AGI safety from first principles

The Counterfactual Prisoner's Dilemma

Someone might say, well, I understand that if I don't pay, then it means I would have lost out if it had come up heads, but since I know it didn't come up heads, I don't care. Making this more precise, when constructing counterfactuals for a decision, if we know fact F about the world before we've made our decision, F must be true in every counterfactual we construct (call this Principle F).

The problem is that Principle F elides the difference between facts which are logically caused by your decision, and facts which aren't. For example, in Parfit's hitchhiker, my decision not to pay after being picked up logically causes me not to be picked up. The result of that decision would be a counterpossible world: a world in which the same decision algorithm outputs one thing at one point, and a different thing at another point. But in counterfactual mugging, if you choose not to pay, then this doesn't result in a counterpossible world.

I think we should construct counterfactuals where the agent's TAILS policy is independent of its HEADS policy, whilst you think we should construct counterfactuals where they are linked.

The whole point of functional decision theory is that it's very unlikely for these two policies to differ. For example, consider the Twin Prisoner's Dilemma, but where the walls of one room are green, and the walls of the other are blue. This shouldn't make any difference to the outcome: we should still expect both agents to cooperate, or both agents to defect. But the same is true for heads vs tails in Counterfactual Prisoner's Dilemma - they're specific details which distinguish you from your counterfactual self, but don't actually influence any decisions.

The Counterfactual Prisoner's Dilemma

by only considering the branches of reality that are consistent with our knowledge

I know that, in the branch of reality which actually happened, Omega predicted my counterfactual behaviour. I know that my current behaviour is heavily correlated with my counterfactual behaviour. So I know that I can logically cause Omega to give me $10,000. This seems exactly equivalent to Newcomb's problem, where I can also logically cause Omega to give me a lot of money.

So if by "considering [other branches of reality]" you mean "taking predicted counterfactuals into account when reasoning about logical causation", then Counterfactual Prisoner's Dilemma doesn't give us anything new.

If by "considering [other branches of reality]" you instead mean "acting to benefit my counterfactual self", then I deny that this is what is happening in CPD. You're acting to benefit your current self, via logical causation, just like in the Twin Prisoner's Dilemma. You don't need to care about your counterfactual self at all. So it's disanalogous to Counterfactual Mugging, where the only reason to pay is to help your counterfactual self.

The Counterfactual Prisoner's Dilemma

I don't see why the Counterfactual Prisoner's Dilemma persuades you to pay in the Counterfactual Mugging case. In the counterfactual prisoner's dilemma, I pay because that action logically causes Omega to give me $10,000 in the real world (via influencing the counterfactual). This doesn't require shifting the locus of evaluation to policies, as long as we have a good theory of which actions are correlated with which other actions (e.g. paying in heads-world and paying in tails-world).

In the counterfactual mugging, by contrast, the whole point is that paying doesn't cause any positive effects in the real world. So it seems perfectly consistent to pay in the counterfactual prisoner's dilemma, but not in the counterfactual mugging.
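As a rough illustration of the contrast drawn above, here is a toy payoff model. It is only a sketch under stated assumptions: the $10,000 reward comes from the discussion, while the $100 payment, the function names, and the framing of "linked" versus "independent" policies as simple boolean arguments are hypothetical choices made for illustration.

```python
# Toy payoff model. REWARD is the $10,000 mentioned in the discussion;
# COST ($100) and every name below are illustrative assumptions.
COST = 100
REWARD = 10_000


def cpd_payoff(pays_heads: bool, pays_tails: bool, coin: str) -> int:
    """Counterfactual Prisoner's Dilemma.

    You are asked to pay whichever side comes up, and you are rewarded
    iff Omega predicts you would have paid on the other side.
    """
    pays_here = pays_heads if coin == "heads" else pays_tails
    pays_other = pays_tails if coin == "heads" else pays_heads
    return (REWARD if pays_other else 0) - (COST if pays_here else 0)


def mugging_payoff(pays_tails: bool, coin: str) -> int:
    """Counterfactual Mugging.

    You are asked to pay only on tails, and rewarded only on heads,
    iff Omega predicts you would have paid on tails.
    """
    if coin == "tails":
        return -COST if pays_tails else 0
    return REWARD if pays_tails else 0


# The coin actually came up tails in all three comparisons below.

# CPD with linked policies: paying on tails goes with paying on heads.
cpd_linked_gain = cpd_payoff(True, True, "tails") - cpd_payoff(False, False, "tails")

# CPD with independent policies: the heads policy is held fixed at "pay".
cpd_independent_gain = cpd_payoff(True, True, "tails") - cpd_payoff(True, False, "tails")

# Mugging: paying on tails changes nothing else in the tails world.
mugging_gain = mugging_payoff(True, "tails") - mugging_payoff(False, "tails")

print(cpd_linked_gain, cpd_independent_gain, mugging_gain)
```

Under these assumed numbers, paying is a net gain in the actual world only in the CPD case with linked policies. With independent policies, or in the mugging, paying in the tails world just costs the payment, which is the asymmetry the two comments above rely on.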

Coherence arguments imply a force for goal-directed behavior

Thanks for writing this post, Katja; I'm very glad to see more engagement with these arguments. However, I don't think the post addresses my main concern about the original coherence arguments for goal-directedness, which I'd frame as follows:

There's some intuitive conception of goal-directedness, which is worrying in the context of AI. The old coherence arguments implicitly used the concept of EU-maximisation as a way of understanding goal-directedness. But Rohin demonstrated that the most straightforward conception of EU-maximisation (which I'll call behavioural EU-maximisation) is inadequate as a theory of goal-directedness, because it applies to any agent. In order to fix this problem, the main missing link is not a stronger (probabilistic) argument for why AGIs will be coherent EU-maximisers, but rather an explanation of what it even means for a real-world agent to be a coherent EU-maximiser, which we don't currently have.

By "behavioural EU-maximisation", I mean thinking of a utility function as something that we define purely in terms of an agent's behaviour. In response to this, you identify an alternative definition of expected utility maximisation which isn't purely behavioural, but also refers to an agent's internal features:

An outside observer being able to rationalize a sequence of observed behavior as coherent doesn’t mean that the behavior is actually coherent. Coherence arguments constrain combinations of external behavior and internal features—‘preferences’ and beliefs. So whether an actor is coherent depends on what preferences and beliefs it actually has.

But you don't characterise those internal features in a satisfactory way, or point to anyone else who does. The closest you get is in your footnote, where you fall back on a behavioural definition of preferences:

When exactly an aspect of these should be considered a ‘preference’ for the sake of this argument isn’t entirely clear to me, but would seem to depend on something like whether it tends to produce actions favoring certain outcomes over other outcomes across a range of circumstances

I'm sympathetic to this, because it's hard to define preferences without reference to behaviour. We just don't know enough about cognitive science yet to do so. But it means that your conception of EU-maximisation is still vulnerable to Rohin's criticisms of behavioural EU-maximisation, because you still have to extract preferences from behaviour.
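To make the vacuity worry concrete, here is a toy sketch of the kind of construction that criticism relies on, as I understand it: given any policy whatsoever, define a utility function that scores 1 on exactly the trajectory the policy produces and 0 elsewhere. The two-action deterministic setting, the horizon, and all names are assumptions made for illustration, and with no randomness "expected utility" reduces to plain utility.

```python
from itertools import product

# Hypothetical toy setting: two actions, three time steps, no randomness.
ACTIONS = ("a", "b")
HORIZON = 3


def rationalising_utility(policy):
    """Build a utility over action sequences that is 1 exactly when every
    action matches what `policy` would choose given the preceding actions."""
    def utility(trajectory):
        matches = all(
            policy(trajectory[:t]) == action
            for t, action in enumerate(trajectory)
        )
        return 1 if matches else 0
    return utility


def arbitrary_policy(history):
    # Deliberately aimless-looking behaviour: alternate a, b, a, ...
    return ACTIONS[len(history) % 2]


def rollout(policy, horizon):
    """Run the policy from an empty history and return its trajectory."""
    history = ()
    for _ in range(horizon):
        history += (policy(history),)
    return history


utility = rationalising_utility(arbitrary_policy)
trajectories = list(product(ACTIONS, repeat=HORIZON))

# The utility-maximising trajectory is exactly what the policy does anyway.
best = max(trajectories, key=utility)
print(best == rollout(arbitrary_policy, HORIZON))
```

Swapping in any other deterministic policy gives the same result, which is why a purely behavioural utility function places no constraint on what the agent does.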

From my perspective, then, claims like "Anything that weakly has goals has reason to reform to become an EU maximizer" (as made in this comment) miss the crux of the disagreement. It's not that I believe the claim is false; I just don't know what it means, and I don't think anyone else does either. Unfortunately the fact that there are theorems about EU maximisation in some restricted formalisms makes people think that it's a concept which is well-defined in real-world agents to a much greater extent than it actually is.

Here's an exaggerated analogy to help convey what I mean by "well-defined concept". Characters in games often have an attribute called health points (HP), and die when their health points drop to 0. Conceivably you could prove a bunch of theorems about health points in a certain class of games, e.g. that having more is always good. Okay, so is having more health points always good for real-world humans (or AIs)? I mean, we must have something like the health point formalism used in games, because if we take too much damage, we die! Sure, some critics say that defining health points in terms of external behaviour (like dying) is vacuous - but health points aren't just about behaviour, we can also define them in terms of an agent's internal features (like the tendency to die in a range of circumstances).

I would say that EU is like "health points": a concept which is interesting to reason about in some formalisms, and which is clearly related to an important real-world concept, but whose relationship to that non-formal real-world concept we don't yet understand well. Perhaps continued investigation can fix this; I certainly hope so! But in the meantime, using "EU-maximisation" instead of "goal-directedness" feels similar to using "health points" as a substitute for "health" - its main effect is to obscure our conceptual confusion under a misleading layer of formalism, thereby making the associated arguments seem stronger than they actually are.

Against evolution as an analogy for how humans will create AGI

I personally found this post valuable and thought-provoking. Sure, there's plenty that it doesn't cover, but it's already pretty long, so that seems perfectly reasonable.

I particularly dislike your criticism of it as strawmanish. Perhaps that would be fair if the analogy between RL and evolution were a standard principle in ML. Instead, it's a vague idea that is often left implicit, or else formulated in idiosyncratic ways. So posts like this one have to do double duty in both outlining and explaining the mainstream viewpoint (often a major task in its own right!) and then criticising it. This is most important precisely in the cases where the defenders of an implicit paradigm don't have solid articulations of it, making it particularly difficult to understand what they're actually defending. I think this is such a case.

If you disagree, I'd be curious what you consider a non-strawmanish summary of the RL-evolution analogy. Perhaps Clune's AI-GA paper? But from what I can tell opinions of it are rather mixed, and the AI-GA terminology hasn't caught on.

Against evolution as an analogy for how humans will create AGI

there’s a “solving the problem twice” issue. As mentioned above, in Case 5 we need both the outer and the inner algorithm to be able to do open-ended construction of an ever-better understanding of the world—i.e., we need to solve the core problem of AGI twice with two totally different algorithms! (The first is a human-programmed learning algorithm, perhaps SGD, while the second is an incomprehensible-to-humans learning algorithm. The first stores information in weights, while the second stores information in activations, assuming a GPT-like architecture.)

Cross-posting a (slightly updated) comment I left on a draft of this document:

I suspect that this is indexed too closely to what current neural networks look like. I see no good reason why the inner algorithm won't eventually be able to change the weights as well, as in human brains. (In fact, this might be a crux for me - I agree that the inner algorithm having no ability to edit the weights seems far-fetched).

So then you might say that we've introduced a disanalogy to evolution, because humans can't edit our genome.

But the key reason I think that RL is roughly analogous to evolution is because it shapes the high-level internal structure of a neural network in roughly the same way that evolution shapes the high-level internal structure of the human brain, not because there's a totally strict distinction between levels.

E.g. the thing RL currently does, which I don't expect the inner algorithm to be able to do, is make the first three layers of the network vision layers, and then a big region over on the other side the language submodule, and so on. And eventually I expect RL to shape the way the inner algorithm does weight updates, via meta-learning.

You seem to expect that humans will be responsible for this sort of high-level design. I can see the case for that, and maybe humans will put in some modular structure, but the trend has been pushing the other way. And even if humans encode a few big modules (analogous to, say, the distinction between the neocortex and the subcortex), I expect there to be much more complexity in how those actually work which is determined by the outer algorithm (analogous to the hundreds of regions which appear across most human brains).

Against evolution as an analogy for how humans will create AGI

It seems totally plausible to give AI systems an external memory that they can read from / write to, and then you learn linear algebra without editing weights but with editing memory. Alternatively, you could have a recurrent neural net with a really big hidden state, and then that hidden state could be the equivalent of what you're calling "synapses".

I agree with Steve that it seems really weird to have these two parallel systems of knowledge encoding the same types of things. If an AGI learned the skill of speaking English during training, but then learned the skill of speaking French during deployment, then your hypotheses imply that the implementations of those two language skills will be totally different. And it then gets weirder if they overlap - e.g. if an AGI learns a fact during training which gets stored in its weights, and then reads a correction later on during deployment, do those original weights just stay there?

I do expect that we will continue to update AGI systems via editing weights in training loops, even after deployment. But this will be more like an iterative train-deploy-train-deploy cycle where each deploy step lasts e.g. days or more, rather than editing weights all the time (as with humans).

Based on this I guess your answer to my question above is "no": the original fact will get overridden a few days later, and also the knowledge of French will be transferred into the weights eventually. But if those updates occur via self-supervised learning, then I'd count that as "autonomously edit[ing] its weights after training". And with self-supervised learning, you don't need to wait long for feedback, so why wouldn't you use it to edit weights all the time? At the very least, that would free up space in the short-term memory/hidden state.

For my own part I'm happy to concede that AGIs will need some way of editing their weights during deployment. The big question for me is how continuous this is with the rest of the training process. E.g. do you just keep doing SGD, but with a smaller learning rate? Or will there be a different (meta-learned) weight update mechanism? My money's on the latter. If it's the former, then that would update me a bit towards Steve's view, but I think I'd still expect evolution to be a good analogy for the earlier phases of SGD.

Maybe we just won't have AGI that learns by reading books, and instead it will be more useful to have a lot of task-specific AI systems with a huge amount of "built-in" knowledge, similarly to GPT-3.

If this is the case, then that would shift me away from thinking of evolution as a good analogy for AGI, because the training process would then look more like the type of skill acquisition that happens during human lifetimes. In fact, this seems like the most likely way in which Steve is right that evolution is a bad analogy.

The case for aligning narrowly superhuman models

Nice post. The one thing I'm confused about is:

Institutionally, we are very uncertain whether to prioritize this (and if we do where it should be housed and how our giving should be structured).

It seems to me that the type of research you're discussing here is already seen as a standard way to make progress on the full alignment problem - e.g. the Stiennon et al. paper you cited, plus earlier work on reward modeling by Christiano, Leike, and others. Can you explain why you're institutionally uncertain whether to prioritise it - is it because of the objections you outlined? But your responses to them seem persuasive to me - and more generally, the objections don't seem to address the fact that a bunch of people who are trying to solve long-term alignment problems actually ended up doing this research. So I'd be interested to hear elaborations and defences of those objections from people who find them compelling.

Book review: "A Thousand Brains" by Jeff Hawkins

Great post, and I'm glad to see the argument outlined in this way. One big disagreement, though:

the Judge box will house a relatively simple algorithm written by humans

I expect that, in this scenario, the Judge box would house a neural network which is still pretty complicated, but which has been trained primarily to recognise patterns, and therefore doesn't need "motivations" of its own.

This doesn't rebut all your arguments for risk, but it does reframe them somewhat. I'd be curious to hear about how likely you think my version of the judge is, and why.

The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables

Thanks for the reply. To check that I understand your position, would you agree that solving outer alignment plus solving reward tampering would solve the pointers problem in the context of machine learning?

Broadly speaking, I think our disagreement here is closely related to one we've discussed before, about how much sense it makes to talk about outer alignment in isolation (and also about your definition of inner alignment), so I probably won't pursue this further.