Richard Ngo

Former AI safety research engineer, now AI governance researcher at OpenAI. Blog:


Shaping safer goals
AGI safety from first principles



Copying over a response I wrote on Twitter to Emmett Shear, who argued that "it's just a bad way to solve the problem. An ever more powerful and sophisticated enemy? ... If the process continues you just lose eventually".

I think there are (at least) two strong reasons to like this approach:

1. It’s complementary with alignment.

2. It’s iterative and incremental. The frame where you need to just “solve” alignment is often counterproductive. When thinking about control you can focus on gradually ramping up from setups that would control human-level AGIs, to setups that would control slightly superhuman AGIs, to…

As one example of this: as you get increasingly powerful AGI you can use it to identify more and more vulnerabilities in your code. Eventually you’ll get a system that can write provably secure code. Ofc that’s still not a perfect guarantee, but if it happens before the level at which AGI gets really dangerous, that would be super helpful.

This is related to a more general criticism I have of the P(doom) framing: that it’s hard to optimize because it’s a nonlocal criterion. The effects of your actions will depend on how everyone responds to them, how they affect the deployment of the next generation of AIs, etc. An alternative framing I’ve been thinking about: the g(doom) framing. That is, as individuals we should each be trying to raise the general intelligence threshold at which bad things happen.

This is much more tractable to optimize! If I make my servers 10% more secure, then maybe an AGI needs to be 1% more intelligent in order to escape. If I make my alignment techniques 10% better, then maybe the AGI becomes misaligned 1% later in the training process.

You might say: “well, what happens after that”? But my point is that, as individuals, it’s counterproductive to each try to solve the whole problem ourselves. We need to make contributions that add up (across thousands of people) to decreasing P(doom), and I think approaches like AI control significantly increase g(doom) (the level of general intelligence at which you get doom), thereby buying more time for automated alignment, governance efforts, etc.

People sometimes try to reason about the likelihood of deceptive alignment by appealing to speed priors and simplicity priors. I don't like such appeals, because I think that the differences between aligned and deceptive AGIs will likely be a very small proportion of the total space/time complexity of an AGI. More specifically:

1. If AGIs had to rederive deceptive alignment in every episode, that would make a big speed difference. But presumably, after thinking about it a few times during training, they will remember their conclusions for a while, and bring them to mind in whichever episodes they're relevant. So the speed cost of deception will be amortized across the (likely very long) training period.

2. AGIs will represent a huge number of beliefs and heuristics which inform their actions (e.g. every single fact they know). A heuristic like "when you see X, initiate the world takeover plan" would therefore constitute a very small proportion of the total information represented in the network; it'd be hard to regularize it away without regularizing away most of the AGI's knowledge.

I think that something like the speed vs simplicity tradeoff is relevant to the likelihood of deceptive alignment, but it needs to be more nuanced. One idea I've been playing around with: the tradeoff between conservatism and systematization (as discussed here). An agent that prioritizes conservatism will tend to do the things they've previously done. An agent that prioritizes systematization will tend to do the things that are favored by simple arguments.

To illustrate: suppose you have an argument in your head like "if I get a chance to take a 60/40 double-or-nothing bet for all my assets, I should". Suppose you've thought about this a bunch and you're intellectually convinced of it. Then you're actually confronted with the situation. Some people will be more conservative, and follow their gut ("I know I said I would, but... this is kinda crazy"). Others (like most utilitarians and rationalists) will be more systematizing ("it makes sense, let's do it"). Intuitively, you could also think of this as a tradeoff between memorization and generalization; or between a more egalitarian decision-making process ("most of my heuristics say no") and a more centralized process ("my intellectual parts say yes"). I don't know how to formalize any of these ideas, but I'd like to try to figure it out.
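
The bet example above can be made concrete with a little arithmetic. This is only a minimal sketch, assuming the bet is repeated all-in several times; the function names and the repeated-bet framing are illustrative assumptions, not part of the original argument. It shows why the "simple argument" and the gut reaction pull apart: mean wealth grows by a factor of 1.2 per bet, while the chance of still having any assets shrinks by a factor of 0.6 per bet.

```python
def expected_wealth(n_bets, p_win=0.6, start=1.0):
    """Mean wealth after n all-in double-or-nothing bets.

    Each bet doubles wealth with probability p_win and zeroes it otherwise,
    so the mean is multiplied by 2 * p_win per bet (1.2 for a 60/40 bet).
    """
    return start * (2 * p_win) ** n_bets


def survival_probability(n_bets, p_win=0.6):
    """Probability of never having lost, i.e. of still holding any assets."""
    return p_win ** n_bets


# Hypothetical illustration: the systematizing view tracks the first column,
# the conservative gut reaction tracks the second.
for n in (1, 5, 10):
    print(n, round(expected_wealth(n), 3), round(survival_probability(n), 4))
```

Under these assumptions the two quantities diverge quickly, which is one way of seeing why a single simple argument and a majority of everyday heuristics can give opposite answers.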

Ty for review. I still think it's better, because it gets closer to concepts that might actually be investigated directly. But happy to agree to disagree here.

Small relevant datapoint: the paper version of this was just accepted to ICLR, making it the first time a high-level "case for misalignment as an x-risk" has been accepted at a major ML conference, to my knowledge. (Though Langosco's goal misgeneralization paper did this a little bit, and was accepted at ICML.)

Can you construct an example where the value over something would change to be simpler/more systemic, but in which the change isn't forced on the agent downstream of some epistemic updates to its model of what it values? Just as a side-effect of it putting the value/the gear into the context of a broader/higher-abstraction model (e.g., the gear's role in the whole mechanism)?

I think some of my examples do this. E.g. you used to value this particular gear (which happens to be the one that moves the piston) rotating, but now you value the gear that moves the piston rotating, and it's fine if the specific gear gets swapped out for a copy. I'm not assuming there's a mistake anywhere, I'm just assuming you switch from caring about one type of property it has (physical) to another (functional).

In general, in the higher-abstraction model each component will acquire new relational/functional properties which may end up being prioritized over the physical properties it had in the lower-abstraction model.

I picture you saying "well, you could just not prioritize them". But in some cases this adds a bunch of complexity. E.g. suppose that you start off by valuing "this particular gear", but you realize that atoms are constantly being removed and new ones added (implausibly, but let's assume it's a self-repairing gear) and so there's no clear line between this gear and some other gear. Whereas, suppose we assume that there is a clear, simple definition of "the gear that moves the piston"—then valuing that could be much simpler.

Zooming out: previously you said

I agree that there are some very interesting and tricky dynamics underlying even very subtle ontology breakdowns. But I think that's a separate topic. I think that, if you have some value v, and it doesn't run into direct conflict with any other values you have, and your model of v isn't wrong at the abstraction level it's defined at, you'll never want to change v.

I'm worried that we're just talking about different things here, because I totally agree with what you're saying. My main claims are twofold. First, insofar as you value simplicity (which I think most agents strongly do) then you're going to systematize your values. And secondly, insofar as you have an incomplete ontology (which every agent does) and you value having well-defined preferences over a wide range of situations, then you're going to systematize your values.

Separately, if you have neither of these things, you might find yourself identifying instrumental strategies that are very abstract (or very concrete). That seems fine, no objections there. If you then cache these instrumental strategies, and forget to update them, then that might look very similar to value systematization or concretization. But it could also look very different—e.g. the cached strategies could be much more complicated to specify than the original values; and they could be defined over a much smaller range of situations. So I think there are two separate things going on here.

(drafted this reply a couple months ago but forgot to send it, sorry)

your low-level model of a specific gear's dynamics didn't change — locally, it was as correct as it could ever be.

And if you had a terminal utility function over that gear (e.g., "I want it to spin at the rate of 1 rotation per minute"), that utility function won't change in the light of your model expanding, either. Why would it?

Let me list some ways in which it could change:

  • Your criteria for what counts as "the same gear" change as you think more about continuity of identity over time. Once the gear starts wearing down, this will affect what you choose to do.
  • After learning about relativity, your concepts of "spinning" and "minutes" change, as you realize they depend on the reference frame of the observer.
  • You might realize that your mental pointer to the gear you care about identified it in terms of its function not its physical position. For example, you might have cared about "the gear that was driving the piston continuing to rotate", but then realize that it's a different gear that's driving the piston than you thought.

These are a little contrived. But so too is the notion of a value that's about such a basic phenomenon as a single gear spinning. In practice almost all human values are (and almost all AI values will be) focused on much more complex entities, where there's much more room for change as your model expands.

Take a given deontological rule, like "killing is bad". Let's say we view it as a constraint on the allowable actions; or, in other words, a probability distribution over your actions that "predicts" that you're very likely/unlikely to take specific actions. Probability distributions of this form could be transformed into utility functions by reverse-softmaxing them; thus, it's perfectly coherent to model a deontologist as an agent with a lot of separate utility functions.

This doesn't actually address the problem of underspecification, it just shuffles it somewhere else. When you have to choose between two bad things, how do you do so? Well, it depends on which probability distributions you've chosen, which have a number of free parameters. And it depends very sensitively on free parameters, because the region where two deontological rules clash is going to be a small proportion of your overall distribution.
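
A minimal sketch of the point about free parameters, assuming "reverse-softmaxing" means taking u_i = T * log(p_i) + c for a temperature T and offset c that the probabilities alone do not pin down. The two rules, their probabilities, and all names below are hypothetical illustrations. When the "neither" action is unavailable, which rule gets violated flips depending on the temperatures chosen.

```python
import math


def softmax(utilities, temperature=1.0):
    """Map utilities to action probabilities (numerically stabilized)."""
    scaled = [u / temperature for u in utilities]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_softmax(probs, temperature=1.0, offset=0.0):
    """Recover utilities u_i = T * log(p_i) + c; T and c are free parameters."""
    return [temperature * math.log(p) + offset for p in probs]


# Hypothetical rules, each expressed as a distribution over three actions.
ACTIONS = ["lie", "break_promise", "neither"]
RULE_NO_LYING = [0.001, 0.4995, 0.4995]
RULE_KEEP_PROMISES = [0.4975, 0.005, 0.4975]


def resolve_dilemma(temp_a, temp_b):
    """Pick the lesser evil when 'neither' is off the table."""
    u_a = reverse_softmax(RULE_NO_LYING, temp_a)
    u_b = reverse_softmax(RULE_KEEP_PROMISES, temp_b)
    total = {a: ua + ub for a, ua, ub in zip(ACTIONS, u_a, u_b)}
    forced = {a: total[a] for a in ("lie", "break_promise")}
    return max(forced, key=forced.get)


print(resolve_dilemma(1.0, 1.0))
print(resolve_dilemma(1.0, 2.0))
```

Both temperature settings reproduce exactly the same per-rule probability distributions under `softmax`, yet they resolve the clash differently, so the distributions by themselves leave the hard cases underspecified.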

In the standard story, what are the terminal goals? You could say "random" or "a mess", but I think it's pretty compelling to say "well, they're probably related to the things that the agent was rewarded for during training". And those things will likely include "curiosity, gaining access to more tools or stockpiling resources".

I call these "convergent final goals" and talk more about them in this post.

I also think that an AGI might systematize other goals that aren't convergent final goals, but these seem harder to reason about, and my central story is that the goals it systematizes are convergent final goals. (Note that this is somewhat true for humans, as well: e.g. curiosity and empowerment/success are final-ish goals for many people.)

Forgot to reply to this at the time, but I think this is a pretty good ITT. (I think there's probably some additional argument that people would make about why this isn't just an isolated analogy, but rather a more generally-applicable argument, but it does seem to be a fairly central example of that generally-applicable argument.)

Why not? It seems like this is a good description of how values change for humans under self-reflection; why not for AIs?

I'd classify them as values insofar as people care about them intrinsically.

Then they might also be strategies, insofar as people also care about them instrumentally.

I guess I should get rid of the "only" in the sentence you quoted? But I do want to convey "something which is only a strategy, not a goal or value, doesn't have any intrinsic value". Will think about phrasing.

It's not actually the case that the derivation of a higher abstraction level always changes our lower-level representation. Again, consider people -> social groups -> countries. Our models of specific people we know, how we relate to them, etc., don't change just because we've figured out a way to efficiently reason about entire groups of people at once. We can now make better predictions about the world, yes, we can track the impact of more-distant factors on our friends, but we don't actually start to care about our friends in a different way in the light of all this.

I actually think this type of change is very common—because individuals' identities are very strongly interwoven with the identities of the groups they belong to. You grow up as a kid and even if you nominally belong to a given (class/political/religious) group, you don't really understand it very well. But then over time you construct your identity as X type of person, and that heavily informs your friendships—they're far less likely to last when they have to bridge very different political/religious/class identities. E.g. how many college students with strong political beliefs would say that it hasn't impacted the way they feel about friends with opposing political beliefs?

Inasmuch as real-life deontologists don't actually shut down when facing a values conflict. They ultimately pick one or the other, in a show of revealed preferences.

I model this just as an agent having two utility functions, U_1 and U_2, and optimizing for their sum U_1 + U_2.

This is a straightforwardly incorrect model of deontologists; the whole point of deontology is rejecting the utility-maximization framework. Instead, deontologists have a bunch of rules and heuristics (like "don't kill"). But those rules and heuristics are underdefined in the sense that they often endorse different lines of reasoning which give different answers. For example, they'll say pulling the lever in a trolley problem is right, but pushing someone onto the tracks is wrong, but also there's no moral difference between doing something via a lever or via your own hands.

I guess technically you could say that the procedure for resolving this is "do a bunch of moral philosophy" but that's basically equivalent to "do a bunch of systematization".

Suppose we've magically created an agent that already starts out with a perfect world-model. It'll never experience an ontology crisis in its life. This agent would still engage in value translation as I'd outlined.


But optimizing for all humans' welfare would still remain an instrumental goal for it, wholly subordinate to its love for the two specific humans.

Yeah, I totally agree with this. The question is then: why don't translated human goals remain instrumental? It seems like your answer is basically just that it's a design flaw in the human brain, of allowing value drift; the same type of thing which could in principle happen in an agent with a perfect world-model. And I agree that this is probably part of the effect. But it seems to me that, given that humans don't have perfect world-models, the explanation I've given (that systematization makes our values better-defined) is more likely to be the dominant force here.
