Richard Ngo

Former AI safety research engineer, now PhD student in philosophy of ML at Cambridge. I'm originally from New Zealand but have lived in the UK for 6 years, where I did my undergrad and master's degrees (in Computer Science, Philosophy, and Machine Learning). Blog:


Shaping safer goals
AGI safety from first principles


Inner Alignment: Explain like I'm 12 Edition

The images in this post seem to be broken.

Thoughts on gradient hacking

What mechanism would ensure that these two logic pieces only fire at the same time? Whatever it is, I expect that mechanism to be changed in response to failures.

Thoughts on gradient hacking

I discuss the possibility of it going in some other direction when I say "The two most salient options to me". But the bit of Evan's post that this contradicts is:

Now, if the model gets to the point where it's actually just failing because of this, then gradient descent will probably just remove that check—but the trick is never to actually get there.
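Evan's point about gradient descent removing a failing check can be illustrated with a toy sketch (my own illustration, with made-up loss terms and constants, not anything from the original post): a scalar "check" parameter that deliberately adds loss whenever it is active gets optimised away by plain gradient descent, even while the ordinary task parameter converges normally.

```python
# Toy sketch: a model with a self-sabotage "check" parameter c.
# Whenever c is nonzero, the check adds extra loss. Plain gradient
# descent on the total loss drives c toward zero, i.e. "removes the
# check", while the task parameter w converges to its optimum.

def grads(w, c):
    # task loss: (w - 3)^2, so dL/dw = 2 (w - 3)
    # check penalty: 5 c^2, so dL/dc = 10 c
    return 2.0 * (w - 3.0), 10.0 * c

w, c = 0.0, 1.0   # start with the check fully active
lr = 0.05
for _ in range(500):
    gw, gc = grads(w, c)
    w -= lr * gw
    c -= lr * gc

# After training, w is near the task optimum and c is near zero:
# gradient descent has optimised the check away.
print(w, c)
```

This only shows the case where the check actually fires and incurs loss; the "trick" Evan describes is a model arranging never to reach that regime, which a static picture like this doesn't capture.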

Formalizing Objections against Surrogate Goals

Interesting report :) One quibble:

For one, our AIs can only use “things like SPI” if we actually formalize the approach

I don't see why this is the case. If it's possible for humans to start using things like SPI without a formalisation, why couldn't AIs too? (I agree it's more likely that we can get them to do so if we formalise it, though.)

Frequent arguments about alignment

Whether this is a point for the advocate or the skeptic depends on whether advances in RL from human feedback unlock other alignment work more than they unlock other capabilities work. I think there's room for reasonable disagreement on this question, although I favour the former.

Frequent arguments about alignment

Skeptic: It seems to me that the distinction between "alignment" and "misalignment" has become something of a motte and bailey. Historical arguments that AIs would be misaligned used it in sense 1: "AIs having sufficiently general and large-scale motivations that they acquire the instrumental goal of killing all humans (or equivalently bad behaviour)". Now people are using the word in sense 2: "AIs not quite doing what we want them to do". But when our current AIs aren't doing quite what we want them to do, is that mainly evidence that future, more general systems will be misaligned1 (which I agree is bad) or misaligned2?

Advocate: Concepts like agency are continuous spectra. GPT-3 is a little bit agentic, and we'll eventually build AGIs that are much more agentic. Insofar as GPT-3 is trying to do something, it's trying to do the wrong thing. So we should expect future systems to be trying to do the wrong thing in a much more worrying way (aka be misaligned1) for approximately the same reason: that we trained them on loss functions that incentivised the wrong thing.

Skeptic: I agree that this is possible. But what should our update be after observing large language models? You could look at the difficulties of making GPT-3 do exactly what we want, and see this as evidence that misalignment is a big deal. But actually, large language models seem like evidence against misalignment1 being a big deal: they seem to be quite intelligent without being very agentic, whereas the original arguments for worrying about misalignment1 relied on the idea that intelligence and agency are tightly connected, making it very hard to build superintelligent systems which don't have large-scale goals.

Advocate: Even if that's true for the original arguments, it's not for more recent arguments.

Skeptic: These newer arguments rely on assumptions about economic competition and coordination failures which seem quite speculative to me, and which haven't been vetted very much.

Advocate: These assumptions seem like common sense to me - e.g. lots of people are already worried about the excesses of capitalism. But even if they're speculative, they're worth putting a lot of effort into understanding and preparing for.

In case it wasn't clear from inside the dialogue, I'm quite sympathetic to both sides of this conversation (indeed, it's roughly a transcript of a debate that I've had with myself a few times). I think more clarity on these topics would be very valuable.

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

These aren't complicated or borderline cases, they are central examples of what we are trying to avert with alignment research.

I'm wondering if the disagreement over the centrality of this example is downstream from a disagreement about how easy the "alignment check-ins" that Critch talks about are. If they are the sort of thing that can be done successfully in a couple of days by a single team of humans, then I share Critch's intuition that the system in question starts off only slightly misaligned. By contrast, if they require a significant proportion of the human time and effort that was put into originally training the system, then I am much more sympathetic to the idea that what's being described is a central example of misalignment.

My (unsubstantiated) guess is that Paul pictures alignment check-ins becoming much harder (i.e. closer to the latter case mentioned above) as capabilities increase? Whereas maybe Critch thinks that they remain fairly easy in terms of number of humans and time taken, but that over time even this becomes economically uncompetitive.

Challenge: know everything that the best go bot knows about go

I'm not sure what you mean by "actual computation rather than the algorithm as a whole". I thought that I was talking about the knowledge of the trained model which actually does the "computation" of which move to play, and you were talking about the knowledge of the algorithm as a whole (i.e. the trained model plus the optimising bot).

Formal Inner Alignment, Prospectus

Mesa-optimizers are in the search space and would achieve high scores in the training set, so why wouldn't we expect to see them?

I like this as a statement of the core concern (modulo some worries about the concept of mesa-optimisation, which I'll save for another time).

With respect to formalization, I did say up front that less-formal work, and empirical work, is still valuable.

I missed this disclaimer, sorry. So that assuages some of my concerns about balancing types of work. I'm still not sure what intuitions or arguments underlie your optimism about formal work, though. I assume that this would be fairly time-consuming to spell out in detail - but given that the core point of this post is to encourage such work, it seems worth at least gesturing towards those intuitions, so that it's easier to tell where any disagreement lies.
