Richard Ngo

Former AI safety research engineer, now PhD student in philosophy of ML at Cambridge. I'm originally from New Zealand but have lived in the UK for 6 years, where I did my undergrad and master's degrees (in Computer Science, Philosophy, and Machine Learning). Blog:


Shaping safer goals
AGI safety from first principles

Thoughts on gradient hacking

I discuss the possibility of it going in some other direction when I say "The two most salient options to me". But the bit of Evan's post that this contradicts is:

Now, if the model gets to the point where it's actually just failing because of this, then gradient descent will probably just remove that check—but the trick is never to actually get there.

Formalizing Objections against Surrogate Goals

Interesting report :) One quibble:

For one, our AIs can only use “things like SPI” if we actually formalize the approach

I don't see why this is the case. If it's possible for humans to start using things like SPI without a formalisation, why couldn't AIs too? (I agree it's more likely that we can get them to do so if we formalise it, though.)

Frequent arguments about alignment

Whether this is a point for the advocate or the skeptic depends on whether advances in RL from human feedback unlock other alignment work more than they unlock other capabilities work. I think there's room for reasonable disagreement on this question, although I favour the former.

Frequent arguments about alignment

Skeptic: It seems to me that the distinction between "alignment" and "misalignment" has become something of a motte and bailey. Historical arguments that AIs would be misaligned used it in sense 1: "AIs having sufficiently general and large-scale motivations that they acquire the instrumental goal of killing all humans (or equivalently bad behaviour)". Now people are using the word in sense 2: "AIs not quite doing what we want them to do". But when our current AIs aren't doing quite what we want them to do, is that mainly evidence that future, more general systems will be misaligned1 (which I agree is bad) or misaligned2?

Advocate: Concepts like agency are continuous spectra. GPT-3 is a little bit agentic, and we'll eventually build AGIs that are much more agentic. Insofar as GPT-3 is trying to do something, it's trying to do the wrong thing. So we should expect future systems to be trying to do the wrong thing in a much more worrying way (aka be misaligned1) for approximately the same reason: that we trained them on loss functions that incentivised the wrong thing.

Skeptic: I agree that this is possible. But what should our update be after observing large language models? You could look at the difficulties of making GPT-3 do exactly what we want, and see this as evidence that misalignment is a big deal. But actually, large language models seem like evidence against misalignment1 being a big deal: they seem to be quite intelligent without being very agentic, whereas the original arguments for worrying about misalignment1 relied on the idea that intelligence and agency are so tightly connected that it would be very hard to build superintelligent systems which don't have large-scale goals.

Advocate: Even if that's true for the original arguments, it's not for more recent arguments.

Skeptic: These newer arguments rely on assumptions about economic competition and coordination failures which seem quite speculative to me, and which haven't been vetted very much.

Advocate: These assumptions seem like common sense to me - e.g. lots of people are already worried about the excesses of capitalism. But even if they're speculative, they're worth putting a lot of effort into understanding and preparing for.

In case it wasn't clear from inside the dialogue, I'm quite sympathetic to both sides of this conversation (indeed, it's roughly a transcript of a debate that I've had with myself a few times). I think more clarity on these topics would be very valuable.

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

These aren't complicated or borderline cases; they are central examples of what we are trying to avert with alignment research.

I'm wondering if the disagreement over the centrality of this example is downstream from a disagreement about how easy the "alignment check-ins" that Critch talks about are. If they are the sort of thing that can be done successfully in a couple of days by a single team of humans, then I share Critch's intuition that the system in question starts off only slightly misaligned. By contrast, if they require a significant proportion of the human time and effort that was put into originally training the system, then I am much more sympathetic to the idea that what's being described is a central example of misalignment.

My (unsubstantiated) guess is that Paul pictures alignment check-ins becoming much harder (i.e. closer to the latter case mentioned above) as capabilities increase? Whereas maybe Critch thinks that they remain fairly easy in terms of number of humans and time taken, but that over time even this becomes economically uncompetitive.

Challenge: know everything that the best go bot knows about go

I'm not sure what you mean by "actual computation rather than the algorithm as a whole". I thought that I was talking about the knowledge of the trained model which actually does the "computation" of which move to play, and you were talking about the knowledge of the algorithm as a whole (i.e. the trained model plus the optimising bot).

Formal Inner Alignment, Prospectus

Mesa-optimizers are in the search space and would achieve high scores in the training set, so why wouldn't we expect to see them?

I like this as a statement of the core concern (modulo some worries about the concept of mesa-optimisation, which I'll save for another time).

With respect to formalization, I did say up front that less-formal work, and empirical work, is still valuable.

I missed this disclaimer, sorry. So that assuages some of my concerns about balancing types of work. I'm still not sure what intuitions or arguments underlie your optimism about formal work, though. I assume that this would be fairly time-consuming to spell out in detail - but given that the core point of this post is to encourage such work, it seems worth at least gesturing towards those intuitions, so that it's easier to tell where any disagreement lies.

Formal Inner Alignment, Prospectus

I have fairly mixed feelings about this post. On one hand, I agree that it's easy to mistakenly address some plausibility arguments without grasping the full case for why misaligned mesa-optimisers might arise. On the other hand, there has to be some compelling (or at least plausible) case for why they'll arise; otherwise the argument that 'we can't yet rule them out, so we should prioritise trying to rule them out' is privileging the hypothesis.

Secondly, it seems like you're heavily prioritising formal tools and methods for studying mesa-optimisation. But there are plenty of things that formal tools have not yet successfully analysed. For example, if I wanted to write a constitution for a new country, then formal methods would not be very useful; nor if I wanted to predict a given human's behaviour, or understand psychology more generally. So what's the positive case for studying mesa-optimisation in big neural networks using formal tools?

In particular, I'd say that the less we currently know about mesa-optimisation, the more we should focus on qualitative rather than quantitative understanding, since the latter needs to build on the former. And since we currently know very little about mesa-optimisation, this seems like an important consideration.

Challenge: know everything that the best go bot knows about go

The trained AlphaZero model knows lots of things about Go, in a comparable way to how a dog knows lots of things about running.

But the algorithm that gives rise to that model can know arbitrarily few things. (After all, the laws of physics gave rise to us, but they know nothing at all.)

Challenge: know everything that the best go bot knows about go

I'd say that this is too simple and programmatic to be usefully described as a mental model. The amount of structure encoded in the computer program you describe is very small, compared with the amount of structure encoded in the neural networks themselves. (I agree that you can have arbitrarily simple models of very simple phenomena, but those aren't the types of models I'm interested in here. I care about models which have some level of flexibility and generality, otherwise you can come up with dumb counterexamples like rocks "knowing" the laws of physics.)

As another analogy: would you say that the quicksort algorithm "knows" how to sort lists? I wouldn't, because you can instead just say that the quicksort algorithm sorts lists, which conveys more information (because it avoids anthropomorphic implications). Similarly, the program you describe builds networks that are good at Go, and does so by making use of the rules of Go, but can't do the sort of additional processing with respect to those rules which would make me want to talk about its knowledge of Go.
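To make the quicksort analogy concrete, here's a minimal sketch of the algorithm in Python (the simple recursive version, not an optimised in-place one). The whole thing is a handful of lines, which is part of the point: "it sorts lists" is already a complete description of what it does, leaving nothing extra for "knowledge" to refer to:

```python
def quicksort(xs):
    """Sort a list by recursively partitioning around a pivot element."""
    if len(xs) <= 1:
        return xs  # a list of 0 or 1 elements is already sorted
    pivot, rest = xs[0], xs[1:]
    smaller = [x for x in rest if x < pivot]
    larger = [x for x in rest if x >= pivot]
    # Everything before the pivot is smaller; everything after is at least as large.
    return quicksort(smaller) + [pivot] + quicksort(larger)
```

The procedure is fully transparent and has no flexibility or generality beyond the sorting task itself, which is why "sorts" seems more apt than "knows how to sort".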
