Richard Ngo

Former AI safety research engineer, now PhD student in philosophy of ML at Cambridge. I'm originally from New Zealand but have lived in the UK for 6 years, where I did my undergrad and masters degrees (in Computer Science, Philosophy, and Machine Learning). Blog:


Shaping safer goals
AGI safety from first principles


Formal Inner Alignment, Prospectus

I have fairly mixed feelings about this post. On one hand, I agree that it's easy to mistakenly address some plausibility arguments without grasping the full case for why misaligned mesa-optimisers might arise. On the other hand, there has to be some compelling (or at least plausible) case for why they'll arise, otherwise the argument that 'we can't yet rule them out, so we should prioritise trying to rule them out' is privileging the hypothesis. 

Secondly, it seems like you're heavily prioritising formal tools and methods for studying mesa-optimisation. But there are plenty of things that formal tools have not yet successfully analysed. For example, if I wanted to write a constitution for a new country, then formal methods would not be very useful; nor if I wanted to predict a given human's behaviour, or understand psychology more generally. So what's the positive case for studying mesa-optimisation in big neural networks using formal tools?

In particular, I'd say that the less we currently know about mesa-optimisation, the more we should focus on qualitative rather than quantitative understanding, since the latter needs to build on the former. And since we currently do know very little about mesa-optimisation, this seems like an important consideration.

Agency in Conway’s Game of Life

I don't think there is a fundamental difference in kind between trees, bacteria, humans, and hypothetical future AIs

There's at least one important difference: some of these are intelligent, and some of these aren't.

It does seem plausible that the category boundary you're describing is an interesting one. But when you indicate in your comment below that you see the "AI hypothesis" and the "life hypothesis" as very similar, then that mainly seems to indicate that you're using a highly nonstandard definition of AI, which I expect will lead to confusion.

Agency in Conway’s Game of Life

It feels like this post pulls a sleight of hand. You suggest that it's hard to solve the control problem because of the randomness of the starting conditions. But this is exactly the reason why it's also difficult to construct an AI with a stable implementation. If you can do the latter, then you can probably also create a much simpler system which creates the smiley face.

Similarly, in the real world, there's a lot of randomness which makes it hard to carry out tasks. But there are a huge number of strategies for achieving things in the world which don't require instantiating an intelligent controller. For example, trees and bacteria started out small but have now radically reshaped the earth. Do they count as having "perception, cognition, and action that are recognizably AI-like"?

Challenge: know everything that the best go bot knows about go

it's not obvious to me that this is a realistic target

Perhaps I should instead have said: it'd be good to explain to people why this might be a useful/realistic target. Because if you need propositions that cover all the instincts, then it seems like you're basically asking for people to revive GOFAI.

(I'm being unusually critical of your post because it seems that a number of safety research agendas lately have become very reliant on highly optimistic expectations about progress on interpretability, so I want to make sure that people are forced to defend that assumption rather than starting an information cascade.)

Challenge: know everything that the best go bot knows about go

I think at this point you've pushed the word "know" to a point where it's not very well-defined; I'd encourage you to try to restate the original post while tabooing that word.

This seems particularly valuable because there are some versions of "know" for which the goal of knowing everything a complex model knows seems wildly unmanageable (for example, trying to convert a human athlete's ingrained instincts into a set of propositions). So before people start trying to do what you suggested, it'd be good to explain why it's actually a realistic target.

Gradations of Inner Alignment Obstacles

I used to define "agent" as "both a searcher and a controller"

Oh, I really like this definition. Even if it's too restrictive, it seems like it gets at something important.

I'm not sure what you meant by "more compressed".

Sorry, that was quite opaque. I guess what I mean is that evolution is an optimiser but isn't an agent, and in part this has to do with how it's a very distributed process with no clear boundary around it. Whereas when you have the same problem being solved in a single human brain, then that compression makes it easier to point to the human as being an agent separate from its environment.

The rest of this comment is me thinking out loud in a somewhat incoherent way; no pressure to read/respond.

It seems like calling something a "searcher" describes only a very simple interface: at the end of the search, there needs to be some representation of the output which it has found. But that output may be very complex.

Whereas calling something a "controller" describes a much more complex interface between it and its environment: you need to be able to point not just to outcomes, but also to observations and actions. But each of those actions is usually fairly simple for a pure controller; if it's complex, then you need search to find which action to take at each step.

Now, it seems useful to sometimes call evolution a controller. For example, suppose you're trying to wipe out a virus, but it keeps mutating. Then there's a straightforward sense in which evolution is "steering" the world towards states where the virus still exists, in the short term. You could also say that it's steering the world towards states where all organisms have high fitness in the long term, but organisms are so complex that it's easier to treat them as selected outcomes, and abstract away from the many "actions" by evolution which led to this point.

In other words, evolution searches using a process of iterative control. Whereas humans control using a process of iterative search.

(As a side note, I'm now thinking that "search" isn't quite the right word, because there are other ways to do selection than search. For example, if I construct a mathematical proof (or a poem) by writing it one line at a time, letting my intuition guide me, then it doesn't really seem accurate to say that I'm searching over the space of proofs/poems. Similarly, a chain of reasoning may not branch much, but still end up finding a highly specific conclusion. Yet "selection" also doesn't really seem like the right word either, because it's at odds with normal usage, which involves choosing from a preexisting set of options - e.g. you wouldn't say that a poet is "selecting" a poem. How about "design" as an alternative? Which allows us to be agnostic about how the design occurred - whether it be via a control process like evolution, or a process of search, or a process of reasoning.)

Gradations of Inner Alignment Obstacles

To me it sounds like you're describing (some version of) agency, and so the most natural term to use would be mesa-agent.

I'm a bit confused about the relationship between "optimiser" and "agent", but I tend to think of the latter as more compressed, and so insofar as we're talking about policies it seems like "agent" is appropriate. Also, mesa-optimiser is taken already (under a definition which assumes that optimisation is equivalent to some kind of internal search).

Debate on Instrumental Convergence between LeCun, Russell, Bengio, Zador, and More

Yann LeCun: ... instrumental subgoals are much weaker drives of behavior than hardwired objectives. Else, how could one explain the lack of domination behavior in non-social animals, such as orangutans.

What's your specific critique of this? I think it's an interesting and insightful point.

Coherence arguments imply a force for goal-directed behavior

My internal model of you is that you believe this approach would not be enough because the utility would not be defined on the internal concepts of the agent. Yet I think it doesn't have so much to be defined on these internal concepts itself than to rely on some assumption about these internal concepts.

Yeah, this is an accurate portrayal of my views. I'd also note that the project of mapping internal concepts to mathematical formalisms was the main goal of the whole era of symbolic AI, and failed badly. (Although the analogy is a little loose, so I wouldn't take it as a decisive objection, but rather a nudge to formulate a good explanation of what they were doing wrong that you will do right.)

I agree more and more with you that the big mistake with using utility functions/reward for thinking about goal-directedness is not so much that they are a bad abstractions, but that they are often used as if every utility function is as meaningful as any other.

I don't think this is an accurate portrayal of my views. I am trying to say that utility functions are a bad abstraction for reasoning about AGI, for similar reasons to why health points are a bad abstraction for reasoning about livers. (I think I agree with the rest of the paragraph though.)

Coherence arguments imply a force for goal-directed behavior

Wouldn't these coherence arguments be pretty awesome? Wouldn't this be a massive step forward in our understanding (both theoretical and practical) of health, damage, triage, and risk allocation?

Insofar as such a system could practically help doctors prioritise, then that would be great. (This seems analogous to how utilities are used in economics.)

But if doctors use this concept to figure out how to treat patients, or using it when designing prostheses for their patients, then I expect things to go badly. If you take HP as a guiding principle - for example, you say "our aim is to build an artificial liver with the most HP possible" - then I'm worried that this would harm your ability to understand what a healthy liver looks like on the level of cells, or tissues, or metabolic pathways, or roles within the digestive system. Because HP is just not a well-defined concept at that level of resolution.

Analogously, it seems very hard to have a good understanding of goals without talking about concepts, instincts, desires, etc, and the roles that all of these play within cognition as a whole - concepts which people just don't talk about much around here. I hypothesise that this is partly because they think they can talk about utilities instead. But when people reason about how to design AGIs in terms of utilities, on the basis of coherence theorems, then I think they're making a very similar mistake as a doctor who tries to design artificial livers based on the theoretical triage virtues of HP.

Load More