Richard Ngo

Former AI safety research engineer, now PhD student in philosophy of ML at Cambridge. I'm originally from New Zealand but have lived in the UK for 6 years, where I did my undergrad and master's degrees (in Computer Science, Philosophy, and Machine Learning). Blog:


AGI safety from first principles
Shaping safer goals


Some thoughts on risks from narrow, non-agentic AI

To clarify your position: if I train a system that makes good predictions over 1 minute and 10 minutes and 100 minutes, is your position that there's not much reason to think that this system would make a good prediction over 1000 minutes? Analogously, if I train a system by meta-learning to get high rewards over a wide range of simulated environments, is your position that there's not much reason to think it will try to get high rewards when deployed in the real world?

In most of the cases you've discussed, trying to do tasks over much longer time horizons involves doing a very different task. Reducing reported crime over 10 minutes and reducing reported crime over 100 minutes have very little to do with reducing reported crime over a year or 10 years. The same is true for increasing my wealth, or increasing my knowledge (which over 10 minutes involves telling me things, but over a year might involve doing novel scientific research). I tend to be pretty optimistic about AI motivations generalising, but this type of generalisation seems far too underspecified. "Making predictions" is perhaps an exception, insofar as it's a very natural concept, and also one which transfers very straightforwardly from simulations to reality. But it probably depends a lot on what type of predictions we're talking about.

On meta-learning: it doesn't seem realistic to think about an AI "trying to get high rewards" on tasks where the time horizon is measured in months or years. Instead it'll try to achieve some generalisation of the goals it learned during training. But as I already argued, we're not going to be able to train on single tasks which are similar enough to real-world long-term tasks that motivations will transfer directly in any recognisable way.

Insofar as ML researchers think about this, I think their most common position is something like "we'll train an AI to follow a wide range of instructions, and then it'll generalise to following new instructions over longer time horizons". This makes a lot of sense to me, because I expect we'll be able to provide enough datapoints (mainly simulated datapoints, plus language pre-training) to pin down the concept "follow instructions" reasonably well, whereas I don't expect we can provide enough datapoints to pin down a motivation like "reduce reports of crime". (Note that I also think that we'll be able to provide enough datapoints to incentivise influence-seeking behaviour, so this isn't a general argument against AI risk, but rather an argument against the particular type of task-specific generalisation you describe.)

In other words, we should expect generalisation to long-term tasks to occur via a general motivation to follow our instructions, rather than on a task-specific basis, because the latter is so underspecified. But generalisation via following instructions doesn't have a strong bias towards easily-measurable goals.

I agree that it's only us who are operating by trial and error---the system understands what it's doing. I don't think that undermines my argument. The point is that we pick the system, and so determine what it's doing, by trial and error, because we have no understanding of what it's doing (under the current paradigm). For some kinds of goals we may be able to pick systems that achieve those goals by trial and error (modulo empirical uncertainty about generalization, as discussed in the second part). For other goals there isn't a plausible way to do that.

I think that throughout your post there's an ambiguity between two types of measurement. Type one measurements are those which we can make easily enough to use as a feedback signal for training AIs. Type two measurements are those which we can make easily enough to tell us whether an AI we've deployed is doing a good job. In general many more things are type-two-measurable than type-one-measurable, because training feedback needs to be very cheap. So if we train an AI on type one measurements, we'll usually be able to use type two measurements to evaluate whether it's doing a good job post-deployment. And that AI won't game those type two measurements even if it generalises its training signal to much longer time horizons, because it will never have been trained on type two measurements.

These seem like the key disagreements, so I'll leave off here, to prevent the thread from branching too much. (Edited one out because I decided it was less important).

Literature Review on Goal-Directedness

Really, the only issue for our purposes with this definition is that it focuses on how goal-directedness emerges, instead of what it entails for a system. Hence it gives less of a handle to predict the behavior of a system than Dennett’s intentional stance for example.

Another way to talk about this distinction is between definitions that allow you to predict the behaviour of agents which you haven't observed yet given how they were trained, versus definitions of goal-directedness which allow you to predict the future behaviour of an existing system given its previous behaviour.

I claim that the former is more important for our current purposes, for three reasons. Firstly, we don't have any AGIs to study, and so when we ask the question of how likely it is that AGIs will be goal-directed, we need to talk about the way in which that trait might emerge.

Secondly, because of the possibility of deceptive alignment, it doesn't seem like focusing on observed behaviour is sufficient for analysing goal-directedness.

Thirdly, suppose that we build a system that's goal-directed in a dangerous way. What do we do then? Well, we need to know why that goal-directedness emerges, and how to change the training regime so that it doesn't happen again.

Some thoughts on risks from narrow, non-agentic AI

In the second half of WFLL, you talk about "systems that have a detailed understanding of the world, which are able to adapt their behavior in order to achieve specific goals". Does the first half of WFLL also primarily refer to systems with these properties? And if so, does "reasoning honed by trial-and-error" refer to the reasoning that those systems do?

If yes, then this undermines your core argument that "[some things] can’t be done by trial and error. To solve such tasks we need to understand what we are doing and why it will yield good outcomes", because "systems that have a detailed understanding of the world" don't need to operate by trial and error; they understand what they're doing.

We do need to train them by trial and error, but it's very difficult to do so on real-world tasks which have long feedback loops, like most of the ones you discuss. Instead, we'll likely train them to have good reasoning skills on tasks which have short feedback loops, and then transfer them to real-world tasks with long feedback loops. But in that case, I don't see much reason why systems that have a detailed understanding of the world will have a strong bias towards easily-measurable goals on real-world tasks with long feedback loops. (Analogously: when you put humans in a new domain, and give them tasks and feedback via verbal instructions, they can quickly learn sophisticated concepts in that new domain, and optimise for those, not just the easily-measured concepts in that new domain.)

I'm pretty agnostic on whether AI will in fact be optimizing for the easily measured objectives used in training or for unrelated values that arise naturally in the learning process (or more likely some complicated mix), and part of my point is that it doesn't seem to much matter.

Why is your scenario called "You get what you measure" if you're agnostic about whether we actually get what we measure, even on the level of individual AIs?

Or do you mean part 1 to be the case where we do get what we measure, and part 2 to be the case where we don't?

I'm saying: it's easier to pursue easily-measured goals, and so successful organizations and individuals tend to do that and to outcompete those whose goals are harder to measure (and to get better at / focus on the parts of their goals that are easy to measure, etc.). I'm not positing any change in the strength of competition, I'm positing a change in the extent to which goals that are easier to measure are in fact easier to pursue.

Firstly, I think this is only true for organisations whose success is determined by people paying attention to easily-measured metrics, and not by reality. For example, an organisation which optimises for its employees having beliefs which are correct in easily-measured ways will lose out to organisations where employees think in useful ways. An organisation which optimises for revenue growth is more likely to go bankrupt than an organisation which optimises for sustainable revenue growth. An organisation which optimises for short-term customer retention loses long-term customer retention. Etc.

The case in which this is more worrying is when an organisation's success is determined by (for example) whether politicians like it, and politicians only pay attention to easily-measurable metrics. In this case, organisations which pursue easily-measured goals will be more successful than ones which pursue the goals the politicians actually want to achieve. This is why I make the argument that actually the pressure on politicians to pursue easily-measurable metrics is pretty weak (hence why they're ignoring most economists' recommendations on how to increase GDP).

I don't disagree with [AI improving our ability to steer our future] at all. The point is that right now human future-steering is basically the only game in town. We are going to introduce inhuman reasoning that can also steer the future, and over time human reasoning will lose out in relative terms. That's compatible with us benefiting enormously, if all of those benefits also accrue to automated reasoners---as your examples seem to. We will try to ensure that all this new reasoning will benefit humanity, but I describe two reasons that might be difficult.

I agree that you've described some potential harms; but in order to make this a plausible long-term concern, you need to give reasons to think that the harms outweigh the benefits of AI enhancing (the effective capabilities of) human reasoning. If you'd written a comparable post a few centuries ago talking about how human physical power will lose out to inhuman physical power, I would have had the same complaint.

(If you classify all future-steering machinery as "agentic" then evidently I'm talking about agents and I agree with the informal claim that "non-agentic" reasoning isn't concerning.)

I classify Facebook's newsfeed as future-steering in a weak sense (it steers the future towards political polarisation), but non-agentic. Do you agree with this? If so, do you agree that if FB-like newsfeeds became prominent in many ways, that would not be very concerning from a longtermist perspective?

Why I'm excited about Debate

suppose we sorted out a verbal specification of an aligned AI and had a candidate FAI coded up - could we then use Debate on the question "does this candidate match the verbal specification?"

I'm less excited about this, and more excited about candidate training processes or candidate paradigms of AI research (for example, solutions to embedded agency). I expect that there will be a large cluster of techniques which produce safe AGIs, we just need to find them - which may be difficult, but hopefully less difficult with Debate involved.

Why I'm excited about Debate

I think I agree with all of this. In fact, this argument is one reason why I think Debate could be valuable, because it will hopefully increase the maximum complexity of arguments that humans can reliably evaluate.

This eventually fails at some point, but hopefully it fails after the point at which we can use Debate to solve alignment in a more scalable way. (I don't have particularly strong intuitions about whether this hope is justified, though.)

Why I'm excited about Debate

If arguments had no meaning but to argue other people into things, if they were being subject only to neutral selection or genetic drift or mere conformism, there really wouldn't be any reason for "the kind of arguments humans can be swayed by" to work to build a spaceship.  We'd just end up with some arbitrary set of rules fixed in place.

I agree with this. My position is not that explicit reasoning is arbitrary, but that it developed via an adversarial process where arguers would try to convince listeners of things, and then listeners would try to distinguish between more and less correct arguments. This is in contrast with theories of reason which focus on the helpfulness of reason in allowing individuals to discover the truth by themselves, or theories which focus on its use in collaboration.

Here's how Sperber and Mercier describe their argument: 

Reason is not geared to solitary use, to arriving at better beliefs and decisions on our own. What reason does, rather, is help us justify our beliefs and actions to others, convince them through argumentation, and evaluate the justifications and arguments that others address to us.

I can see how my summary might give a misleading impression; I'll add an edit to clarify. Does this resolve the disagreement?

Radical Probabilism

DP: (sigh...) OK. I'm still never going to design an artificial intelligence to have uncertain observations. It just doesn't seem like something you do on purpose.

What makes you think that having certain observations is possible for an AI?

Imitative Generalisation (AKA 'Learning the Prior')

Oops, yes, this seems correct. I'll edit mine accordingly.

Imitative Generalisation (AKA 'Learning the Prior')

A few things that I found helpful in reading this post:

  • I mentally replaced D with "the past" and D' with "the future".
  • I mentally replaced z with "a guide to reasoning about the future".

This gives us a summary something like:

We want to understand the future, based on our knowledge of the past. However, training a neural net on the past might not lead it to generalise well about the future. Instead, we can train a network to be a guide to reasoning about the future, by evaluating its outputs based on how well humans with access to it can reason about the past, plus how well humans expect it to generalise to the future, plus immense amounts of interpretability work. (Note that this summary was originally incorrect, and has been modified in response to Lanrian's corrections below.)

Some concerns that arise from my understanding of this proposal:

  • It seems like the only thing stopping z from primarily containing object-level knowledge about the world is the human prior about the unlikelihood of object-level knowledge. But humans are really bad at assigning priors even to relatively simple statements - this is the main reason that we need science.
  • z will consist of a large number of claims, but I have no idea how to assign a prior to the conjunction of many big claims about the world, even in theory. That prior can't be calculated recursively, because there may be arbitrarily-complicated interactions between different components of z.
  • Consider the following proposal: "train an oracle to predict the future, along with an explanation of its reasoning. Reward it for predicting correctly, and penalise it for explanations that sound fishy". Is there an important difference between this and imitative generalisation?
  • An agent can "generalise badly" because it's not very robust, or because it's actively pursuing goals that are misaligned with those of humans. It doesn't seem like this proposal distinguishes between these types of failures. Is this distinction important in motivating the proposal?

Eight claims about multi-agent AGI safety

This all seems straightforwardly correct, so I've changed the line in question accordingly. Thanks for the correction :)

One caveat: technical work to address #8 currently involves either preventing AGIs from being misaligned in ways that lead them to make threats, or preventing AGIs from being aligned in ways which make them susceptible to threats. The former seems to qualify as an aspect of the "alignment problem", the latter not so much. I should have used the former as an example in my original reply to you, rather than using the latter.
