Richard Ngo

Former AI safety research engineer, now AI governance researcher at OpenAI. Blog:


Shaping safer goals
AGI safety from first principles


I don't argue at any point that ASIs will have a single goal. The argument goes through equally well if they have many. The question is why some of those goals are of the form "tile the universe with squiggles" at all. That's the part I'm addressing in this post.

Curious who just strong-downvoted and why.

Another concern about safety cases: they feel very vulnerable to regulatory capture. You can imagine a spectrum from "legislate that AI labs should do X, Y, Z, as enforced by regulator R" to "legislate that AI labs need to provide a safety case, which we define as anything that satisfies regulator R". In the former case, you lose flexibility, but the remit of R is very clear. In the latter case, R has a huge amount of freedom to interpret what counts as an adequate safety case. This can backfire badly if R is not very concerned about the most important threat models; and even if they are, the flexibility makes it easier for others to apply political pressure to them (e.g. "find a reason to approve this safety case, it's in the national interest").

Richard Ngo

Some opinions about AI and epistemology:

  1. One reason why many rationalists have such strong views about AI is that they are wrong about epistemology. Specifically, bayesian rationalism is a bad way to think about complex issues. 
  2. A better approach is meta-rationality. To summarize one guiding principle of (my version of) meta-rationality in a single sentence: if something doesn't make sense in the context of group rationality, it probably doesn't make sense in the context of individual rationality either.
  3. For example: there's no privileged way to combine many people's opinions into a single credence. You can average them, but that loses a lot of information. Or you can get them to bet on a prediction market, but that depends a lot on the details of the individuals' betting strategies. The group might settle on a number to help with planning and communication, but it's only a lossy summary of many different beliefs and models. Similarly, we should think of individuals' credences as lossy summaries of different opinions from different underlying models that they have.
  4. How does this apply to AI? Suppose we each think of ourselves as containing many different subagents that focus on understanding the world in different ways - e.g. studying different disciplines, using different styles of reasoning, etc. The subagent that thinks about AI from first principles might come to a very strong opinion. But this doesn't mean that the other subagents should fully defer to it (just as having one very confident expert in a room of humans shouldn't cause all the other humans to elect them as the dictator). E.g. maybe there's an economics subagent who will remain skeptical unless the AI arguments can be formulated in ways that are consistent with their knowledge of economics, or the AI subagent can provide evidence that is legible even to those other subagents (e.g. advance predictions).
  5. In my debate with Eliezer, he didn't seem to appreciate the importance of advance predictions; I think the frame of "highly opinionated subagents should convince other subagents to trust them, rather than just seizing power" is an important aspect of what he's missing. I think of rationalism as trying to form a single fully-consistent world-model; this has many of the same pitfalls as a country which tries to get everyone to agree on a single ideology. Even when that ideology is broadly correct, you'll lose a bunch of useful heuristics and intuitions that help actually get stuff done, because ideological conformity is prioritized.
  6. This perspective helps frame the debate about what our "base rate" for AI doom should be. I've been in a number of arguments that go roughly like (edited for clarity):
    Me: "Credences above 90% doom can't be justified given our current state of knowledge"
    Them: "But this is an isolated demand for rigor, because you're fine with people claiming that there's a 90% chance we survive. You're assuming that survival is the default, I'm assuming that doom is the default; these are symmetrical positions."
    But in fact there's no one base rate; instead, different subagents with different domains of knowledge will have different base rates. That will push P(doom) lower because most frames from most disciplines, and most styles of reasoning, don't predict doom. That's where the asymmetry which makes 90% doom a much stronger prediction than 90% survival comes from.
  7. This perspective is broadly aligned with a bunch of stuff that Scott Garrabrant and Abram Demski have written about (e.g. geometric rationality, Garrabrant induction). I don't think the ways I'm applying it to AI risk debates straightforwardly fall out of their more technical ideas; but I do expect that more progress on agent foundations will make it easier to articulate ideas like the ones above.
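To make the "no privileged way to combine credences" point in point 3 concrete, here is a hypothetical toy sketch (not from the original discussion): two common pooling rules, a plain arithmetic mean and a geometric mean of odds (the latter in the spirit of geometric rationality), give noticeably different summaries of the same set of subagent credences. All names and numbers here are illustrative assumptions.

```python
def arithmetic_pool(credences):
    """Plain average of probabilities: one confident subagent
    shifts the pooled number roughly in proportion to its gap."""
    return sum(credences) / len(credences)

def geometric_pool(credences):
    """Geometric mean of odds, converted back to a probability.
    This weights extreme credences differently than averaging does."""
    odds = 1.0
    for p in credences:
        odds *= p / (1 - p)
    odds **= 1 / len(credences)
    return odds / (1 + odds)

# One very confident "first-principles" subagent among two skeptics:
subagents = [0.99, 0.2, 0.2]
print(round(arithmetic_pool(subagents), 3))  # 0.463
print(round(geometric_pool(subagents), 3))   # 0.647
```

Neither number is the "true" group credence; each is a lossy summary that discards the disagreement between the underlying models, which is the point being made above.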

Part of my view here is that ARA agents will have unique affordances that no human organization will have had before (like having truly vast, vast amounts of pretty high skill labor).

The more labor they have, the more detectable they are, and the easier they are to shut down. Also, are you picturing them gaining money from crimes, then buying compute legitimately? I think the "crimes" part is hard to stop but the "paying for compute" part is relatively easy to stop.

My guess is that you need to be a decent but not amazing software engineer to ARA.

Yeah, you're probably right. I still stand by the overall point though.

1) It’s not even clear people are going to try to react in the first place.

I think this just depends a lot on how large-scale they are. If they are using millions of dollars of compute, and are effectively large-scale criminal organizations, then there are many different avenues by which they might get detected and suppressed.

If we don't solve alignment and we implement a pause on AI development in labs, the ARA AI may still continue to develop.

A world which can pause AI development is one which can also easily throttle ARA AIs.

The central point is:

  • At some point, ARA becomes unshutdownable unless you try hard with a pivotal cleaning act. We may be stuck with a ChaosGPT forever, which is not existential, but pretty annoying. People are going to die.
  • The ARA evolves over time. Maybe this evolution is very slow, maybe fast. Maybe it plateaus, maybe it does not plateau. I don't know.
  • This may take an indefinite number of years, but it can be a problem.

This seems like a weak central point. "Pretty annoying" and some people dying is just incredibly small compared with the benefits of AI. And "it might be a problem in an indefinite number of years" doesn't justify the strength of the claims you're making in this post, like "we are approaching a point of no return" and "without a treaty, we are screwed". 

An extended analogy: suppose the US and China both think it might be possible to invent a new weapon far more destructive than nuclear weapons, and they're both worried that the other side will invent it first. Worrying about ARAs feels like worrying about North Korea's weapons program. It could be a problem in some possible worlds, but it is always going to be much smaller, it will increasingly be left behind as the others progress, and if there's enough political will to solve the main problem (US and China racing) then you can also easily solve the side problem (e.g. by China putting pressure on North Korea to stop).

you can find some comments I've made about this by searching my twitter

Link here, and there are other comments in the same thread. Was on my laptop, which has twitter blocked, so couldn't link it myself before.

However, it seems to me like ruling out ARA is a relatively natural way to mostly rule out relatively direct danger.

This is what I meant by "ARA as a benchmark"; maybe I should have described it as a proxy instead. Though while I agree that ARA rules out most danger, I think that's because it's just quite a low bar. The sort of tasks involved in buying compute etc are ones most humans could do. Meanwhile more plausible threat models involve expert-level or superhuman hacking. So I expect a significant gap between ARA and those threat models.

once you do have ARA ability, you just need some moderately potent self-improvement ability (including training successor models) for the situation to look reasonably scary

You'd need either really good ARA or really good self-improvement ability for an ARA agent to keep up with labs given the huge compute penalty they'll face, unless there's a big slowdown. And if we can coordinate on such a big slowdown, I expect we can also coordinate on massively throttling potential ARA agents.

Answer by Richard Ngo

I think the opposite: ARA is just not a very compelling threat model in my mind. The key issue is that AIs that do ARA will need to be operating at the fringes of human society, constantly fighting off the mitigations that humans are using to try to detect them and shut them down. While doing all that, in order to stay relevant, they'll need to recursively self-improve at the same rate at which leading AI labs are making progress, but with far fewer computational resources. Meanwhile, if they grow large enough to be spending serious amounts of money, they'll need to somehow fool standard law enforcement and general societal scrutiny.

Superintelligences could do all of this, and ARA of superintelligences would be pretty terrible. But for models in the broad human or slightly-superhuman ballpark, ARA seems overrated, compared with threat models that involve subverting key human institutions. Remember, while the ARA models are trying to survive, there will be millions of other (potentially misaligned) models being deployed deliberately by humans, including on very sensitive tasks (like recursive self-improvement). These seem much more concerning.

Why then are people trying to do ARA evaluations? Well, ARA was originally introduced primarily as a benchmark rather than a threat model. I.e. it's something that roughly correlates with other threat models, but is easier and more concrete to measure. But, predictably, this distinction has been lost in translation. I've discussed this with Paul and he told me he regrets the extent to which people are treating ARA as a threat model in its own right.

Separately, I think the "natural selection favors AIs over humans" argument is a fairly weak one; you can find some comments I've made about this by searching my twitter.

You can think of this as a way of getting around the problem of fully updated deference, because the AI is choosing a policy based on what that policy would have done in the full range of hypothetical situations, and so it never updates away from considering any given goal. The cost, of course, is that we don't know how to actually pin down these hypotheticals.

Hypothesis: there's a way of formalizing the notion of "empowerment" such that an AI with the goal of empowering humans would be corrigible.

This is not straightforward, because an AI that simply maximized human POWER (as defined by Turner et al.) wouldn't ever let the humans spend that power. Intuitively, though, there's a sense in which a human who can never spend their power doesn't actually have any power. Is there a way of formalizing that intuition?

The direction that seems most promising is in terms of counterfactuals (or, alternatively, Pearl's do-calculus). Define the power of a human with respect to a distribution of goals G as the average ability of a human to achieve their goal if they'd had a goal sampled from G (alternatively: under an intervention that changed their goal to one sampled from G). Then an AI with a policy of never letting humans spend their resources would result in humans having low power. Instead, a human-power-maximizing AI would need to balance between letting humans pursue their goals, and preventing humans from doing self-destructive actions. The exact balance would depend on G, but one could hope that it's not very sensitive to the precise definition of G (especially if the AI isn't actually maximizing human power, but is more like a quantilizer, or is optimizing under pessimistic assumptions).
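The definition above can be illustrated with a hypothetical toy model (not from the post; every detail here is an assumption for illustration). Take G to be "the human wants to spend some fraction of their resources", and let the AI's policy cap how much spending it permits. A Monte Carlo estimate of the human's counterfactual power is then the average fraction of the sampled goal they can achieve:

```python
import random

def counterfactual_power(max_spend_allowed, n=100_000, seed=0):
    """Estimate power w.r.t. G: average ability to achieve a goal
    sampled from G, where G = uniform desired spend fraction and the
    AI's policy permits spending at most max_spend_allowed."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        desired = rng.uniform(0.0, 1.0)         # goal sampled from G
        achieved = min(desired, max_spend_allowed)
        total += achieved / max(desired, 1e-9)  # fraction of goal achieved
    return total / n

# An AI that never lets the human spend leaves them with zero power,
# no matter how large their nominal resources are:
print(counterfactual_power(0.0))  # 0.0
# An AI that always permits spending gives full power:
print(round(counterfactual_power(1.0), 3))  # 1.0
# A policy that blocks "self-destructive" spending above 50% sits in between:
print(round(counterfactual_power(0.5), 3))
```

Resampling `desired` independently on each rollout is the toy analogue of the do-style intervention on the human's goal; the balance point of the intermediate policy depends on the choice of G, matching the caveat above.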

The problem here is that these counterfactuals aren't very clearly-defined. E.g. imagine the hypothetical world where humans valued paperclips instead of love. Even a little knowledge of evolution would tell you that this hypothetical is kinda crazy, and maybe the question "what would the AI be doing in this world?" has no sensible answer (or maybe the answer would be "it would realize it's in a weird hypothetical world and behave accordingly"). Similarly, if we model this using the do-operation, the best policy is something like "wait until the human's goals suddenly and inexplicably change, then optimize hard for their new goal".

Having said that, in some sense what it means to model someone as an agent is that you can easily imagine them pursuing some other goal. So the counterfactuals above might not be too unnatural; or at least, no more unnatural than any other intervention modeled by Pearl's do-operator. Overall this line of inquiry seems promising and I plan to spend more time thinking about it.
