Richard Ngo

Former AI safety research engineer, now AI governance researcher at OpenAI. Blog:


Shaping safer goals
AGI safety from first principles


ARC's first technical report: Eliciting Latent Knowledge

Ah, that makes sense. In the section where you explain the steps of the game, I interpreted the comments in parentheses as further explanations of the step, rather than just a single example. (In hindsight the latter interpretation is obvious, but I was reading quickly - might be worth making this explicit for others who are doing the same.) So I thought that Bayes nets were built into the methodology. Apologies for the oversight!

I'm still a little wary of how much the report talks about concepts in a human's Bayes net without really explaining why this is anywhere near a sensible model of humans, but I'll have another read through and see if I can pin down anything that I actively disagree with (since I do agree that it's useful to start off with very simple assumptions).

ARC's first technical report: Eliciting Latent Knowledge

Speaking just for myself, I think about this as an extension of the worst-case assumption. Sure, humans don't reason using Bayes nets -- but if we lived in a world where the beings whose values we want to preserve did reason about the world using a Bayes net, that wouldn't be logically inconsistent or physically impossible, and we wouldn't want alignment to fail in that world.

If you solve something given worst-case assumptions, you've solved it for all cases. Whereas if you solve it for one specific case (e.g. Bayes nets) then it may still fail if that's not the case we end up facing.

There's no obvious way that a messier model of human reasoning makes ELK easier.

Doesn't this imply that a Bayes-net model isn't the worst case?

EDIT: I guess it depends on whether "the human isn't well-modelled using a Bayes net" is a possible response the breaker could give. But that doesn't seem to fit the format of finding a test case where the builder's strategy fails (indeed, "Bayes nets" seems built into the definition of the game).

ARC's first technical report: Eliciting Latent Knowledge

We’ll assume the humans who constructed the dataset also model the world using their own internal Bayes net.

This seems like a crucial premise of the report; could you say more about it? You discuss why a model using a Bayes net might be "oversimplified and unrealistic", but as far as I can tell you don't talk about why this is a reasonable model of human reasoning.

Interlude: Agents as Automobiles

I guess I just don't feel like you've established that it would have been reasonable to have credence above 90% in either of those cases. Like, it sure seems obvious to me that computers and automobiles are super useful. But I have a huge amount of evidence now about both of those things that I can't really un-condition on. So, given that I know how powerful hindsight bias can be, it feels like I'd need to really dig into the details of possible alternatives before I got much above 90% based on facts that were known back then.

(Although this depends on how we're operationalising the claims. If the claim is just that there's something useful which can be done with computers - sure, but that's much less interesting. There's also something useful that can be done with quantum computers, and yet it seems pretty plausible that they remain niche and relatively uninteresting.)

Interlude: Agents as Automobiles

Interesting post. Overall, though, it feels like you aren't taking hindsight bias seriously enough. For example:

Some people thought battleships would beat carriers. Others thought that the entire war would be won from the air. Predicting the future is hard; we shouldn’t be confident. Therefore, we shouldn’t assign more than 90% credence to the claim that powerful, portable computers (assuming we figure out how to build them) will be militarily useful, e.g. in weapon guidance systems or submarine sensor suites.

In this particular case, an alternative is to have almost all of the computation done in a central location, with the results sent wherever they are needed. This is how computers worked for decades, and it's probably how computers will work in the future. So it seems overconfident for someone in 1950 to assign more than 90% credence to portability being crucial.

Also, the whole setup of picking something which we already know to be widespread (cars) and then applying Joe's arguments to it seems like it shouldn't tell us much. If Joe were saying 1% yes, 99% no for incentives to build APS systems, then the existence of counterexamples like cars which have similar "no" arguments would be compelling. But he's saying 80% yes, 20% no, and so the fact that there are some cases where his "no" arguments fail is unsurprising - according to him, "no" arguments of this strength should fail approximately 80% of the time.

There’s a selection effect that biases us towards thinking our intuition about these things is worse than it is

This is an interesting argument, but it applies most strongly to cases where we mainly became interested in the problem after it started looking tractable (e.g. AI for art, AI for Go). In other cases, such as chess or Turing tests, people thought of these as being of central importance well before we had any feasible way to approach them. So if they tend to have much quicker shortcuts than expected, that's good evidence that our intuitions are bad.

Conversation on technology forecasting and gradualism

I didn't push this point at the time, but Paul's claim that "GPT-3 + 5 person-years of engineering effort [would] foom" seems really wild to me, and probably a good place to poke at his model more. Is this 5 years of engineering effort and then humans leaving it alone with infinite compute? Or are the person-years of engineering doled out over time?

Unlike Eliezer, I do think that language models not wildly dissimilar to our current ones will be able to come up with novel insights about ML, but there's a long way between "sometimes comes up with novel insights" and "can run a process of self-improvement with increasing returns". I'm pretty confused about how a few years of engineering could get GPT-3 to a point where it could systematically make useful changes to itself (unless most of the work is actually being done by a program search which consumes astronomical amounts of compute).

Biology-Inspired AGI Timelines: The Trick That Never Works

The two extracts from this post that I found most interesting/helpful:

The problem is that the resource gets consumed differently, so base-rate arguments from resource consumption end up utterly unhelpful in real life.  The human brain consumes around 20 watts of power.  Can we thereby conclude that an AGI should consume around 20 watts of power, and that, when technology advances to the point of being able to supply around 20 watts of power to computers, we'll get AGI?

I'm saying that Moravec's "argument from comparable resource consumption" must be in general invalid, because it Proves Too Much.  If it's in general valid to reason about comparable resource consumption, then it should be equally valid to reason from energy consumed as from computation consumed, and pick energy consumption instead to call the basis of your median estimate.

You say that AIs consume energy in a very different way from brains?  Well, they'll also consume computations in a very different way from brains!  The only difference between these two cases is that you know something about how humans eat food and break it down in their stomachs and convert it into ATP that gets consumed by neurons to pump ions back out of dendrites and axons, while computer chips consume electricity whose flow gets interrupted by transistors to transmit information.  Since you know anything whatsoever about how AGIs and humans consume energy, you can see that the consumption is so vastly different as to obviate all comparisons entirely.

You are ignorant of how the brain consumes computation, you are ignorant of how the first AGIs built would consume computation, but "an unknown key does not open an unknown lock" and these two ignorant distributions should not assert much internal correlation between them.

Even without knowing the specifics of how brains and future AGIs consume computing operations, you ought to be able to reason abstractly about a directional update that you would make, if you knew any specifics instead of none.  If you did know how both kinds of entity consumed computations, if you knew about specific machinery for human brains, and specific machinery for AGIs, you'd then be able to see the enormous vast specific differences between them, and go, "Wow, what a futile resource-consumption comparison to try to use for forecasting."


You can think of there as being two biological estimates to anchor on, not just one.  You can imagine there being a balance that shifts over time from "the computational cost for evolutionary biology to invent brains" to "the computational cost to run one biological brain".

In 1960, maybe, they knew so little about how brains worked that, if you gave them a hypercomputer, the cheapest way they could quickly get AGI out of the hypercomputer using just their current knowledge, would be to run a massive evolutionary tournament over computer programs until they found smart ones, using 10^43 operations.

Today, you know about gradient descent, which finds programs more efficiently than genetic hill-climbing does; so the balance of how much hypercomputation you'd need to use to get general intelligence using just your own personal knowledge, has shifted ten orders of magnitude away from the computational cost of evolutionary history and towards the lower bound of the computation used by one brain.  In the future, this balance will predictably swing even further towards Moravec's biological anchor, further away from Somebody on the Internet's biological anchor.


Ngo and Yudkowsky on AI capability gains

My recommended policy in cases where this applies is "trust your intuitions and operate on the assumption that you're not a crackpot." 

Oh, certainly Eliezer should trust his intuitions and believe that he's not a crackpot. But I'm not arguing about what the person with the theory should believe, I'm arguing about what outside observers should believe, if they don't have enough time to fully download and evaluate the relevant intuitions. Asking the person with the theory to give evidence that their intuitions track reality isn't modest epistemology.

Ngo and Yudkowsky on AI capability gains

the easiest way to point out why they are dumb is with counterexamples. We can quickly "see" the counterexamples. E.g., if you're trying to see AGI as the next step in capitalism, you'll be able to find counterexamples where things become altogether different (misaligned AI killing everything; singleton that brings an end to the need to compete).

I'm not sure how this would actually work. The proponent of the AGI-capitalism analogy might say "ah yes, AGI killing everyone is another data point on the trend of capitalism becoming increasingly destructive". Or they might say (as Marx did) that capitalism contains the seeds of its own destruction. Or they might just deny that AGI will play out the way you claim, because their analogy to capitalism is more persuasive than your analogy to humans (or whatever other reasoning you're using). How do you then classify this as a counterexample rather than a "non-central (but still valid) manifestation of the theory"?

My broader point is that these types of theories are usually sufficiently flexible that they can "predict" most outcomes, which is why it's so important to pin them down by forcing them to make advance predictions.

On the rest of your comment, +1. I think that one of the weakest parts of Eliezer's argument was when he appealed to the difference between von Neumann and the village idiot in trying to explain why the next step above humans will be much more consequentialist than most humans (although unfortunately I failed to pursue this point much in the dialogue).

Ngo and Yudkowsky on AI capability gains

Your comment is phrased as if the object-level refutations have been tried, while conveying the meta-level intuitions hasn't been tried. If anything, it's the opposite: the sequences (and to some extent HPMOR) are practically all content about how to think, whereas Yudkowsky hasn't written anywhere near as extensively on object-level AI safety.

This has been valuable for community-building, but less so for making intellectual progress - because in almost all domains, the most important way to make progress is to grapple with many object-level problems, until you've developed very good intuitions for how those problems work. In the case of alignment, it's hard to learn things from grappling with most of these problems, because we don't have signals of when we're going in the right direction. Insofar as Eliezer has correct intuitions about when and why attempted solutions are wrong, those intuitions are important training data.

By contrast, trying to first agree on very high-level epistemological principles, and then do the object-level work, has a very poor track record. See how philosophy of science has done very little to improve how science works; and how reading the sequences doesn't improve people's object-level rationality very much.

I model you as having a strong tendency to abstract towards higher-level discussion of epistemology in order to understand things. (I also have a strong tendency to do this, but I think yours is significantly stronger than mine.) I expect that there's just a strong clash of intuitions here, which would be hard to resolve. But one prompt which might be useful: why aren't epistemologists making breakthroughs in all sorts of other domains?
