Richard Ngo

Former AI safety research engineer, now AI governance researcher at OpenAI. Blog:


Shaping safer goals
AGI safety from first principles


I just stumbled upon the Independence of Pareto dominated alternatives criterion; does the ROSE value have this property? I'm pattern-matching it as related to disagreement-point invariance, but haven't thought about this at all.

Yeah, I agree that I'm conveying the implicit prediction that, even though not all one-month tasks will fall at once, they'll fall closer together than you would expect without this framework.

I think I still disagree with your point, as follows: I agree that AI will soon do passably well at summarizing 10k word books, because the task is not very "sharp" - i.e. you get gradual rather than sudden returns to skill differences. But I think it will take significantly longer for AI to beat the quality of summary produced by a median expert in 1 month, because that expert's summary will in fact explore a rich hierarchical interconnected space of concepts from the novel (novel concepts, if you will).

Seems like there's a bunch of interesting stuff here, though some of it is phrased overly strongly.

E.g. "mechanistic interpretability requires program synthesis, program induction, and/or programming language translation" seems possible but far from obvious to me. In general I think that having a deep understanding of small-scale mechanisms can pay off in many different and hard-to-predict ways. Perhaps it's appropriate to advocate for MI researchers to pay more attention to these fields, but calling this an example of "reinventing", "reframing" or "renaming" seems far too strong.

Same for "we should not expect solving toy MI problems using humans to help with real world MI problems" - there are a huge number of cases in science where solving toy problems has led to theories that help solve real-world problems.

Ramanujan et al. (2020) showed that randomly initialized networks could be “trained” simply by pruning all of the weights that harmed performance on the task of interest. The resulting subnetwork may accomplish a task of interest, but only in a frivolous sense, and it should not be expected to generalize.

I can kinda see the intuition here, but could you explain why we shouldn't expect this to generalize?
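For anyone unfamiliar with the setup in the quoted passage, here's a toy sketch of what "training" by pruning alone looks like. This is only an illustration under my own assumptions: a random linear model, a greedy keep-or-drop rule, and made-up names (`mse`, `greedy_prune`). It is not the procedure from Ramanujan et al. The point is just that the weights are never updated; the only thing that gets "learned" is the mask.

```python
import random

def mse(weights, mask, xs, ys):
    """Mean squared error of a masked linear model (pruned weights contribute nothing)."""
    total = 0.0
    for x, y in zip(xs, ys):
        pred = sum(w * xi for w, keep, xi in zip(weights, mask, x) if keep)
        total += (pred - y) ** 2
    return total / len(xs)

def greedy_prune(weights, xs, ys):
    """Try removing each weight in turn; keep the removal only if the loss does not increase."""
    mask = [True] * len(weights)
    best = mse(weights, mask, xs, ys)
    for i in range(len(weights)):
        mask[i] = False
        trial = mse(weights, mask, xs, ys)
        if trial <= best:
            best = trial      # this weight was hurting (or not helping): leave it pruned
        else:
            mask[i] = True    # this weight was helping: restore it
    return mask, best

# Hypothetical toy task: the target depends only on the first three features.
random.seed(0)
n_features = 20
xs = [[random.gauss(0, 1) for _ in range(n_features)] for _ in range(200)]
ys = [x[0] + x[1] + x[2] for x in xs]

# Random initialization that is never trained by gradient descent.
weights = [random.gauss(0, 1) for _ in range(n_features)]

initial_loss = mse(weights, [True] * n_features, xs, ys)
mask, pruned_loss = greedy_prune(weights, xs, ys)
print(f"loss before pruning: {initial_loss:.3f}, after pruning: {pruned_loss:.3f}")
print(f"weights kept: {sum(mask)} of {n_features}")
```

By construction the loss on the data used for pruning can only go down or stay the same. Whether the surviving subnetwork does anything sensible on new data is exactly the open question.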

My default (very haphazard) answer: roughly 10,000-100,000 seconds in a day; we're at 1-second AGI now; I'm speculating 1 OOM every 1.5 years, which suggests that coherence over multiple days is 6-7 years away.

The 1.5 years thing is just a very rough ballpark though; I could probably be convinced to double or halve it by doing some more careful case studies.
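To make the arithmetic explicit, here's a rough sketch of the extrapolation. The function name and defaults are just mine for illustration. It assumes we start at a 1-second horizon and gain one order of magnitude at a fixed rate. Taking a round 10,000 seconds as the target gives 4 OOMs, so 6 years at 1.5 years per OOM; an exact day is 86,400 seconds, which is about 4.9 OOMs and so about 7.4 years.

```python
import math

def years_until_horizon(horizon_seconds, current_seconds=1.0, years_per_oom=1.5):
    """Years until the task horizon grows from current_seconds to horizon_seconds,
    assuming one order of magnitude of growth every years_per_oom years."""
    ooms = math.log10(horizon_seconds / current_seconds)
    return ooms * years_per_oom

SECONDS_PER_DAY = 86_400

targets = [
    ("round figure (10,000 s)", 10_000),
    ("one day", SECONDS_PER_DAY),
    ("three days", 3 * SECONDS_PER_DAY),
]

# Halve and double the 1.5-year guess to see how sensitive the answer is.
for label, seconds in targets:
    for rate in (0.75, 1.5, 3.0):
        years = years_until_horizon(seconds, years_per_oom=rate)
        print(f"{label:>24} at {rate} yr/OOM: {years:.1f} years")
```

The estimate is linear in the rate, so halving or doubling the 1.5-year figure halves or doubles every number in the table.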

Why is it cheating? That seems like the whole point of my framework - that we're comparing what AIs can do in any amount of time to what humans can do in a bounded amount of time.

But then we could just ask the question: “Can you please pose a question about string theory that no AI would have any prayer of answering, and then answer it yourself?” That’s not cherry-picking, or at least not in the same way.

But can't we equivalently just ask an AI to pose a question that no human would have a prayer of answering in one second? It wouldn't even need to be a trivial memorization thing, it could also be a math problem complex enough that humans can't do it that quickly, or drawing a link between two very different domains of knowledge.

How long would it take (in months) to train a smart recent college graduate with no specialized training in my field to complete this task?

This doesn't seem like a great metric because there are many tasks that a college grad can do with 0 training that current AI can't do, including:

  • Download and play a long video game to completion
  • Read and summarize a whole book
  • Spend a month planning an event

I do think that there's something important about this metric, but I think it's basically subsumed by my metric: if the task is "spend a month doing novel R&D for lidar", then my framework predicts that we'll need 1-month AGI for that. If the task is instead "answer the specific questions about lidar which this expert has been studying", then I claim that this is overfitting and therefore not a fair comparison; even if you expand it to "questions about lidar in general" there's probably a bunch of stuff that GPT-4 will know that the expert won't.

For the t-AGI framework, maybe you should also specify that the human starts the task only knowing things that are written multiple times on the internet. For example, Ed Witten could give snap (1-second) responses to lots of string theory questions that are WAY beyond current AI, using idiosyncratic intuitions he built up over many years. Likewise a chess grandmaster thinking about a board state for 1 second could crush GPT-4 or any other AI that wasn’t specifically and extensively trained on chess by humans.

I feel pretty uncertain about this, actually. Sure, there are some questions whose answers don't appear at all on the internet, but most human knowledge does, so you'd have to cherry-pick questions. And presumably GPT-4 has also inferred a bunch of intuitions from internet data which weren't explicitly written down there. In other words: even if this is true, it doesn't feel centrally relevant.

These are all arguments about the limit; whether or not they're relevant depends on whether they apply to the regime of "smart enough to automate alignment research".

For instance, for debate, one could believe:
1) Debate will work for long enough for us to use it to help find an alignment solution.
2) Debate is a plausible basis for an alignment solution.

I generally don't think about things in terms of this dichotomy. To me, an "alignment solution" is anything that will align an AGI which is then capable of solving alignment for its successor. And so I don't think you can separate these two things.

(Of course I agree that debate is not an arbitrarily scalable alignment solution in the sense that you can just keep training models using debate without adding any more techniques; but I don't think that really matters. We need to get to the moon, not to Andromeda.)

To preserve my current shards, I don't need to seek out a huge number of dogs proactively, but rather I just need to at least behave in conformance with the advantage function implied by my value head, which probably means "treading water" and seeing dogs sometimes in situations similar to historical dog-seeing events.

I think this depends sensitively on whether the "actor" and the "critic" in fact have the same goals, and I feel pretty confused about how to reason about this. For example, in some cases they could be two separate models, in which case the critic will most likely accurately estimate that "treading water" is in fact a negative-advantage action (unless there's some sort of acausal coordination going on). Or they could be two copies of the same model, in which case the critic's responses will depend on whether its goals are indexical or not (if they are, they're different from the actor's goals; if not, they're the same) and how easily it can coordinate with the actor. Or it could be two heads which share activations, in which case we can plausibly just think of the critic and the actor as two types of output produced by a single coherent agent - but then the critic doesn't need to produce a value function that's consistent with historical events, because an actor and a critic that are working together could gradient hack into all sorts of weird equilibria.
