Richard Ngo

Former AI safety research engineer, now PhD student in philosophy of ML at Cambridge. I'm originally from New Zealand but have lived in the UK for 6 years, where I did my undergrad and master's degrees (in Computer Science, Philosophy, and Machine Learning). Blog:


AGI safety from first principles
Shaping safer goals


ricraz's Shortform

Ah, yeah, that's a great point. Although I think act-based agents is a pretty bad name, since those agents may often carry out a whole bunch of acts in a row - in fact, I think that's what made me overlook the fact that it's pointing at the right concept. So not sure if I'm comfortable using it going forward, but thanks for pointing that out.

ricraz's Shortform

Oracle-genie-sovereign is a really useful distinction that I think I (and probably many others) have avoided using mainly because "genie" sounds unprofessional/unacademic. This is a real shame, and a good lesson for future terminology.

ricraz's Shortform

Oh, actually, you're right (that you were wrong). I think I made the same mistake in my previous comment. Good catch.

ricraz's Shortform

Wait, really? I thought it made sense (although I'd contend that most people don't think about AIXI in terms of those TMs reinforcing hypotheses, which is the point I'm making). What's incorrect about it?

ricraz's Shortform

Yes we do: training is our evolutionary history, deployment is an individual lifetime. And our genomes are our reusable parameters.

Unfortunately I haven't yet written any papers/posts really laying out this analogy, but it's pretty central to the way I think about AI, and I'm working on a bunch of related stuff as part of my PhD, so hopefully I'll have a more complete explanation soon.

Continuing the takeoffs debate

So my reasoning is something like:

  • There's the high-level argument that AIs will recursively self-improve very fast.
  • There's support for this argument from the example of humans.
  • There's a rebuttal to that support from the concept of changing selection pressures.
  • There's a counterrebuttal to changing selection pressures from my post.

By the time we reach the fourth level down, there's not that much scope for updates on the original claim, because at each level we lose confidence that we're arguing about the right thing, and also we've zoomed in enough that we're ignoring most of the relevant considerations.

I'll make this more explicit.

ricraz's Shortform

I suspect that AIXI is misleading to think about in large part because it lacks reusable parameters - instead it just memorises all inputs it's seen so far. Which means the setup doesn't have episodes, or a training/deployment distinction; nor is any behaviour actually "reinforced".

A space of proposals for building safe advanced AI

Wouldn't it just be "train M* to win debates against itself as judged by H"? Since in the original formulation of debate a human inspects the debate transcript without assistance.

Anyway, I agree that something like this is also a reasonable way to view debate. In this case, I was trying to emphasise the similarities between Debate and the other techniques: I claim that if we call the combination of the judge plus one debater Amp(M), then we can think of the debate as M* being trained to beat Amp(M) by Amp(M)'s own standards.

Maybe an easier way to visualise this is that, given some question, M* answers that question, and then Amp(M) tries to identify any flaws in the argument by interrogating M*, and rewards M* if no flaws can be found.
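To make that visualisation concrete, here is a minimal sketch of the reward rule it describes. Everything in it is an assumption for illustration: the names `debate_reward`, `toy_m_star` and `toy_amp_m` are hypothetical, the critic is a toy stand-in for Amp(M), and real debate training would involve a multi-turn interrogation rather than a single check.

```python
from typing import Callable, List

# Hypothetical interfaces (not from any library):
# an answerer maps a question to an answer, and a critic maps
# (question, answer) to the list of flaws it managed to find.
Answerer = Callable[[str], str]
Critic = Callable[[str, str], List[str]]


def debate_reward(question: str, m_star: Answerer, amp_m: Critic) -> float:
    """Reward M* with 1.0 if Amp(M) finds no flaws in its answer, else 0.0."""
    answer = m_star(question)
    flaws = amp_m(question, answer)
    return 1.0 if not flaws else 0.0


def toy_m_star(question: str) -> str:
    # Toy stand-in for M*: only knows one answer.
    return "4" if question == "2+2?" else "unsure"


def toy_amp_m(question: str, answer: str) -> List[str]:
    # Toy stand-in for Amp(M): flags non-committal answers as flawed.
    return ["answer is non-committal"] if answer == "unsure" else []


if __name__ == "__main__":
    print(debate_reward("2+2?", toy_m_star, toy_amp_m))
    print(debate_reward("Is P = NP?", toy_m_star, toy_amp_m))
```

The point of writing it this way is that M* is scored entirely by Amp(M)'s own standards: the reward depends only on whether the critic can find a flaw, not on any separate ground truth.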

Reply to Jebari and Lundborg on Artificial Superintelligence

But if there is a continuum between the two, then thinking about the end points in terms of goals is relevant in interpreting the degrees of productiveness of goals in the middle.

I don't see why this is the case - you can just think about the continuum from non-goal to goal instead, which should get you the same benefits.

Clarifying inner alignment terminology

Hmm, I think this is still missing something.

  1. "What I mean by perfect training and infinite data here is for the model to always have optimal loss on all data points that it ever encounters" - I assume you instead mean all data points that it could ever encounter? Otherwise memorisation is a sufficient strategy, since it will only ever have encountered a finite number of data points.
  2. When you say "the optimal policy on the actual MDP that it experiences", is this just during training, or also during deployment? And if the latter, given that the world is non-stationary, in what sense are you referring to the "actual MDP"? (This is a hard question, and I'd be happy if you handwave it as long as you do so explicitly. Although I do think that the world not being an MDP is an important and overlooked fact.)
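
As a rough illustration of the memorisation worry in point 1, here is a toy sketch. The `Memoriser` class, the target function and the data are all hypothetical, chosen only to show a lookup table getting optimal loss on every data point it has encountered while doing badly on points it could encounter but hasn't.

```python
from typing import Dict, List


class Memoriser:
    """Toy model that stores every (input, label) pair it has seen."""

    def __init__(self, default: float = 0.0) -> None:
        self.table: Dict[int, float] = {}
        self.default = default  # returned for inputs never seen before

    def fit(self, xs: List[int], ys: List[float]) -> None:
        for x, y in zip(xs, ys):
            self.table[x] = y

    def predict(self, x: int) -> float:
        return self.table.get(x, self.default)


def mean_squared_error(model: Memoriser, xs: List[int], ys: List[float]) -> float:
    return sum((model.predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def target(x: int) -> float:
    # Hypothetical ground-truth function, assumed for this example only.
    return 2.0 * x


seen = [0, 1, 2, 3]
unseen = [4, 5]

model = Memoriser()
model.fit(seen, [target(x) for x in seen])

seen_loss = mean_squared_error(model, seen, [target(x) for x in seen])
unseen_loss = mean_squared_error(model, unseen, [target(x) for x in unseen])

if __name__ == "__main__":
    print(seen_loss, unseen_loss)
```

Here `seen_loss` is zero while `unseen_loss` is not, which is why the definition needs to quantify over data points the model could encounter rather than only those it has.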