ricraz

Richard Ngo. I'm an AI safety research engineer at DeepMind (all opinions my own, not theirs). I'm from New Zealand and now based in London; I also did my undergrad and master's degrees in the UK (in Computer Science, Philosophy, and Machine Learning). Blog: thinkingcomplete.blogspot.com

ricraz's Comments

AGIs as populations

As opposed to coming up with powerful and predictive concepts, and refining them over time. Of course argument and counterargument are crucial to that, so there's no sharp line between this and "patching", but for me the difference is: are you starting with the assumption that the idea is fundamentally sound, and you just need to fix it up a bit to address objections? If you are in that position despite not having fleshed out the idea very much, that's what I'd characterise as "patching your way to good arguments".

AGIs as populations

Mostly "Wei Dai should write a blogpost that more clearly passes your "sniff test" of "probably compelling enough to be worth more of my attention"". And ideally a whole sequence or a paper.

It's possible that Wei has already done this, and that I just haven't noticed. But I had a quick look at a few of the blog posts linked in the "Disjunctive scenarios" post, and overall they seem pretty short and non-concrete, even for blog posts. Also, there are literally thirty items on the list, which makes it hard to know where to start (and also suggests a low average quality of items). Hence I'm asking Wei for one which is unusually worth engaging with; if I'm positively surprised, I'll probably ask for another.

AGIs as populations
Many of my "disjunctive" arguments were written specifically with that scenario in mind.

Cool, makes sense. I retract my pointed questions.

I guess I have a high prior that making something smarter than human is dangerous unless we know exactly what we're doing including the social/political aspects, and you don't, so you think the burden of proof is on me?

This seems about right. In general when someone proposes a mechanism by which the world might end, I think the burden of proof is on them. You're not just claiming "dangerous", you're claiming something like "more dangerous than anything else has ever been, even if it's intent-aligned". This is an incredibly bold claim and requires correspondingly thorough support.

does the current COVID-19 disaster not make you more pessimistic about "whatever efforts people will make when the problem starts becoming more apparent"?

Actually, COVID makes me a little more optimistic. First, because quite a few countries are handling it well. Second, because I wasn't even sure that lockdowns were a tool in the arsenal of democracies, and it seemed pretty wild to shut the economy down for so long. But they did. Also, essential services have proven much more robust than I'd expected (I thought there would be food shortages, etc.).

AGIs as populations

I'm pretty skeptical of this as a way of making progress. It's not that I already have strong disagreements with your arguments. Rather, if you haven't yet explained them thoroughly, I expect them to be underspecified, and to use words and concepts that are wrong in hard-to-see ways. One way this might happen is if those arguments use concepts (like "metaphilosophy") that kinda intuitively seem like they're pointing at something, but come with a bunch of connotations and underlying assumptions that make actually understanding them very tricky.

So my expectation for what happens here is: I look at one of your arguments, formulate some objection X, and then you say "No, that wasn't what I was claiming", or "Actually, ~X is one of the implicit premises", or "Your objection doesn't make any sense in the framework I'm outlining"; and then we repeat this a dozen or more times. I recently went through this process with Rohin, and it took a huge amount of time and effort (both here and in private conversation) to get anywhere near agreement, despite our views on AI being much more similar than yours and mine are.

And even then, you'll only have fixed the problems I'm able to spot, and not all the others. In other words, I think of patching your way to good arguments as kinda like patching your way to safe AGI. (To be clear, none of this is meant as specific criticism of your arguments, but rather as general comments about any large-scale arguments using novel concepts that haven't been made very thoroughly and carefully).

Having said this, I'm open to trying it for one of your arguments. So perhaps you can point me to one that you particularly want engagement on?

AGIs as populations
my own epistemic state, which is that arguments for AI risk are highly disjunctive, most types of AGI (not just highly agentic ones) are probably unsafe (i.e., are likely to lead us away from rather than towards a success story), at best probably only a few very specific AGI designs (which may well be agentic if combined with other properties) are both feasible and safe (i.e., can count as success stories)

Yeah, I guess I'm not surprised that we have this disagreement. To briefly sketch out why I disagree (mostly for common knowledge; I don't expect this to persuade you):

I think there's something like a logistic curve for how seriously we should take arguments. Almost all arguments are bad, and have many many ways in which they might fail. This is particularly true for arguments trying to predict the future, since they have to invent novel concepts to do so. Only once you've seen a significant amount of work put into exploring an argument, the assumptions it relies on, and the ways it might be wrong, should you start to assign moderate probability that the argument is true, and that the concepts it uses will in hindsight make sense.
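To make that picture slightly more concrete (purely as an illustrative sketch on my part; the functional form and parameters are arbitrary, and nothing here depends on them): treat credence in an argument as a logistic function of the amount of critical scrutiny it has survived,

$$c(w) = \frac{1}{1 + e^{-k(w - w_0)}}$$

where $w$ is the scrutiny so far, $w_0$ is a threshold, and $k$ sets the steepness. When $w$ is far below $w_0$, the gradient $c'(w) = k\,c(w)\bigl(1 - c(w)\bigr)$ is close to zero, so a little extra scrutiny of a mostly-unexamined argument shouldn't move your credence much.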

Most of the arguments mentioned in your post on disjunctive safety arguments fall far short of any reasonable credibility threshold. Most of them haven't even had a single blog post which actually tries to scrutinise them in a critical way, or lay out their key assumptions. And to be clear, a single blog post is just about the lowest possible standard you might apply. Perhaps it'd be sufficient in a domain where claims can be very easily verified, but when we're trying to make claims that a given effect will be pivotal for the entire future of humanity despite whatever efforts people will make when the problem starts becoming more apparent, we need higher standards to get to the part of the logistic curve with non-negligible gradient.

This is not an argument for dismissing all of these possible mechanisms out of hand, but an argument that they shouldn't (yet) be given high credence. I think they are often given too much credence because there's a sort of halo effect from the arguments which have been explored in detail, making us more willing to consider arguments that in isolation would seem very out-there. When you think about the arguments made in your disjunctive post, how hard do you try to imagine each one conditional on the knowledge that the other arguments are false? Are they actually compelling in a world where Eliezer is wrong about intelligence explosions and Paul is wrong about influence-seeking agents? (Maybe you'd say that there are legitimate links between these arguments, e.g. common premises; but if so, they're not highly disjunctive.)

Getting to an AGI that can safely do human or superhuman level safety work would be a success story in itself, which I labeled "Research Assistant" in my post

Good point, I shall read that post more carefully. I still don't think that this post is tied to the Research Assistant success story though.

AGIs as populations

My thought process when I use "safer" and "less safe" in posts like this is: the main arguments that AGI will be unsafe depend on it having certain properties, like agency, unbounded goals, lack of interpretability, the desire and ability to self-improve, and so on. So reducing the extent to which it has those properties will make it safer, because those arguments will be less applicable.

I guess you could have two objections to this:

  • Maybe safety is non-monotonic in those properties.
  • Maybe you don't get any increase in safety until you hit a certain threshold (corresponding to some success story).

I tend not to worry so much about these two objections because, to me, the properties I outlined above are still too vague for us to have a good idea of the landscape of risks with respect to them. Once we know what agency is, we can talk about its monotonicity. For now my epistemic state is: extreme agency is an important component of the main argument for risk, so all else equal, reducing it should reduce risk.

I like the idea of tying safety ideas to success stories in general, though, and will try to use it for my next post, which proposes more specific interventions during deployment. Having said that, I also believe that most safety work will be done by AGIs, and so I want to remain open-minded to success stories that are beyond my capability to predict.

AGIs as populations

Nothing in particular. My main intention with this post was to describe a way the world might be, and some of the implications. I don't think such work should depend on being related to any specific success story.

Multi-agent safety

I'm hoping there's a big qualitative difference between fine-tuning on the CEO task and fine-tuning on the "following instructions" task. Perhaps the magnitude of the difference would be something like: starting training on the new task 99% of the way through training, versus starting 20% of the way through. (And 99% is probably an underestimate: the last 10,000 years of civilisation are much less than 1% of the time we've spent evolving from, say, the first mammals.)
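As a rough check on that parenthetical (taking the first mammals to have appeared around 200 million years ago, which is only an approximate figure):

$$\frac{10{,}000 \ \text{years}}{200{,}000{,}000 \ \text{years}} = 5 \times 10^{-5} = 0.005\%$$

so civilisation occupies far less than 1% of the relevant stretch of evolutionary history.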

Plus, on the "follow human instructions" task you can add instructions which specifically push against whatever initial motivations they had, which is much harder on the CEO task.

I agree that this is a concern though.

Multi-agent safety

I should clarify that when I think about obedience, I'm thinking of obedience to the spirit of an instruction, not just its wording. Given this, the two seem fairly similar, and I'm open to arguments about whether it's better to talk in terms of one or the other. I guess I favour "obedience" because it has fewer connotations of agency: if you're "doing what a human wants you to do", then you might run off and do things before receiving any instructions. (Also because it's shorter and pithier; "the goal of doing what humans want" is a bit of a mouthful.)

Competitive safety via gradated curricula

Yeah, so I guess opinions on this would differ depending on how likely people think existential risk from AGI is. Personally, it's clear to me that agentic misaligned superintelligences are bad news, but I'm much less persuaded by descriptions of how long-term maximising behaviour arises in something like an oracle. The prospect of an AGI that's much more intelligent than humans and much less agentic seems quite plausible, even, perhaps, in an RL agent.
