Richard Ngo. I'm an AI safety research engineer at DeepMind (all opinions my own, not theirs). I'm from New Zealand and now based in London; I also did my undergrad and master's degrees in the UK (in Computer Science, Philosophy, and Machine Learning). Blog: thinkingcomplete.blogspot.com

ricraz's Comments

Realism about rationality

I'll try to respond properly later this week, but I like the point that embedded agency is about boundedness. Nevertheless, I think we probably disagree about how promising it is "to start with idealized rationality and try to drag it down to Earth rather than the other way around". If the starting point is incoherent, then this approach doesn't seem like it'll go far - if AIXI isn't useful to study, then probably AIXItl isn't either (although take this particular example with a grain of salt, since I know almost nothing about AIXItl).

I appreciate that this isn't an argument that I've made in a thorough or compelling way yet - I'm working on a post which does so.

Realism about rationality

Yeah, I should have been much more careful before throwing around words like "real". See the long comment I just posted for more clarification, and in particular this paragraph:

I'm not trying to argue that concepts which we can't formalise "aren't real", but rather that some concepts become incoherent when extrapolated a long way, and this tends to occur primarily for concepts which we can't formalise, and that it's those incoherent extrapolations which "aren't real" (I agree that this was quite unclear in the original post).

Realism about rationality

I like this review and think it was very helpful in understanding your (Abram's) perspective, as well as highlighting some flaws in the original post, and ways that I'd been unclear in communicating my intuitions. In the rest of my comment I'll try to write a synthesis of my intentions for the original post with your comments; I'd be interested in the extent to which you agree or disagree.

We can distinguish between two ways to understand a concept X. For lack of better terminology, I'll call them "understanding how X functions" and "understanding the nature of X". I conflated these in the original post in a confusing way.

For example, I'd say that studying how fitness functions would involve looking into the ways in which different components are important for the fitness of existing organisms (e.g. internal organs; circulatory systems; etc). Sometimes you can generalise that knowledge to organisms that don't yet exist, or even prove things about those components (e.g. there's probably useful maths connecting graph theory with optimal nerve wiring), but it's still very grounded in concrete examples. If we thought that we should study how intelligence functions in a similar way as we study how fitness functions, that might look like a combination of cognitive science and machine learning.

By comparison, understanding the nature of X involves performing a conceptual reduction on X by coming up with a theory which is capable of describing X in a more precise or complete way. The pre-theoretic concept of fitness (if it even existed) might have been something like "the number and quality of an organism's offspring". Whereas the evolutionary notion of fitness is much more specific, and uses maths to link fitness with other concepts like allele frequency.

Momentum isn't really a good example to illustrate this distinction, so perhaps we could use another concept from physics, like electricity. We can understand how electricity functions in a lawlike way by understanding the relationship between voltage, resistance and current in a circuit, and so on, even when we don't know what electricity is. If we thought that we should study how intelligence functions in a similar way as the discoverers of electricity studied how it functions, that might involve doing theoretical RL research. But we also want to understand the nature of electricity (which turns out to be the flow of electrons). Using that knowledge, we can extend our theory of how electricity functions to cases which seem puzzling when we think in terms of voltage, current and resistance in circuits (even if we spend almost all our time still thinking in those terms in practice). This illustrates a more general point: you can understand a lot about how something functions without having a reductionist account of its nature - but not everything. And so in the long term, to understand really well how something functions, you need to understand its nature. (Perhaps understanding how CS algorithms work in practice, versus understanding the conceptual reduction of algorithms to Turing Machines, is another useful example).

I had previously thought that MIRI was trying to understand how intelligence functions. What I take from your review is that MIRI is first trying to understand the nature of intelligence. From this perspective, your earlier objection makes much more sense.

However, I still think that there are different ways you might go about understanding the nature of intelligence, and that "something kind of like rationality realism" might be a crux here (as you mention). One way that you might try to understand the nature of intelligence is by doing mathematical analysis of what happens in the limit of increasing intelligence. I interpret work on AIXI, logical inductors, and decision theory as falling into this category. This type of work feels analogous to some of Einstein's thought experiments about the limit of increasing speed. Would it have worked for discovering evolution? That is, would starting with a pre-theoretic concept of fitness and doing mathematical analysis of its limiting cases (e.g. by thinking about organisms that lived for arbitrarily long, or had arbitrarily large numbers of children) have helped people come up with evolution? I'm not sure. There's an argument that Malthus did something like this, by looking at long-term population dynamics. But you could also argue that the key insights leading up to the discovery of evolution were primarily inspired by specific observations about the organisms around us. And in fact, even knowing evolutionary theory, I don't think that the extreme cases of fitness even make sense. So I would say that I am not a realist about "perfect fitness", even though the concept of fitness itself seems fine.

So an attempted rephrasing of the point I was originally trying to make, given this new terminology, is something like "if we succeed in finding a theory that tells us the nature of intelligence, it still won't make much sense in the limit, which is the place where MIRI seems to be primarily studying it (with some exceptions, e.g. your Partial Agency sequence). Instead, the best way to get that theory is to study how intelligence functions."

The reason I called it "rationality realism" not "intelligence realism" is that rationality has connotations of this limit or ideal existing, whereas intelligence doesn't. You might say that X is very intelligent, and Y is more intelligent than X, without agreeing that perfect intelligence exists. Whereas when we talk about rationality, there's usually an assumption that "perfect rationality" exists. I'm not trying to argue that concepts which we can't formalise "aren't real", but rather that some concepts become incoherent when extrapolated a long way, and this tends to occur primarily for concepts which we can't formalise, and that it's those incoherent extrapolations like "perfect fitness" which "aren't real" (I agree that this was quite unclear in the original post).

My proposed redefinition:

  • The "intelligence is intelligible" hypothesis is about how lawlike the best description of how intelligence functions will turn out to be.
  • The "realism about rationality" hypothesis is about how well-defined intelligence is in the limit (where I think of the limit of intelligence as "perfect rationality", and "well-defined" with respect not to our current understanding, but rather with respect to the best understanding of the nature of intelligence we'll ever discover).

[AN #80]: Why AI risk might be solved without additional intervention from longtermists

Some abstractions are heavily determined by the territory. The concept of trees is pretty heavily determined by the territory. Whereas the concept of betrayal is determined by the way that human minds function, which is determined by other people's abstractions. So while it seems reasonably likely to me that an AI "naturally thinks" in terms of the same low-level abstractions as humans, it thinking in terms of human high-level abstractions seems much less likely, absent some type of safety intervention. Which is particularly important because most of the key human values are very high-level abstractions.

Coherence arguments do not imply goal-directed behavior

+1, I would have written my own review, but I think I basically just agree with everything in this one (and to the extent I wanted to further elaborate on the post, I've already done so here).

Coherence arguments do not imply goal-directed behavior

This post directly addresses what I think is the biggest conceptual hole in our current understanding of AGI: what type of goals will it have, and why? I think it's been important in pushing people away from unhelpful EU-maximisation framings, and towards more nuanced and useful ways of thinking about goals.

Specification gaming examples in AI

I see this referred to a lot, and also find myself referring to it a lot. Having concrete examples of specification gaming is a valuable shortcut when explaining safety problems, as a "proof of concept" of something going wrong.

Open question: are minimal circuits daemon-free?

This post grounds a key question in safety in a relatively simple way. It led to the useful distinction between upstream and downstream daemons, which I think is necessary to make conceptual progress on understanding when and how daemons will arise.

The Rocket Alignment Problem

It's been very helpful for understanding the motivations behind MIRI's "deconfusion" research, in particular through linking it to another hard technical problem.

Rohin Shah on reasons for AI optimism

I predict that Rohin would say something like "the phrase 'approximately optimal for some objective/utility function' is basically meaningless in this context, because for any behaviour, there's some function which it's maximising".

You might then limit yourself to the set of functions that defines tasks that are interesting or relevant to humans. But then that includes a whole bunch of functions which define safe bounded behaviour as well as a whole bunch which define unsafe unbounded behaviour, and we're back to being very uncertain about which case we'll end up in.
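
To make the "for any behaviour, there's some function which it's maximising" point concrete, here is a minimal sketch of the standard construction. Everything in it (the toy states and actions, `erratic_policy`, `rationalising_utility`, `is_optimal`) is a hypothetical illustration rather than anything from the original discussion: given any deterministic policy, define a utility that scores 1 for whatever the policy does and 0 for anything else.

```python
from typing import Callable, Hashable, Iterable, Sequence

State = Hashable
Action = Hashable
Policy = Callable[[State], Action]
Utility = Callable[[State, Action], float]


def rationalising_utility(policy: Policy) -> Utility:
    """Build a utility function that the given policy trivially maximises.

    The utility is 1 when the action matches what the policy would do in
    that state, and 0 otherwise. This is an illustrative construction only.
    """
    def utility(state: State, action: Action) -> float:
        return 1.0 if action == policy(state) else 0.0
    return utility


def is_optimal(policy: Policy, utility: Utility,
               states: Iterable[State], actions: Sequence[Action]) -> bool:
    """Check that the policy picks a utility-maximising action in every state."""
    return all(
        utility(s, policy(s)) == max(utility(s, a) for a in actions)
        for s in states
    )


# Hypothetical toy environment: five states, three actions.
ACTIONS = ("left", "right", "stay")
STATES = range(5)


def erratic_policy(state: int) -> str:
    """An arbitrary, not obviously goal-directed behaviour."""
    return ACTIONS[(state * state + 1) % len(ACTIONS)]


def always_stay(state: int) -> str:
    """A second arbitrary behaviour, used for comparison."""
    return "stay"


u = rationalising_utility(erratic_policy)
print(is_optimal(erratic_policy, u, STATES, ACTIONS))  # the policy is optimal for its own utility
print(is_optimal(always_stay, u, STATES, ACTIONS))     # a different policy need not be
```

Since the construction works for every policy, "optimal for some utility function" rules nothing out by itself; any constraint has to come from restricting which functions count, which is where the uncertainty described above comes back in.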
