ricraz

Richard Ngo. I'm an AI safety research engineer at DeepMind (all opinions my own, not theirs). I'm from New Zealand and now based in London; I also did my undergrad and master's degrees in the UK (in Computer Science, Philosophy, and Machine Learning). Blog: thinkingcomplete.blogspot.com

ricraz's Comments

Demons in Imperfect Search

Oh actually, I now see the explanation, from the same post, that this can arise when the gene causing male bias is itself on the Y-chromosome.

Segregation-distorters subvert the mechanisms that usually guarantee fairness of sexual reproduction. For example, there is a segregation-distorter on the male sex chromosome of some mice which causes only male children to be born, all carrying the segregation-distorter. Then these males impregnate females, who give birth to only male children, and so on. You might cry "This is cheating!" but that's a human perspective; the reproductive fitness of this allele is extremely high, since it produces twice as many copies of itself in the succeeding generation as its nonmutant alternative. Even as females become rarer and rarer, males carrying this gene are no less likely to mate than any other male, and so the segregation-distorter remains twice as fit as its alternative allele. It's speculated that real-world group selection may have played a role in keeping the frequency of this gene as low as it seems to be. In which case, if mice were to evolve the ability to fly and migrate for the winter, they would probably form a single reproductive population, and would evolve to extinction as the segregation-distorter evolved to fixation.
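For anyone who wants to see how this plays out numerically, here's a rough toy simulation of my own (not from the quoted post, and deliberately crude: it ignores any limit on how many offspring the remaining females can bear). Males carrying the distorter sire only distorter sons, so the allele climbs towards fixation each generation even as females become rarer and rarer.

```python
import random

N = 10_000
# 'F' = female, 'M' = wild-type male, 'D' = male carrying the Y-linked distorter.
pop = ['F'] * (N // 2) + ['M'] * (N // 2 - 10) + ['D'] * 10

for gen in range(25):
    females = [x for x in pop if x == 'F']
    males = [x for x in pop if x != 'F']
    if not females or not males:
        print(f"gen {gen}: no breeding pairs left - the population is extinct")
        break
    # Each offspring gets a random father; distorter fathers produce only distorter sons,
    # wild-type fathers produce daughters and wild-type sons in equal numbers.
    pop = ['D' if random.choice(males) == 'D' else random.choice(['F', 'M'])
           for _ in range(N)]
    print(f"gen {gen}: females {pop.count('F') / N:.2f}, "
          f"distorter males {pop.count('D') / N:.2f}")
```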
Demons in Imperfect Search

+1, creating a self-reinforcing feedback loop =/= being an optimiser, and so I think any explanation of demons needs to focus on them making deliberate choices to reinforce themselves.

Demons in Imperfect Search
This can kick off an unstable feedback loop, e.g. a gene which biases toward male children can result in a more and more male-skewed population until the species dies out.

I'm suspicious of this mechanism; I'd think that as the number of males increases, there's increasing selection pressure against this gene. Do you have a reference?

[This comment is no longer endorsed by its author]
Disentangling arguments for the importance of AI safety

I think #3 could occur because of #2 (which I now mostly call "inner misalignment"), but it could also occur because of outer misalignment.

Broadly speaking, though, I think you're right that #2 and #3 are different types of things. Because of that and other issues, I no longer think that this post disentangles the arguments satisfactorily; I'll make a note of this at the top of the document.

Gradient hacking

I wasn't claiming that there'll be an explicit OR gate, just something functionally equivalent to it. To take a simple case, imagine that the two subnetworks each output a real number; these are multiplied together to get a final output, which we can interpret as the agent's reward (there'd need to be some further module which chooses behaviours in order to get that much reward, but let's ignore it for now). Each subnetwork's output measures how much that subnetwork thinks the agent's original goal has been preserved. Suppose that normally both subnetworks output 1, and then they switch to outputting 0 when they think they've passed the threshold of corruption, which makes the agent get 0 reward.
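Here's a minimal sketch of that toy case (my own illustration in PyTorch-style autodiff; the variable names and the parameterised gate are assumptions of mine, not anything from the post). It shows the two facts the points below rely on: once both subnetworks output 0 there's no gradient signal to them, but there's still a gradient signal to whatever implements the gate that combines them.

```python
import torch

# Stand-ins for the two subnetworks' scalar outputs ("how intact is the original goal?").
# In the argument these would be whole subnetworks; single parameters suffice here.
a = torch.tensor(0.0, requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)

# The combining "gate" is itself implemented by parameters - here a crude learned
# approximation of multiplication: gate(a, b) = w * a * b + c, initialised to w=1, c=0.
w = torch.tensor(1.0, requires_grad=True)
c = torch.tensor(0.0, requires_grad=True)

reward = w * a * b + c   # the agent's reward in the toy case
(-reward).backward()     # gradient descent tries to increase reward

print(a.grad, b.grad)    # both zero: no signal to the subnetworks once a = b = 0
print(w.grad, c.grad)    # w.grad is zero too, but c.grad = -1: the gate's own parameters
                         # still receive a gradient that pushes reward back up
```

In this sketch the erosion happens through the gate's bias term; in a real network the gate would be some larger learned circuit, but the same point applies: the parameters implementing the combination still get gradients even when the subnetworks themselves don't.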

I agree that, at this point, there's no gradient signal to change the subnetworks. My points are that:

  1. There's still a gradient signal to change the OR gate (in this case, the implementation of multiplication).
  2. Consider how they got to the point of outputting 0. They must have been decreasing from 1 as the overall network changed. So as the network changed, and they started producing outputs less than 1, there'd be pressure to modify them.
  3. The point above isn't true if the subnetworks go from 1 to 0 within one gradient step. In that case, the network will likely either bounce back and forth across the threshold (eroding the OR gate every time it does so) or else remain very close to the threshold (since there's no penalty for doing so). But since the transition from 1 to 0 needs to be continuous at *some* resolution, staying very *very* close to the threshold will produce subnetwork output somewhere between 0 and 1, which creates pressure for the subnetworks to be less accurate.

  4. It's non-obvious that agents will have anywhere near enough control over their internal functioning to set up such systems. Have you ever tried implementing two novel independent identical submodules in your brain? (Independence is very tricky because they're part of the same plan, and so a change in your underlying motivation to pursue that plan affects both). Ones which are so sensitive to your motivations that they can go from 1 to 0 within the space of a single gradient update?

To be honest, this is all incredibly speculative, so please interpret all of the above with the disclaimer that it's probably false or nonsensical for reasons I haven't thought of yet.

An intuition I'm drawing on here: https://lamport.azurewebsites.net/pubs/buridan.pdf

Gradient hacking

In the section you quoted I'm talking about the case in which the extent to which the agent fails is fairly continuous. Also note that the OR function is not differentiable, and so the two subnetworks must be implementing some continuous approximation to it. In that case, it seems likely to me that there's a gradient signal to change the failing-hard mechanism.

I feel like the last sentence was a little insufficient but I'm pretty uncertain about how to think intuitively about this topic. The only thing I'm fairly confident about is that intuitions based on discrete functions are somewhat misleading.
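To make that concrete, here's a tiny example of my own (not from the post): take a standard smooth stand-in for OR, soft_or(x, y) = 1 - (1 - x)(1 - y). Its partial derivatives are (1 - y) and (1 - x), so unless both inputs are already saturated at 1, gradients flow straight through the mechanism.

```python
import torch

# The two subnetworks' "the goal has been corrupted" signals (values picked arbitrarily).
x = torch.tensor(0.9, requires_grad=True)
y = torch.tensor(0.1, requires_grad=True)

soft_or = 1 - (1 - x) * (1 - y)   # a continuous approximation of OR
soft_or.backward()

print(x.grad, y.grad)             # 0.9 and 0.1: both inputs receive gradient, so gradient
                                  # descent can push on whatever produces these signals
```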

Gradient hacking
The original footnote provides one example of this, which is for the model to check if its objective satisfies some criterion, and fail hard if it doesn't. Now, if the model gets to the point where it's actually just failing because of this, then gradient descent will probably just remove that check—but the trick is never to actually get there. By having such a check in the first place, the model makes it so that gradient descent won't actually change its objective, since any change to its objective (keeping all the other parameters fixed, which is what gradient descent does since it computes partial derivatives) would lead to such a failure.

I don't think this argument works. After the agent has made that commitment, it needs to set some threshold for the amount of goal shift that will cause it to fail hard. But until the agent hits that threshold, the gradient will continue to point in the direction of that threshold. And with a non-infinitesimal learning rate, you'll eventually cross that threshold, and the agent will respond by failing hard.

A possible counterargument: the agent's ability to detect and enforce that threshold is not discrete, but also continuous, and so approaching the threshold will incur a penalty. But if that's the case, then the gradients will point in the direction of removing the penalty by reducing the agent's determination to fail upon detecting goal shift.
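As a rough sketch of that counterargument (entirely my own toy model with made-up variable names, and modelling only the penalty term rather than the training signal that causes the goal shift in the first place): suppose the fail-hard check is a sigmoid of how far the goal has shifted past a threshold. The penalty's gradient does push the goal back, but it also pushes on the threshold itself, weakening the commitment.

```python
import torch

goal_shift = torch.tensor(0.9, requires_grad=True)   # how far the goal has drifted
threshold  = torch.tensor(1.0, requires_grad=True)   # commitment: fail hard past this point
sharpness  = 10.0                                     # how abruptly the check fires

# Continuous detector: rises towards 1 as goal_shift crosses the threshold.
detect = torch.sigmoid(sharpness * (goal_shift - threshold))
reward = 1.0 - detect    # failing hard means losing reward
(-reward).backward()     # gradient descent tries to increase reward

print(goal_shift.grad)   # ~ +1.97: descent pushes the goal shift back down, away from
                         # the threshold...
print(threshold.grad)    # ~ -1.97: ...but it also raises the threshold, i.e. erodes the
                         # agent's determination to fail upon detecting goal shift
```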

The way that this might still work is if modifications to this type of high-level commitment are harder to "detect" in partial derivatives than modifications to the underlying goals - e.g. if it's hard to update away from the commitment without reducing the agent's competence in other ways. And this seems kinda plausible, because high-level thought narrows down the space of outcomes sharply. But this is even more speculative.

Realism about rationality

I'll try to respond properly later this week, but I like the point that embedded agency is about boundedness. Nevertheless, I think we probably disagree about how promising it is "to start with idealized rationality and try to drag it down to Earth rather than the other way around". If the starting point is incoherent, then this approach doesn't seem like it'll go far - if AIXI isn't useful to study, then probably AIXItl isn't either (although take this particular example with a grain of salt, since I know almost nothing about AIXItl).

I appreciate that this isn't an argument that I've made in a thorough or compelling way yet - I'm working on a post which does so.

Realism about rationality

Yeah, I should have been much more careful before throwing around words like "real". See the long comment I just posted for more clarification, and in particular this paragraph:

I'm not trying to argue that concepts which we can't formalise "aren't real", but rather that some concepts become incoherent when extrapolated a long way, and this tends to occur primarily for concepts which we can't formalise, and that it's those incoherent extrapolations which "aren't real" (I agree that this was quite unclear in the original post).
Realism about rationality

I like this review and think it was very helpful in understanding your (Abram's) perspective, as well as highlighting some flaws in the original post, and ways that I'd been unclear in communicating my intuitions. In the rest of my comment I'll try to write a synthesis of my intentions for the original post with your comments; I'd be interested in the extent to which you agree or disagree.

We can distinguish between two ways to understand a concept X. For lack of better terminology, I'll call them "understanding how X functions" and "understanding the nature of X". I conflated these in the original post in a confusing way.

For example, I'd say that studying how fitness functions would involve looking into the ways in which different components are important for the fitness of existing organisms (e.g. internal organs, circulatory systems, etc.). Sometimes you can generalise that knowledge to organisms that don't yet exist, or even prove things about those components (e.g. there's probably useful maths connecting graph theory with optimal nerve wiring), but it's still very grounded in concrete examples. If we thought that we should study how intelligence functions in the same way that we study how fitness functions, that might look like a combination of cognitive science and machine learning.

By comparison, understanding the nature of X involves performing a conceptual reduction on X by coming up with a theory which is capable of describing X in a more precise or complete way. The pre-theoretic concept of fitness (if it even existed) might have been something like "the number and quality of an organism's offspring". Whereas the evolutionary notion of fitness is much more specific, and uses maths to link fitness with other concepts like allele frequency.

Momentum isn't really a good example to illustrate this distinction, so perhaps we could use another concept from physics, like electricity. We can understand how electricity functions in a lawlike way by understanding the relationship between voltage, resistance and current in a circuit, and so on, even when we don't know what electricity is. If we thought that we should study how intelligence functions in the same way that the discoverers of electricity studied how it functions, that might involve doing theoretical RL research. But we also want to understand the nature of electricity (which turns out to be the flow of electrons). Using that knowledge, we can extend our theory of how electricity functions to cases which seem puzzling when we think in terms of voltage, current and resistance in circuits (even if we spend almost all our time still thinking in those terms in practice). This illustrates a more general point: you can understand a lot about how something functions without having a reductionist account of its nature - but not everything. And so in the long term, to understand really well how something functions, you need to understand its nature. (Perhaps understanding how CS algorithms work in practice, versus understanding the conceptual reduction of algorithms to Turing Machines, is another useful example).

I had previously thought that MIRI was trying to understand how intelligence functions. What I take from your review is that MIRI is first trying to understand the nature of intelligence. From this perspective, your earlier objection makes much more sense.

However, I still think that there are different ways you might go about understanding the nature of intelligence, and that "something kind of like rationality realism" might be a crux here (as you mention). One way that you might try to understand the nature of intelligence is by doing mathematical analysis of what happens in the limit of increasing intelligence. I interpret work on AIXI, logical inductors, and decision theory as falling into this category. This type of work feels analogous to some of Einstein's thought experiments about the limit of increasing speed. Would it have worked for discovering evolution? That is, would starting with a pre-theoretic concept of fitness and doing mathematical analysis of its limiting cases (e.g. by thinking about organisms that lived for arbitrarily long, or had arbitrarily large numbers of children) have helped people come up with evolution? I'm not sure. There's an argument that Malthus did something like this, by looking at long-term population dynamics. But you could also argue that the key insights leading up to the discovery of evolution were primarily inspired by specific observations about the organisms around us. And in fact, even knowing evolutionary theory, I don't think that the extreme cases of fitness even make sense. So I would say that I am not a realist about "perfect fitness", even though the concept of fitness itself seems fine.

So an attempted rephrasing of the point I was originally trying to make, given this new terminology, is something like "if we succeed in finding a theory that tells us the nature of intelligence, it still won't make much sense in the limit, which is the place where MIRI seems to be primarily studying it (with some exceptions, e.g. your Partial Agency sequence). Instead, the best way to get that theory is to study how intelligence functions."

The reason I called it "rationality realism" rather than "intelligence realism" is that rationality has connotations of this limit or ideal existing, whereas intelligence doesn't. You might say that X is very intelligent, and Y is more intelligent than X, without agreeing that perfect intelligence exists. Whereas when we talk about rationality, there's usually an assumption that "perfect rationality" exists. I'm not trying to argue that concepts which we can't formalise "aren't real", but rather that some concepts become incoherent when extrapolated a long way, and this tends to occur primarily for concepts which we can't formalise, and that it's those incoherent extrapolations like "perfect fitness" which "aren't real" (I agree that this was quite unclear in the original post).

My proposed redefinition:

  • The "intelligence is intelligible" hypothesis is about how lawlike the best description of how intelligence functions will turn out to be.
  • The "realism about rationality" hypothesis is about how well-defined intelligence is in the limit (where I think of the limit of intelligence as "perfect rationality", and "well-defined" with respect not to our current understanding, but rather with respect to the best understanding of the nature of intelligence we'll ever discover).