2018 Review Discussion

I think Paul Christiano’s research agenda for the alignment of superintelligent AGIs presents one of the most exciting and promising approaches to AI safety. After being very confused about Paul’s agenda, chatting with others about similar confusions, and clarifying with Paul many times over, I’ve decided to write a FAQ addressing common confusions around his agenda.

This FAQ is not intended to provide an introduction to Paul’s agenda, nor is it intended to provide an airtight defense. This FAQ only aims to clarify commonly misunderstood aspects of the agenda. Unless otherwise stated, all views are my own views of Paul’s views. (ETA: Paul does not have major disagreements with anything expressed in this FAQ. There are many small points he might have expressed differently, but he endorses...

2Alex Turner2mo
Yes, a value grounded in a factual error will get blown up by better epistemics, just as "be uncertain about the human's goals" will get blown up by your beliefs getting their entropy deflated to zero by the good ole process we call "learning about reality." But insofar as corrigibility is "chill out and just do some good stuff without contorting 4D spacetime into the perfect shape or whatever", there are versions of that which don't automatically get blown up by reality when you get smarter. As far as I can tell, some humans are living embodiments of the latter. I have some "benevolent libertarian" values pushing me toward Pareto-improving everyone's resource counts and letting them do as they will with their compute budgets. What's supposed to blow that one up? This paragraph as a whole seems to make a lot of unsupported-to-me claims and seemingly equivocates between the two bolded claims, which are quite different. The first is that we (as adult humans with relatively well-entrenched values) would not want to defer to a strange alien. I agree. The second is that we wouldn't want to defer "even if we had great respect toward it and had been trained hard in childhood to act corrigibly towards it." I don't see why you believe that. Perhaps if we were otherwise socialized normally, we would end up unendorsing that value and not deferring? But I conjecture that if a person weren't raised with normal cultural influences, you could probably brainwash them into being aligned baby-eaters via reward shaping via brain stimulation reward. A utilitarian? Like, as Thomas Kwa asked, what are the type signatures of the utility functions you're imagining the AI to have? Your comment makes more sense to me if I imagine the utility function is computed over "conventional" objects-of-value [https://www.lesswrong.com/posts/dqSwccGTWyBgxrR58/turntrout-s-shortform-feed#cuTotpjqYkgcwnghp].
2Alex Turner2mo
"Don't care" is quite strong. If you still hold this view -- why don't you care about 3? (Curious to hear from other people who basically don't care about 3, either.)

Yeah, "don't care" is much too strong. This comment was just meant in the context of the current discussion. I could instead say:

The kind of alignment agenda that I'm working on, and the one we're discussing here, is not relying on this kind of generalization of corrigibility. This kind of generalization isn't why we are talking about corrigibility.

However, I agree that there are lots of approaches to building AI that rely on some kind of generalization of corrigibility, and that studying those is interesting and I do care about how that goes.

In the contex...

One of the most pleasing things about probability and expected utility theory is that there are many coherence arguments that suggest that these are the “correct” ways to reason. If you deviate from what the theory prescribes, then you must be executing a dominated strategy. There must be some other strategy that never does any worse than your strategy, but does strictly better than your strategy with certainty in at least one situation. There’s a good explanation of these arguments here.
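As a rough illustration of what "executing a dominated strategy" can look like, here is a minimal money-pump sketch. The items, the cyclic preference relation, and the fee are all hypothetical choices made for this example; they are not taken from the post.

```python
# Hypothetical cyclic preferences: A is preferred to B, B to C, and C to A.
# A pair (x, y) means "x is strictly preferred to y".
PREFERS = {("A", "B"), ("B", "C"), ("C", "A")}


def run_money_pump(start_item, offers, fee=1):
    """Simulate an agent that pays `fee` to swap whenever it prefers the offer.

    Returns the item held at the end and the agent's net money.
    """
    item, money = start_item, 0
    for offered in offers:
        if (offered, item) in PREFERS:
            item, money = offered, money - fee
    return item, money


def refuse_everything(start_item, offers):
    """Alternative strategy: never trade, so the item and money are unchanged."""
    return start_item, 0


offers = ["C", "B", "A"]
print(run_money_pump("A", offers))     # ends holding A again, but poorer
print(refuse_everything("A", offers))  # ends holding A with no loss
```

Under these assumptions the trading agent ends up with the same item it started with and strictly less money, so on this sequence of offers refusing every trade does strictly better.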

We shouldn’t expect mere humans to be able to notice any failures of coherence in a superintelligent agent, since if we could notice these failures, so could the agent. So we should expect that powerful agents appear coherent to us. (Note that it is possible that the...

I have no idea why I responded 'low' to 2. Does anybody think that's reasonable and fits in with what I wrote here, or did I just mean high?

"random utility-maximizer" is pretty ambiguous; if you imagine the space of all possible utility functions over action-observation histories and you imagine a uniform distribution over them (suppose they're finite, so this is doable), then the answer is low.

Heh, looking at my comment it turns out I said roughly the same thing 3 years ago.

I want to quickly draw attention to a concept in AI alignment: Robustness to Scale. Briefly, you want your proposal for an AI to be robust (or at least fail gracefully) to changes in its level of capabilities. I discuss three different types of robustness to scale: robustness to scaling up, robustness to scaling down, and robustness to relative scale.

The purpose of this post is to communicate, not to persuade. It may be that we want to bite the bullet of the strongest form of robustness to scale, and build an AGI that is simply not robust to scale, but if we do, we should at least realize that we are doing that.

Robustness to scaling up means that your AI system does not depend on not being...

Rereading this post while thinking about the approximations that we make in alignment, two points jump out at me:

  • I'm not convinced that robustness to relative scale is as fundamental as the other two, because there is no reason to expect that in general the subcomponents will be significantly different in power, especially in settings like adversarial training where both parts are trained according to the same approach. That being said, I still agree that this is an interesting question to ask, and some proposals might indeed depend on a version of this.
  • Robustn...

Note: weird stuff, very informal.

Suppose I search for an algorithm that has made good predictions in the past, and use that algorithm to make predictions in the future.

I may get a "daemon," a consequentialist who happens to be motivated to make good predictions (perhaps because it has realized that only good predictors survive). Under different conditions, the daemon may no longer be motivated to predict well, and may instead make "predictions" that help it achieve its goals at my expense.

I don't know whether this is a real problem or not. But from a theoretical perspective, not knowing is already concerning--I'm trying to find a strong argument that we've solved alignment, not just something that seems to work in practice.

I am pretty convinced that daemons are a real...

I consider the argument in this post a reasonably convincing negative answer to this question---a minimal circuit may nevertheless end up doing learning internally and thereby generate deceptive learned optimizers.

This suggests a second informal clarification of the problem (in addition to Wei Dai's comment): can the search for minimal circuits itself be responsible for generating deceptive behavior? Or is it always the case that something else was the offender and the search for minimal circuits is an innocent bystander?

If the search for minimal circuits ...

When I say an AI A is aligned with an operator H, I mean:

A is trying to do what H wants it to do.

The “alignment problem” is the problem of building powerful AI systems that are aligned with their operators.

This is significantly narrower than some other definitions of the alignment problem, so it seems important to clarify what I mean.

In particular, this is the problem of getting your AI to try to do the right thing, not the problem of figuring out which thing is right. An aligned AI would try to figure out which thing is right, and like a human it may or may not succeed.


Consider a human assistant who is trying their hardest to do what H wants.

I’d say this assistant is aligned with...

I decided that the answer deserves its own post.

Crossposted from my blog.

One thing I worry about sometimes is people writing code with optimisers in it, without realising that that's what they were doing. An example of this: suppose you were doing deep reinforcement learning, doing optimisation to select a controller (that is, a neural network that takes a percept and returns an action) that generated high reward in some environment. Alas, unknown to you, this controller actually did optimisation itself to select actions that score well according to some metric that so far has been closely related to your reward function. In such a scenario, I'd be wary about your deploying that controller, since the controller itself is doing optimisation which might steer the world into a weird and unwelcome place.

In order to avoid such


Okay, so another necessary condition for being downstream from an optimizer is being causally downstream. I'm sure there are other conditions, but the claim still feels like an important addition to the conversation.

In which I propose a closed-form solution to low impact, increasing corrigibility and seemingly taking major steps to neutralize basic AI drives 1 (self-improvement), 5 (self-protectiveness), and 6 (acquisition of resources).

Previously: Worrying about the Vase: Whitelisting, Overcoming Clinginess in Impact Measures, Impact Measure Desiderata

To be used inside an advanced agent, an impact measure... must capture so much variance that there is no clever strategy whereby an advanced agent can produce some special type of variance that evades the measure.
~ Safe Impact Measure

If we have a safe impact measure, we may have arbitrarily-intelligent unaligned agents which do small (bad) things instead of big (bad) things.

For the abridged experience, read up to "Notation", skip to "Experimental Results", and then to "Desiderata".

What is "Impact"?

One lazy Sunday afternoon, I...

Review

Note: this is on balance a negative review of the post, at least regarding the question of whether it should be included in a "Best of LessWrong 2018" compilation. I feel somewhat bad about writing it given that the author has already written a review that I regard as negative. That being said, I think that reviews of posts by people other than the author are important for readers looking to judge posts, since authors may well have distorted views of their own works.

  • The idea behind AUP, that ‘side effect avoidance’ should mean minimising changes in one’s ability to achieve arbitrary goals, seems very promising to me. I think the idea and its formulation in this post substantially moved forward the ‘impact regularisation’ line of research. This represents a change in opinion since I wrote this comment [https://www.lesswrong.com/posts/yEa7kwoMpsBgaBCgb/towards-a-new-impact-measure#igEAF5vzwm27NaEQz].
  • I think that this idea behind AUP has fairly obvious applications to human rationality and cooperation, although they aren’t spelled out in this post. This seems like a good candidate for follow-up work.
  • This post is very long, confusing to me in some sections, and contains a couple of English and mathematical typos.
  • I still believe that the formalism presented in this post has some flaws that make it not suitable for canonisation. For more detail, see my exchange in the descendants of this comment [https://www.lesswrong.com/posts/yEa7kwoMpsBgaBCgb/towards-a-new-impact-measure#igEAF5vzwm27NaEQz]. I still mostly agree with my claims about the technical aspects of AUP as presented in this post. Fleshing out these details is also, in my opinion, a good candidate for follow-up work.
  • I think that the ideas behind AUP that I’m excited about are better communicated in other posts by TurnTrout.
1Alex Turner3y
I'm curious whether these are applications I've started to gesture at in Reframing Impact, or whether what you have in mind as obvious isn't a subset of what I have in mind. I'd be interested in seeing your shortlist. Without rereading all of the threads, I'd like to note that I now agree with Daniel about the subhistories issue. I also agree that the formalization in this post is overly confusing and complicated.

I'm curious whether these are applications I've started to gesture at in Reframing Impact

I confess that it's been a bit since I've read that sequence, and it's not obvious to me how to go from the beginnings of gestures to their referents. Basically what I mean is 'when trying to be cooperative in a group, preserve generalised ability to achieve goals', nothing more specific than that.

Cross-posted to the EA forum.


Like last year and the year before, I’ve attempted to review the research that has been produced by various organisations working on AI safety, to help potential donors gain a better understanding of the landscape. This is a similar role to that which GiveWell performs for global health charities, and somewhat similar to that of a securities analyst with regard to possible investments. It appears that once again no-one else has attempted to do this, to my knowledge, so I've once again undertaken the task.

This year I have included several groups not covered in previous years, and read more widely in the literature.

My aim is basically to judge the output of each organisation in 2018 and compare it to their budget. This should...


I talk here about how a mathematician mindset can be useful for AI alignment. But first, a puzzle:

Given $n$, what is the least number $m \geq 2$ such that for $2 \leq k \leq n$, the base $k$ representation of $m$ consists entirely of 0s and 1s?

If you want to think about it yourself, stop reading.

For $n=2$, $m=2$.

For $n=3$, $m=3$.

For $n=4$, $m=4$.

For $n=5$, $m=82{,}000$.

Indeed, 82,000 is 10100000001010000 in binary, 11011111001 in ternary, 110001100 in base 4, and 10111000 in base 5.
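The small cases above can be checked by brute force. The following is a minimal sketch, not anything from the original post: the function names and the search limit are assumptions chosen for illustration, and the search only covers numbers up to the limit it is given.

```python
def digits(number, base):
    """Return the digits of a non-negative integer in `base`, most significant first."""
    if number == 0:
        return [0]
    out = []
    while number > 0:
        number, remainder = divmod(number, base)
        out.append(remainder)
    return out[::-1]


def only_zeros_and_ones(number, max_base):
    """True if the number uses only the digits 0 and 1 in every base from 2 to max_base."""
    return all(d <= 1 for base in range(2, max_base + 1) for d in digits(number, base))


def least_number(max_base, limit):
    """Smallest number from 2 to `limit` passing the check, or None if none is found."""
    for candidate in range(2, limit + 1):
        if only_zeros_and_ones(candidate, max_base):
            return candidate
    return None


for max_base in range(2, 6):
    print(max_base, least_number(max_base, limit=100_000))
```

A `None` result from `least_number` only means that nothing was found below `limit`, so a search like this can confirm the small cases but cannot settle whether a number exists for larger bases.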

What about when $n=6$?

So, a mathematician might tell you that this is an open problem. It is not known if there is any number $m \geq 2$ which consists of 0s and 1s in bases 2 through 6.

A scientist, on the other hand, might just tell you that clearly no such number exists. There...

I don't like the intro to the post. I feel like the example Scott gives makes the opposite of the point he wants it to make. Either a number with the given property exists or not. If such a number doesn't exist, creating a superintelligence won't change that fact. Given talk I've heard around the near certainty of AI doom, betting the human race on the nonexistence of a number like this looks pretty attractive by comparison -- and it's plausible there are AI alignment bets we could make that are analogous to this bet.

Faced with the astronomical amount of unclaimed and unused resources in our universe, one's first reaction is probably wonderment and anticipation, but a second reaction may be disappointment that our universe isn't even larger or doesn't contain even more resources (such as the ability to support 3^^^3 human lifetimes or perhaps to perform an infinite amount of computation). In a previous post I suggested that the potential amount of astronomical waste in our universe seems small enough that a total utilitarian (or the total utilitarianism part of someone’s moral uncertainty) might reason that since one should have made a deal to trade away power/resources/influence in this universe for power/resources/influence in universes with much larger amounts of available resources, it would be rational to behave as if this deal


I think that at the time this post came out, I didn't have the mental scaffolding necessary to really engage with it – I thought of this question as maybe important, but sort of "above my paygrade", something better left to other people who would have the resources to engage more seriously with it.

But, over the past couple years, the concepts here have formed an important component of my understanding of robust agency. Much of this came from private in-person conversations, but this post is the best writeup of the concept I'm cur...

1Ben Pace3y
Nomination

This is a post that's stayed with me since it was published. The title is especially helpful as a handle. It is a simple reference for this idea, that there are deeply confusing philosophical problems that are central to our ability to attain most of the value we care about (and that this might be a central concern when thinking about AI). It's not been very close to areas I think about a lot, so I've not tried to build on it much, and would be interested in a review from someone who thinks in more detail about these matters, but I expect they'll agree it's a very helpful post to exist.