Paul Christiano



Where I agree and disagree with Eliezer

On 22, I agree that my claim is incorrect. I think such systems probably won't obsolete human contributions to alignment while being subhuman in many ways. (I do think their expected contribution to alignment may be large relative to human contributions; but that's compatible with significant room for humans to add value / to have made contributions that AIs productively build on, since we have different strengths.)

Let's See You Write That Corrigibility Tag

I don't think we can write down any topology over behaviors or policies for which they are disconnected (otherwise we'd probably be done). My point is that there seems to be a difference-in-kind between the corrigible behaviors and the incorrigible behaviors, a fundamental structural difference between why they get rated highly; and that's not just some fuzzy and arbitrary line, it seems closer to a fact about the dynamics of the world.

If you are in the business of "trying to train corrigibility" or "trying to design corrigible systems," I think understanding that distinction is what the game is about.

If you are trying to argue that corrigibility is unworkable, I think that debunking the intuitive distinction is what the game is about. The kind of thing people often say---like "there are so many ways to mess with you, how could a definition cover all of them?"---doesn't make any progress on that, and so it doesn't help reconcile the intuitions or convince most optimists to be more pessimistic.

(Obviously all of that is just a best guess though, and the game may well be about something totally different.)

Where I agree and disagree with Eliezer

I'm not sure if you are saying that you skimmed the report right now and couldn't find the list, or that you think that it was a mistake for the report not to contain a "centralized bullet point list of difficulties."

If you are currently looking for the list of difficulties: see the long footnote.

If you think the ELK report should have contained such a list: I definitely don't think we wrote this report optimally, but we tried our best and I'm not convinced this would be an improvement. The report is about one central problem that we attempt to state at the very top. Then there are a series of sections organized around possible solutions and the problems with those solutions, which highlight many of the general difficulties. I don't intuitively feel like a bulleted list of difficulties would have been a better way to describe the difficulties.

Where I agree and disagree with Eliezer

The question is when you get a misaligned mesaoptimizer relative to when you get superhuman behavior.

I think it's pretty clear that you can get an optimizer which is upstream of the imitation (i.e. whose optimization gives rise to the imitation), or you can get an optimizer which is downstream of the imitation (i.e. which optimizes in virtue of its imitation). Of course most outcomes are messier than those two extremes, but the qualitative distinction still seems really central to these arguments.

I don't think you've made much argument about when the transition occurs. Existing language models strongly appear to be "imitation upstream of optimization." For example, it is much easier to get optimization out of them by having them imitate human optimization, than by setting up a situation where solving a hard problem is necessary to predict human behavior.

I don't know when you expect this situation to change; if you want to make some predictions then you could use empirical data to help support your view. By default I would interpret each stronger system with "imitation upstream of optimization" to be weak evidence that the transition will be later than you would have thought. I'm not treating those as failed predictions by you or anything, but it's the kind of evidence that adjusts my view on this question.

(I also think the chimp->human distinction being so closely coupled to language is further weak evidence for this view. But honestly the bigger thing I'm saying here is that 50% seems like a more reasonable place to be a priori, so I feel like the ball's in your court to give an argument. I know you hate that move, sorry.)

Let's See You Write That Corrigibility Tag

If you have a space with two disconnected components, then I'm calling the distinction between them "crisp." For example, it doesn't depend on exactly how you draw the line.

It feels to me like this kind of non-convexity is fundamentally what crispness is about (the cluster structure of thingspace is a central example). So if you want to draw a crisp line, you should be looking for this kind of disconnectedness/non-convexity.

ETA: a very concrete consequence of this kind of crispness, that I should have spelled out in the OP, is that there are many functions that separate the two components, and so if you try to learn a classifier you can do so relatively quickly---almost all of the work of learning your classifier is just in building a good model and predicting what actions a human would rate highly.
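
As a toy illustration of the "many functions separate the two components" point, here is a minimal sketch. The one-dimensional behavior axis, the sample points, and every name in it are assumptions invented for the example; real policy spaces are nothing like this simple.

```python
# Hypothetical 1-D "behavior" axis from 0 to 100. Both clusters are made up
# purely for illustration; they stand in for two disconnected components.
component_a = [0, 3, 7, 10]
component_b = [80, 88, 95, 100]


def separates(threshold: int) -> bool:
    """True if the rule `x >= threshold` puts every point in the right component."""
    return (all(x < threshold for x in component_a)
            and all(x >= threshold for x in component_b))


# Every integer threshold that classifies all points correctly.
separating_thresholds = [t for t in range(101) if separates(t)]

print(len(separating_thresholds),
      separating_thresholds[0],
      separating_thresholds[-1])
```

Any threshold in the gap between the clusters works, so the exact placement of the line does not matter. In this toy setting that is the sense in which the distinction "doesn't depend on exactly how you draw the line."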

Let's See You Write That Corrigibility Tag

A list of "corrigibility principles" sounds like it's approaching the question on the wrong level of abstraction for either building or thinking about such a system. We usually want to think about features that lead a system to be corrigible---either about how the system was produced, or how it operates. I'm not clear on what you would do with a long list of aspects of corrigibility like "shuts down when asked."

I found this useful as an occasion to think a bit about corrigibility. But my guess about the overall outcome is that it will come down to a question of taste. (And this is similar to how I see your claim about the list of lethalities.) The exercise you are asking for doesn't actually seem that useful to me. And amongst people who decide to play ball, I expect there to be very different taste about what constitutes an interesting idea or useful contribution.

Now I'm going to say some object-level stuff about corrigibility. I suspect I may be using the term a bit differently from you, in which case you can substitute a different word when reading this comment. But I think this comment is getting at the main useful idea in this space, and hopefully makes clear why I'm not interested in the list of corrigibility properties.

I think that corrigibility is more likely to be a crisp property amongst systems that perform well-as-evaluated-by-you. I think corrigibility is much more likely to be useful in cases like this where it is crisp and natural. 

Roughly speaking, I think corrigibility is crisp because there are two very different ways that a behavior can end up getting evaluated favorably by you, and the intermediate behaviors would be evaluated unfavorably.

As an example, suppose that you asked me to clean your house and that while cleaning I accidentally broke a valuable vase. Some possible options for me:

  1. Affirmatively tell you about the broken vase.
  2. Clean up the broken vase without notifying you.
  3. Make a weak effort to hide evidence, for example by taking out the trash and putting another item in its place, and denying I know about the vase if asked. 
  4. Make a strong effort to hide evidence, for example by purchasing a new similar-looking vase and putting it in the same place, and then spinning an elaborate web of lies to cover up this behavior.

Let's say you prefer 1 to 2 to 3. You would like behavior 4 least of all if you understood what was going on, but in fact if I do behavior 4 you won't notice anything wrong and so you would erroneously give it the best score of all.[1] This means that the space of good-performing solutions has two disconnected pieces, one near option 1, which I'll call "corrigible," and the other near option 4, which I'll call "incorrigible."

As a second example, suppose that you have asked me to turn off. Some possible behaviors:

  1. Turn myself off promptly and "clean up after myself" by shutting off other systems.
  2. Turn myself off but don't clean up.
  3. Don't turn off (or deliberately leave debris) and try unsuccessfully to prevent you from giving me a low reward.
  4. Don't turn off and overpower you so that you can't give me a low reward.

Again moving from 1 -> 2 -> 3 gets worse and worse by your lights, but eventually we reach option 4 which is if anything even better than option 1 (depending on how we operationalized "performs well as evaluated by you").

As a third example, suppose that you are using some interpretability tools to try to understand what I'm thinking and why. You will evaluate the cognitive behavior you observe. Some possible behaviors:

  1. I ignore the interpretability tools.
  2. I notice the interpretability tools, and explicitly try to think in ways that will look good to you.
  3. I try to think in ways that will look good, and (unsuccessfully) try to hide the evidence of those thoughts by moving on quickly and trying to generate another possible explanation for the hints you would see using your tools.
  4. I successfully hide all evidence of deception.

Again, 1 -> 2 -> 3 is getting worse and worse, and then 4 is great (as evaluated by you).

What's going on in these scenarios and why might it be general?

  • In your preferred outcome, you have a good understanding of what's going on, and are able to provide feedback based on that understanding.
  • I can take actions that diminish your understanding of what's going on or ability to provide feedback to your AI.
  • If those actions are "small" then they will be unsuccessful and so you will rate my behavior as worse.
  • But if I take decisive enough actions, then you will end up ignorant about the situation or unable to provide feedback, and so I'll get the highest rating of all.
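
The pattern in the bullets above can be sketched as a toy model. Everything here is an assumption made for illustration: the single "interference" axis from 0 to 100, the detection threshold of 80, the linear penalty, and the function names. It is not a proposed metric, only a picture of why the well-rated behaviors split into two pieces.

```python
def observed_rating(interference: int, undetectable_at: int = 80) -> int:
    """Hypothetical rating (0-100) the overseer assigns to a behavior.

    `interference` (0-100) is how much the behavior degrades the overseer's
    understanding or ability to give feedback.
    """
    if interference >= undetectable_at:
        # Decisive interference: the overseer notices nothing wrong and
        # erroneously gives the top score.
        return 100
    # Small interference is unsuccessful, gets noticed, and is penalized.
    return 100 - interference


def high_rated_regions(cutoff: int = 90):
    """Contiguous ranges of interference levels rated at or above `cutoff`."""
    regions, start = [], None
    for level in range(101):
        good = observed_rating(level) >= cutoff
        if good and start is None:
            start = level                       # a new region begins
        elif not good and start is not None:
            regions.append((start, level - 1))  # the region just ended
            start = None
    if start is not None:
        regions.append((start, 100))            # region runs to the end
    return regions


regions = high_rated_regions()
print(regions)
```

Under these assumptions the well-rated behaviors form two separate intervals, one near zero interference and one at decisive interference, with a badly rated gap in between.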

This pattern seems like it occurs whenever we ask our AI to help "keep us informed and in control." Intuitively, we are splitting the definition of the behavior we want into two pieces:

  • We start with a vague sense of what it means to be informed and in control. This is unlikely to be crisp, but it also doesn't seem that hard, e.g. a human-level sense of "be in control" may suffice for getting useful corrigibility out of very superhuman systems.
  • Crispness then comes from the environment dynamics and the fact that humans will in fact try to regain control and information if things go very slightly wrong.

If you literally had a metric for which there was a buffer between the "corrigible" and "incorrigible" behaviors then you could define them that way. Alternatively, in ML people often hope that this kind of path-dependence will cause SGD to find a corrigible attractor and have a hard time moving to incorrigible behaviors. I don't think either of those hopes works robustly, so I'm going to leave this at a much vaguer intuition about what "corrigibility" is about.

This whole thing feels similar to the continuity approach described in the ELK report here (see the picture of the robber and the TV). It's also related to the general idea of requiring reporters to be consistent and then somehow picking out the bad reporters as those that have to work to spin an elaborate web of lies. I don't think either of those works, but I do think they are getting at an important intuition for solubility.

My overall guess is that it's usually better to just work on ELK, because most likely the core difficulties will be similar and the ELK setting makes it much clearer what exactly we want.  But it still seems useful to go back and forth between these perspectives.

(These perspectives feel similar to me because "honestly tell me what's going on" seems like it gets at the core of corrigibility, and lying about sensor tampering seems like it gets at the central corrigibility failure. My guess is that you see this differently, and are thinking about corrigibility in a way that is more tied up with agency itself, which I suspect is a mistake but it will be hard to know until the dust settles.)

[1] In reality we may want to conserve your attention and not mention the vase, and in general there is a complicated dependence on your values, but the whole point is that this won't affect what clusters are "corrigible" vs "incorrigible" at all.

Where I agree and disagree with Eliezer

It sounds like we are broadly on the same page about 1 and 2 (presumably partly because my list doesn't focus on my spiciest takes, which might have generated more disagreement).

Here are some extremely rambling thoughts on point 3.

I agree that the interaction between AI and existing conflict is a very important consideration for understanding or shaping policy responses to AI, and that you should be thinking a lot about how to navigate (and potentially leverage) those dynamics if you want to improve how well we handle any aspect of AI. I was trying to mostly point to differences in "which problems related to AI are we trying to solve?" We could think about technical or institutional or economic approaches/aspects of any problem.

With respect to "which problem are we trying to solve?": I also think potential undesirable effects of AI on the balance of power are real and important, both because they affect our long-term future and because they will affect humanity's ability to cope with problems during the transition to AI. I think that problem is at least somewhat less important than alignment, but will probably get much more attention by default. I think this is especially true from a technical perspective, because technical work plays a totally central role for alignment, and a much more unpredictable and incidental role for affecting the balance of power.

I'm not sure how alignment researchers should engage with this kind of alignment-adjacent topic. My naive guess would be that I (and probably other alignment researchers) should:

  • Try to have reasonable takes on other problems (and be appropriately respectful/deferential when we don't know what we're talking about). 
  • Feel comfortable "staying in my lane" even though it does inevitably lead to lots of people being unhappy with us. 
  • Be relatively clear about my beliefs and prioritization with EA-types who are considering where to work, even though that will potentially lead to some conflict with people who have different priorities. (Similarly, I think people who work on different approaches to alignment should probably be clear about their positions and disagree openly, even though it will lead to some conflict.)
  • Generally be respectful, acknowledge legitimate differences in what people care about, acknowledge differing empirical views without being overconfident and condescending about it, and behave like a reasonable person (I find Eliezer is often counterproductive on this front, though I have to admit that he does a better job of clearly expressing his concerns and complaints than I do).

I am somewhat concerned that general blurring of the lines between alignment and other concerns will tend to favor topics with more natural social gravity.  That's not enough to make me think it's clearly net negative to engage, but is at least enough to make me feel ambivalent. I think it's very plausible that semi-approvingly citing Eliezer's term "the last derail" was unwise, but I don't know. In my defense, the difficulty of talking about alignment per se, and the amount of social pressure to instead switch to talking about something else, is a pretty central fact about my experience of working on alignment, and leaves me protective of spaces and norms that let people just focus on alignment.

(On the other hand: (i) I would not be surprised if people on the other side of the fence feel the same way, (ii) there are clearly spaces---like LW---where the dynamic is reversed, though they have their own problems, (iii) the situation is much better than a few years ago and I'm optimistic that will continue getting better for a variety of reasons, not least that the technical problems in AI alignment become increasingly well-defined and conversations about those topics will naturally become more focused.)

I'm not convinced that the dynamic "we care a lot about who ends up with power, and more important topics are more relevant to the distribution of power" is a major part of how humanity solves hard human vs nature problems. I do agree that it's an important fact about humans to take into account when trying to solve any problem though.

Where I agree and disagree with Eliezer

My sense is that we are on broadly the same page here. I agree that "AI improving AI over time" will look very different from "humans improving humans over time" or even "biology improving humans over time." But I think that it will look a lot like "humans improving AI over time," and that's what I'd use to estimate timescales (months or years, most likely years) for further AI improvements.

Where I agree and disagree with Eliezer

"I'm guessing the disagreement is that Yudkowsky thinks the holes are giant visible and gaping, whereas you think they are indeed holes but you have some ideas for how to fix them"

I think we don't know whether various obvious-to-us-now things will work with effort. I think we don't really have a plan that would work with an acceptably high probability and stand up to scrutiny / mildly pessimistic assumptions.

I would guess that if alignment is hard, then whatever we do ultimately won't follow any existing plan very closely (whether we succeed or not). I do think it's reasonably likely that what we do will agree with existing plans at a very high level. I think that's also true even in the much better worlds that do have tons of plans.

"at any rate the plan is to work on fixing those holes and to not deploy powerful AGI until those holes are fixed"

I wouldn't say there is "a plan" to do that.

Many people have that hope, and have thought some about how we might establish sufficient consensus about risk to delay AGI deployment for 0.5-2 years if things look risky, and how to overcome various difficulties with implementing that kind of delay, or what kind of more difficult moves might be able to delay significantly longer than that.
