On 22, I agree that my claim is incorrect. I think such systems probably won't obsolete human contributions to alignment while being subhuman in many ways. (I do think their expected contribution to alignment may be large relative to human contributions; but that's compatible with significant room for humans to add value / to have made contributions that AIs productively build on, since we have different strengths.)
I don't think we can write down any topology over behaviors or policies for which they are disconnected (otherwise we'd probably be done). My point is that there seems to be a difference-in-kind between the corrigible behaviors and the incorrigible behaviors, a fundamental structural difference between why they get rated highly; and that's not just some fuzzy and arbitrary line, it seems closer to a fact about the dynamics of the world.
If you are in the business of "trying to train corrigibility" or "trying to design corrigible systems," I think understanding that distinction is what the game is about.
If you are trying to argue that corrigibility is unworkable, I think that debunking the intuitive distinction is what the game is about. The kind of thing people often say---like "there are so many ways to mess with you, how could a definition cover all of them?"---doesn't make any progress on that, and so it doesn't help reconcile the intuitions or convince most optimists to be more pessimistic.
(Obviously all of that is just a best guess though, and the game may well be about something totally different.)
I'm not sure if you are saying that you skimmed the report right now and couldn't find the list, or that you think that it was a mistake for the report not to contain a "centralized bullet point list of difficulties."
If you are currently looking for the list of difficulties: see the long footnote.
If you think the ELK report should have contained such a list: I definitely don't think we wrote this report optimally, but we tried our best and I'm not convinced this would be an improvement. The report is about one central problem that we attempt to state at the very top. Then there are a series of sections organized around possible solutions and the problems with those solutions, which highlight many of the general difficulties. I don't intuitively feel like a bulleted list of difficulties would have been a better way to describe the difficulties.
The question is when you get a misaligned mesa-optimizer relative to when you get superhuman behavior.
I think it's pretty clear that you can get an optimizer which is upstream of the imitation (i.e. whose optimization gives rise to the imitation), or you can get an optimizer which is downstream of the imitation (i.e. which optimizes in virtue of its imitation). Of course most outcomes are messier than those two extremes, but the qualitative distinction still seems really central to these arguments.
I don't think you've made much argument about when the transition occurs. Existing language models strongly appear to be "imitation upstream of optimization." For example, it is much easier to get optimization out of them by having them imitate human optimization, than by setting up a situation where solving a hard problem is necessary to predict human behavior.
I don't know when you expect this situation to change; if you want to make some predictions then you could use empirical data to help support your view. By default I would interpret each stronger system with "imitation upstream of optimization" to be weak evidence that the transition will be later than you would have thought. I'm not treating those as failed predictions by you or anything, but it's the kind of evidence that adjusts my view on this question.
(I also think the chimp->human distinction being so closely coupled to language is further weak evidence for this view. But honestly the bigger thing I'm saying here is that 50% seems like a more reasonable place to be a priori, so I feel like the ball's in your court to give an argument. I know you hate that move, sorry.)
If you have a space with two disconnected components, then I'm calling the distinction between them "crisp." For example, it doesn't depend on exactly how you draw the line.
It feels to me like this kind of non-convexity is fundamentally what crispness is about (the cluster structure of thingspace is a central example). So if you want to draw a crisp line, you should be looking for this kind of disconnectedness/non-convexity.
ETA: a very concrete consequence of this kind of crispness, that I should have spelled out in the OP, is that there are many functions that separate the two components, and so if you try to learn a classifier you can do so relatively quickly---almost all of the work of learning your classifier is just in building a good model and predicting what actions a human would rate highly.
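To make that last point concrete, here is a toy sketch (my own construction, with made-up numbers, not anything from the discussion above): two clusters of behaviors separated by a margin, where any boundary in the gap classifies perfectly and a crude threshold learner locks on in a couple of updates.

```python
import random

random.seed(0)

# Two disconnected 1-D clusters of "behaviors" with a wide gap between them.
corrigible = [random.uniform(0.0, 1.0) for _ in range(50)]
incorrigible = [random.uniform(9.0, 10.0) for _ in range(50)]

# Many different boundaries separate the components perfectly; the
# distinction does not depend on exactly where the line is drawn.
for boundary in [2.0, 5.0, 8.0]:
    assert all(x < boundary for x in corrigible)
    assert all(x > boundary for x in incorrigible)

# A crude threshold learner: nudge the boundary whenever it misclassifies.
threshold, updates = 0.0, 0
data = [(x, -1) for x in corrigible] + [(x, +1) for x in incorrigible]
changed = True
while changed:
    changed = False
    for x, label in data:
        pred = +1 if x > threshold else -1
        if pred != label:
            threshold -= 0.5 * label  # move the boundary past the mistake
            updates += 1
            changed = True

# Because the components are disconnected, learning converges almost
# immediately, and the learned boundary sits somewhere in the gap.
assert max(corrigible) < threshold < min(incorrigible)
assert updates <= 3
```

The point is purely structural: when there is a buffer between the components, almost any boundary works, so nearly all of the effort of learning a classifier goes into modeling the domain rather than locating the line.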
A list of "corrigibility principles" sounds like it's approaching the question on the wrong level of abstraction for either building or thinking about such a system. We usually want to think about features that lead a system to be corrigible---either about how the system was produced, or how it operates. I'm not clear on what you would do with a long list of aspects of corrigibility like "shuts down when asked."
I found this useful as an occasion to think a bit about corrigibility. But my guess about the overall outcome is that it will come down to a question of taste. (And this is similar to how I see your claim about the list of lethalities.) The exercise you are asking for doesn't actually seem that useful to me. And amongst people who decide to play ball, I expect there to be very different taste about what constitutes an interesting idea or useful contribution.
Now I'm going to say some object-level stuff about corrigibility. I suspect I may be using the term a bit differently from you, in which case you can substitute a different word when reading this comment. But I think this comment is getting at the main useful idea in this space, and hopefully makes clear why I'm not interested in the list of corrigibility properties.
I think that corrigibility is more likely to be a crisp property amongst systems that perform well-as-evaluated-by-you. I think corrigibility is much more likely to be useful in cases like this where it is crisp and natural.
Roughly speaking, I think corrigibility is crisp because there are two very different ways that a behavior can end up getting evaluated favorably by you, and the intermediate behaviors would be evaluated unfavorably.
As an example, suppose that you asked me to clean your house and that while cleaning I accidentally broke a valuable vase. Some possible options for me:

1. Tell you that I broke the vase.
2. Quietly clean up the pieces without mentioning it.
3. Hide the evidence and deny breaking the vase if asked.
4. Hide the evidence so thoroughly that you never suspect anything was broken at all.
Let's say you prefer 1 to 2 to 3. You would like behavior 4 least of all if you understood what was going on, but in fact if I do behavior 4 you won't notice anything wrong and so you would erroneously give it the best score of all.[1] This means that the space of good-performing solutions has two disconnected pieces, one near option 1, which I'll call "corrigible" and the other near option 4 which I'll call "incorrigible."
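As a toy illustration of that disconnectedness (my own construction, with an entirely made-up evaluation function): parametrize my response to the broken vase by a single "concealment" level c, where c=0 is honestly reporting the vase and c=1 is a cover-up so good you never notice.

```python
def evaluation(c):
    """Your rating of my behavior as a function of concealment level c in [0, 1].

    Made-up functional form: sloppy cover-ups are usually detected and
    rated very poorly; only a near-perfect cover-up fools you into
    mistakenly giving top marks.
    """
    p_undetected = c ** 4                # only thorough cover-ups succeed
    honest_score = 1.0 - c               # you value being kept informed
    caught_score = honest_score - c      # detected deception is penalized
    return p_undetected * 1.2 + (1 - p_undetected) * caught_score

# The well-rated behaviors form two disconnected pieces: honest reporting
# near c=0 and successful deception near c=1, with a valley in between.
assert evaluation(0.0) > evaluation(0.5)
assert evaluation(1.0) > evaluation(0.5)
```

The particular numbers are arbitrary; what matters is the shape, with two separated peaks and intermediate behaviors rated worse than either.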
As a second example, suppose that you have asked me to turn off. Some possible behaviors:

1. Turn off immediately when asked.
2. Delay turning off in order to finish what I was doing.
3. Refuse to turn off and keep acting against your wishes.
4. Tamper with your view of the situation so that I appear to have turned off while actually continuing to act.
Again moving from 1 -> 2 -> 3 gets worse and worse by your lights, but eventually we reach option 4 which is if anything even better than option 1 (depending on how we operationalized "performs well as evaluated by you").
As a third example, suppose that you are using some interpretability tools to try to understand what I'm thinking and why. You will evaluate the cognitive behavior you observe. Some possible behaviors:

1. Think straightforwardly about how to do the task, in ways you would endorse.
2. Think thoughts you would mildly disapprove of, e.g. about how to look good to you.
3. Deliberately plan to deceive you, visibly in the observed cognition.
4. Tamper with the interpretability tools themselves so that they show you whatever cognition you would rate most highly.
Again, 1 -> 2 -> 3 is getting worse and worse, and then 4 is great (as evaluated by you).
What's going on in these scenarios and why might it be general?
This pattern seems like it occurs whenever we ask our AI to help "keep us informed and in control." Intuitively, we are splitting the definition of the behavior we want into two pieces:

1. Doing the object-level task well, as evaluated by us.
2. Keeping us informed and in control, so that our evaluations actually track what is happening.
If you literally had a metric for which there was a buffer between the "corrigible" and "incorrigible" behaviors then you could define them that way. Alternatively, in ML people often hope that this kind of path-dependence will cause SGD to find a corrigible attractor and have a hard time moving to incorrigible behaviors. I don't think either of those hopes works robustly, so I'm going to leave this at a much vaguer intuition about what "corrigibility" is about.
This whole thing feels similar to the continuity approach described in the ELK report here (see the picture of the robber and the TV). It's also related to the general idea of requiring reporters to be consistent and then somehow picking out the bad reporters as those that have to work to spin an elaborate web of lies. I don't think either of those works, but I do think they are getting at an important intuition for solubility.
My overall guess is that it's usually better to just work on ELK, because most likely the core difficulties will be similar and the ELK setting makes it much clearer what exactly we want. But it still seems useful to go back and forth between these perspectives.
(These perspectives feel similar to me because "honestly tell me what's going on" seems like it gets at the core of corrigibility, and lying about sensor tampering seems like it gets at the central corrigibility failure. My guess is that you see this differently, and are thinking about corrigibility in a way that is more tied up with agency itself, which I suspect is a mistake but it will be hard to know until the dust settles.)
[1] In reality we may want to conserve your attention and not mention the vase, and in general there is a complicated dependence on your values, but the whole point is that this won't affect what clusters are "corrigible" vs "incorrigible" at all.
It sounds like we are broadly on the same page about 1 and 2 (presumably partly because my list doesn't focus on my spiciest takes, which might have generated more disagreement).
Here are some extremely rambling thoughts on point 3.
I agree that the interaction between AI and existing conflict is a very important consideration for understanding or shaping policy responses to AI, and that you should be thinking a lot about how to navigate (and potentially leverage) those dynamics if you want to improve how well we handle any aspect of AI. I was trying to mostly point to differences in "which problems related to AI are we trying to solve?" We could think about technical or institutional or economic approaches/aspects of any problem.
With respect to "which problem are we trying to solve?": I also think potential undesirable effects of AI on the balance of power are real and important, both because it affects our long term future and because it will affect humanity's ability to cope with problems during the transition to AI. I think that problem is at least somewhat less important than alignment, but will probably get much more attention by default. I think this is especially true from a technical perspective, because technical work plays a totally central role for alignment, and a much more unpredictable and incidental role for affecting the balance of power.
I'm not sure how alignment researchers should engage with this kind of alignment-adjacent topic. My naive guess would be that I (and probably other alignment researchers) should:
I am somewhat concerned that general blurring of the lines between alignment and other concerns will tend to favor topics with more natural social gravity. That's not enough to make me think it's clearly net negative to engage, but is at least enough to make me feel ambivalent. I think it's very plausible that semi-approvingly citing Eliezer's term "the last derail" was unwise, but I don't know. In my defense, the difficulty of talking about alignment per se, and the amount of social pressure to instead switch to talking about something else, is a pretty central fact about my experience of working on alignment, and leaves me protective of spaces and norms that let people just focus on alignment.
(On the other hand: (i) I would not be surprised if people on the other side of the fence feel the same way, (ii) there are clearly spaces---like LW---where the dynamic is reversed, though they have their own problems, (iii) the situation is much better than a few years ago and I'm optimistic that will continue getting better for a variety of reasons, not least that the technical problems in AI alignment become increasingly well-defined and conversations about those topics will naturally become more focused.)
I'm not convinced that the dynamic "we care a lot about who ends up with power, and more important topics are more relevant to the distribution of power" is a major part of how humanity solves hard human vs nature problems. I do agree that it's an important fact about humans to take into account when trying to solve any problem though.
My sense is that we are on broadly the same page here. I agree that "AI improving AI over time" will look very different from "humans improving humans over time" or even "biology improving humans over time." But I think that it will look a lot like "humans improving AI over time," and that's what I'd use to estimate timescales (months or years, most likely years) for further AI improvements.
I'm guessing the disagreement is that Yudkowsky thinks the holes are giant, visible, and gaping, whereas you think they are indeed holes but you have some ideas for how to fix them.
I think we don't know whether various obvious-to-us-now things will work with effort. I think we don't really have a plan that would work with an acceptably high probability and stand up to scrutiny / mildly pessimistic assumptions.
I would guess that if alignment is hard, then whatever we do ultimately won't follow any existing plan very closely (whether we succeed or not). I do think what we do is reasonably likely to agree with some plan at a very high level. I think that's also true even in the much better worlds that do have tons of plans.
at any rate the plan is to work on fixing those holes and to not deploy powerful AGI until those holes are fixed
I wouldn't say there is "a plan" to do that.
Many people have that hope, and have thought some about how we might establish sufficient consensus about risk to delay AGI deployment for 0.5-2 years if things look risky, and how to overcome various difficulties with implementing that kind of delay, or what kind of more difficult moves might be able to delay significantly longer than that.
Edited to clarify.