So I definitely think that's something weirdly unspoken about the argument; I would characterize it as Eliezer saying "suppose I'm right and they're wrong; all this requires is things to be harder than people think, which is usual. Suppose instead that I'm wrong and they're right; this requires things to be easier than people think, which is unusual." But the equation of "people" and "Eliezer" is sort of strange; as Quintin notes, it isn't that unusual for outside observers to overestimate difficulty, and so I wish he had centrally addressed the reference class tennis game; is the expertise "getting AI systems to be capable" or "getting AI systems to do what you want"?
FWIW, I thought the bit about manifolds in The difficulty of alignment was the strongest foot forward, because it paints a different detailed picture than your description that it's responding to.
That said, I don't think Quintin's picture obviously disagrees with yours (as discussed in my response over here), and I think you'd find it disappointing for him to call your description extremely misleading while not seeming to correctly identify the argument structure and check whether there's a related one that goes through on his model.
During this process, I don’t think it’s particularly unusual for the person to notice a technical problem but overlook a clever way to solve that problem.
I think this isn't the claim; I think the claim is that it would be particularly unusual for someone to overlook that they're accidentally solving a technical problem. (It would be surprising for Edison to not be thinking hard about what filament to use and pick tungsten; in actual history, it took decades for that change to be made.)
BTW I do agree with you that Eliezer’s interview response seems to suggest that he thinks aligning an AGI to “basic notions of morality” is harder and aligning an AGI to “strawberry problem” is easier. If that’s what he thinks, it’s at least not obvious to me.
My sense (which I expect Eliezer would agree with) is that it's relatively easy to get an AI system to imitate the true underlying 'basic notions of morality', to the extent humans agree on that, but that this doesn't protect you at all as soon as you want to start making large changes, or as soon as you start trying to replace specialist sectors of the economy. (A lot of ethics for doctors has to do with the challenges of simultaneously being a doctor and a human; those ethics will not necessarily be relevant for docbots, and the question of what they should be instead is potentially hard to figure out.)
So if you're mostly interested in getting out of the acute risk period, you probably need to aim for a harder target.
seem very implausible when considered in the context of the human learning process (could a human's visual cortex become "deceptively aligned" to the objective of modeling their visual field?).
I think it would probably be strange for the visual cortex to do this. But I think it's not that uncommon for other parts of the brain to do this: the higher-level, more abstract / "psychological" parts that have a sense of how things will affect their relevance to future decision-making. I think there are lots of self-perpetuating narratives that it might be fair to call 'deceptively aligned' when they're maladaptive. The idea of metacognitive blindspots also seems related.
John Wentworth describes the possibility of "optimization demons", self-reinforcing patterns that exploit flaws in an imperfect search process to perpetuate themselves and hijack the search for their own purposes.
But no one knows exactly how much of an issue this is for deep learning, which is famous for its ability to evade local minima when run with many parameters.
Also relevant are "Are minimal circuits daemon-free?" and "Are minimal circuits deceptive?". I agree no one knows how much of an issue this will be for deep learning.
Additionally, I think that, if deep learning models develop such phenomena, then the brain likely does so as well.
I think the brain obviously has such phenomena, and societies made up of humans also obviously have such phenomena. I think it is probably not adaptive (optimization demons are more like 'cognitive cancer' than 'part of how values form', I think, but in part that's because the term comes with the disapproval built in).
I think the bolded text is about Yudkowsky himself being wrong.
That is also how I interpreted it.
If you have a bunch of specific arguments and sources of evidence that you think all point towards a particular conclusion X, then discovering that you're wrong about something should, in expectation, reduce your confidence in X.
I think Yudkowsky is making a different statement. I agree it would be bizarre for him to be saying "if I were wrong, it would only mean I should have been more confident!"
Yudkowsky is not the aerospace engineer building the rocket who's saying "the rocket will work because of reasons A, B, C, etc".
I think he is (inside of the example). He's saying "suppose an engineer is wrong about how their design works. Is it more likely that the true design performs better along multiple important criteria than expectation, or that the design performs worse (or fails to function at all)?"
Note that 'expectation' is referring to the confidence level inside an argument, but arguments aren't Bayesians; it's the outside agent that shouldn't be expected to predictably update. Another way to put this: does the engineer expect to be disappointed, excited, or neutral if the design doesn't work as planned? Typically, disappointed, implying the plan is overly optimistic compared to reality.
If this weren't true--if engineers were calibrated or pessimistic--then I think Yudkowsky would be wrong here (and also probably have a different argument to begin with).
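To make the "outside agent shouldn't predictably update" point concrete, here's a minimal sketch of conservation of expected evidence for a binary hypothesis. The function names and the numbers are mine, made up for illustration; nothing here comes from Yudkowsky's or Quintin's posts.

```python
def posterior(prior, p_e_given_h, p_e_given_not_h):
    """Bayes' rule: P(H | E) for a binary hypothesis H and observed evidence E."""
    p_e = prior * p_e_given_h + (1 - prior) * p_e_given_not_h
    return prior * p_e_given_h / p_e


def expected_posterior(prior, p_e_given_h, p_e_given_not_h):
    """Average the posterior over both outcomes (E and not-E), weighted by
    how likely the agent itself thinks each outcome is."""
    p_e = prior * p_e_given_h + (1 - prior) * p_e_given_not_h
    post_if_e = posterior(prior, p_e_given_h, p_e_given_not_h)
    post_if_not_e = posterior(prior, 1 - p_e_given_h, 1 - p_e_given_not_h)
    return p_e * post_if_e + (1 - p_e) * post_if_not_e


# Hypothetical numbers: H = "the design works", E = "the test passes".
prior = 0.7
avg = expected_posterior(prior, p_e_given_h=0.9, p_e_given_not_h=0.2)
print(abs(avg - prior) < 1e-12)  # the expected posterior equals the prior
```

A calibrated agent's expected update is zero, so it can't anticipate being disappointed on net. The engineer who does anticipate disappointment is thereby revealing that the stated confidence inside the argument was too high.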
Given the greater evidence available for general ML research, being well calibrated about the difficulty of general ML research is the first step to being well calibrated about the difficulty of ML alignment research.
I think I agree with this point but want to explicitly note the switch from the phrase 'AI alignment research' to 'ML alignment research'; my model of Eliezer thinks the second is mostly a distraction from the former, and if you think they're the same or interchangeable that seems like a disagreement.
[For example, I think ML alignment research includes stuff like "will our learned function be robust to distributional shift in the inputs?" and "does our model discriminate against protected classes?" whereas AI alignment research includes stuff like "will our system be robust to changes in the number of inputs?" and "is our model deceiving us about its level of understanding?". They're related in some ways, but pretty deeply distinct.]
If there are experiences through which the system will change itself that don't lead to less of the initial good values, then yeah, for an approximate definition of safety. You're resting everything on the continued strength of this model as capabilities increase, and so if it fails before you top out the scaling I think you probably lose.
FWIW I don't really see your description as, like, a specific alignment strategy so much as the strategy of "have an alignment strategy at all". The meat is all in 1) how you identify the core of human values and 2) how you identify which experiences will change the system to have less of the initial good values, but, like, figuring out the two of those would actually solve the problem!