Question: "effective arguments for the importance of AI safety" - is this about arguments for the importance of just technical AI safety, or AI safety more generally, including governance and similar things?
Think of it as a "practicing a dark art of rationality" post, and I think it would seem less off-putting.
Please feel free to repost this elsewhere, and/or tell people about it. And if anyone is interested in this type of job but is currently still in school, or for other reasons is unable to work full time at present, we encourage them to apply and note their circumstances, as we may be able to find other ways to support their work, or at least collaborate and provide mentorship.
Though you make a reasonable case, I'm not sure I agree that discontinuity and prosaic alignment are compatible. I do think, however, that slower governance approaches are compatible with discontinuity, if it is far enough away.
In the post, I wanted to distinguish between two things you're now combining: how hard alignment is, and how long we have. Yes, combining these gives the question of how hard it will be to solve alignment in the time frame we have until we need to solve it, but they are conceptually distinct.

Neither of these directly relates to takeoff speed, which in the current framing is something like the time from when we have near-human systems until they hit a capability discontinuity. You said, "First off, takeoff speed and timing are correlated: if you think HLMI is sooner, you must think progress towards HLMI will be faster, which implies takeoff will also be faster." This last implication might or might not be true. I agree that there are many worlds in which they are correlated, but there are plausible counterexamples. For instance, we may continue with fast progress, get to HLMI and a utopian freedom from almost all work, but then hit a brick wall on scaling deep learning and have another AI winter until we figure out how to build an actual AGI which can then scale to ASI - and that new approach could lead to either a slow or a fast takeoff. Or progress may slow to a crawl due to the costs of scaling inputs and compute until we get to AGI, at which point a self-improvement takeoff could be near-immediate, or could continue glacially.

And I agree with your claims about why Eliezer is pessimistic about prosaic alignment - but that's not why he's pessimistic about governance, which is a mostly unrelated pessimism.
Relevant to this agenda are the failure modes I discussed in my multi-agent failures paper, which seem worth looking at in this context.
I suspect that many of the problems with aggregation both apply to actual individual human values once extrapolated, and generalize to AIs with closely related values, but I'd need to lay out the case for that more clearly. (I did discuss the difficulty of cooperation even given compatible goals a bit in this paper, but it's nowhere near complete in addressing this issue.)
This seems fragile in ways that make me less optimistic about the approach overall. We have strong reasons to think that value aggregation is intractable, and (by analogy) in some ways the problem of coherence in CEV is the tricky part. That is, the problem of making sure that we're not Dutch-bookable is, IIRC, NP-complete, and even worse, the problem of aggregating preferences faces several impossibility results.

Edit: To clarify, I'm excited about the approach overall, and think it's likely to be valuable, but this part seems like a big problem.
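As a toy illustration of the aggregation problem mentioned above (a sketch I'm adding, not part of the original comment): even simple pairwise majority voting over three agents' preferences can produce a Condorcet cycle, where every option loses to some other option, so no coherent aggregate ranking exists.

```python
# Three agents with the classic Condorcet-paradox preferences
# over options A, B, C. Each list is a strict ranking, best first.
rankings = [
    ["A", "B", "C"],  # agent 1: A > B > C
    ["B", "C", "A"],  # agent 2: B > C > A
    ["C", "A", "B"],  # agent 3: C > A > B
]

def majority_prefers(x, y, rankings):
    """True if a strict majority of agents rank x above y."""
    votes = sum(1 for r in rankings if r.index(x) < r.index(y))
    return votes > len(rankings) / 2

# Pairwise majorities form a cycle: A beats B, B beats C, C beats A,
# so no option is a consistent "aggregate best" choice.
for x, y in [("A", "B"), ("B", "C"), ("C", "A")]:
    print(x, "beats", y, ":", majority_prefers(x, y, rankings))
```

This doesn't prove intractability or the full impossibility results, of course; it just shows the simplest way coherent aggregation can fail even with only three agents and three options.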
This post is a huge contribution, giving a simpler and shorter explanation of a critical topic with far clearer context, and it has been useful to point people to as an alternative to the main sequence. I wouldn't promote it as more important than the actual series, but I would suggest it as a strong alternative to including the full sequence in the 2020 Review. (Especially because I suspect that those who are very interested are likely to have read the full sequence, and most others will not even if it is included.)
Yes on point number 1, and partly on point number 2.

If humans don't have incredibly complete models for how to achieve their goals, but know they want a glass of water, telling the AI to put a cup of H2O in front of them can create weird mistakes. This can even happen because of causal connections the humans are unaware of. The AI might have better causal models than the humans, but still cause problems for other reasons. In this case, a human might not know the difference between normal water and heavy water, but the AI might decide that since there are two forms, it should have them present in equal amounts, which would be disastrous for reasons entirely beyond the understanding of the human who asked for the glass of water. The human needed to specify the goal differently, and was entirely unaware of what they did wrong - and in this case it will be months before the impacts of the water being weirdly different than expected show up, so human-in-the-loop RL or other methods might not catch it.