Summary: I’m going to give a $10k prize to the best evidence that my preferred approach to AI safety is doomed. Submit by commenting on this post with a link by April 20.
I have a particular vision for how AI might be aligned with human interests, reflected in posts at ai-alignment.com and centered on iterated amplification.
This vision has a huge number of possible problems and missing pieces; it’s not clear whether these can be resolved. Many people endorse this or a similar vision as their current favored approach to alignment, so it would be extremely valuable to learn about dealbreakers as early as possible (whether to adjust the vision or abandon it).
Here’s the plan:
If you think that some other use of this money or some other kind of research would be better for AI alignment, I encourage you to apply for funding to do that (or just to say so in the comments).
This prize is separate from and unrelated to the broader AI alignment prize. (Reminder: the next round of that prize closes March 31. Feel free to submit something to both.)
This contest is not intended to be “fair”---the ideas I’m interested in have not been articulated clearly, so even if they are totally wrong-headed it may not be easy to explain why. The point of the exercise is not to argue that my approach is promising on the grounds that no one can prove it’s doomed. The point is just to reach a slightly better understanding of the challenges.
Edited to add the results:
Thanks to everyone who submitted a criticism! Overall I found this process useful for clarifying my own thinking (and highlighting places where I could make it easier to engage with my research by communicating more clearly).
I’m most excited about particularly thorough criticism that either makes tight arguments or “plays both sides”---points out a problem, explores plausible responses to the problem, and shows that natural attempts to fix the problem systematically fail.
If I thought I had a solution to the alignment problem, I’d be interested in highlighting any possible problem with my proposal. But that’s not the situation yet; I’m trying to explore an approach to alignment, and I’m looking for arguments that this approach will run into insuperable obstacles. I’m already aware that there are plenty of possible problems. So a convincing argument needs to establish a universal quantifier over potential solutions to a possible problem: not just that the problem exists, but that no natural way of addressing it can work.
On the other hand, I’m hoping that we'll solve alignment in a way that knowably works under extremely pessimistic assumptions, so I’m fine with arguments that make weird assumptions or consider weird situations / adversaries.
Examples of interlocking obstacles I think might totally kill my approach:
I value objections but probably won't have time to engage significantly with most of them. That said: (a) I’ll be able to engage in a limited way, and will engage with objections that significantly shift my view, (b) thorough objections can produce a lot of value even if no proponent publicly engages with them, since they can be convincing on their own, (c) in the medium term I’m optimistic about starting a broader discussion about iterated amplification which involves proponents other than me.
I think our long-term goal should be to find, for each powerful AI technique, an analog of that technique that is aligned and works nearly as well. My current work is trying to find analogs of model-free RL or AlphaZero-style model-based RL. I think that these are the most likely forms for powerful AI systems in the short term, that they are particularly hard cases for alignment, and that they are likely to turn up alignment techniques that are very generally applicable. So for now I’m not trying to be competitive with other kinds of AI systems.