Epistemic status: half-baked
Arguably, an aligned AI should be aligned to the user's prior as well as to the user's utility function. Hence, any value-learning protocol should also do prior-learning. The problem is that any learning process requires, explicitly or implicitly, its own prior. But shouldn't that prior also be the user's? Does this lead to an infinite regress? Maybe not: here is a way out that seems reasonably elegant.
For now, we will work in the Bayesian framework. Let
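To make the regress concrete, here is one minimal way the setup might be formalized (the symbols $u$, $\zeta$, $\xi$ and the likelihood model are illustrative assumptions, not fixed by the text above). Let $u$ be the user's utility function and $\zeta$ the user's prior over environments. A value-learning agent maintains its own prior $\xi$ over pairs $(u, \zeta)$ and, after observing an interaction history $h$, forms the posterior

$$\xi(u, \zeta \mid h) \;\propto\; \xi(u, \zeta)\,\Pr[h \mid u, \zeta].$$

The regress appears because $\xi$ is itself a prior: if alignment demands that $\xi$ match the user's beliefs about $(u, \zeta)$, we seem to need a prior over $\xi$, and so on.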
So my current impression is basically that you're optimistic about something similar to this:
Perhaps the most elegant thing would be to have an account of some particular kind of reasoning or learning that could be used to construct/refine priors, where we’d be able to show that it doesn’t run into the same issues as the ones we run into when we modify our beliefs in response to observations like this or in response to proofs like this.
And that your argument here is an argument for why it won't be possible to make double-update arguments about this more n...
The strategy-stealing assumption posits that for any strategy an unaligned AI could use to influence the long-term future, there is an analogous strategy that humans could use to capture similar influence. Paul Christiano explores why this assumption might be true, and eleven ways it could potentially fail.
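One hedged way to state the assumption semi-formally (the resource-share framing below is an illustrative gloss, not Christiano's exact formulation): if a coalition starting with a fraction $\alpha$ of the world's resources can follow a strategy $s$ yielding expected long-term influence $f_s(\alpha)$, then for any $s$ available to unaligned AIs there is an analogous $s'$ available to humans (with their aligned AIs) achieving roughly $f_s(\alpha)$ as well, so that long-term influence stays roughly proportional to initial resources:

$$\text{influence}(\text{aligned coalition}) \approx \alpha.$$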
Automating alignment may be harder than automating capabilities, because of ‘unsafe to verify’ tasks
This was a note I wrote for my colleagues on UK AISI's Alignment Team. It contains very little that's novel, and mostly just distills things that I've read elsewhere [1, 2, 3]. Still, I wanted to post it so that I can point people to it, as I've not seen all of these points in one place.
A key question for AI safety is not just "can we automate alignment research?" but whether we can automate alignment research as fast as capabilities research [1, 2]. I ...
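To see why the relative rate matters, a toy calculation (the numbers are purely illustrative): suppose automation multiplies the pace of capabilities research by $s_C = 10$, but, because some alignment tasks are unsafe to verify and their outputs must still be checked by trusted humans, multiplies the pace of alignment research by only $s_A = 3$. Then the ratio of alignment progress to capabilities progress falls to $s_A / s_C = 0.3$ of its pre-automation value; alignment loses ground even while speeding up in absolute terms.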