We might be able to train AI alignment assistants that massively accelerate and improve the alignment research that gets done. These assistants need not have (strongly) superhuman capabilities or be highly agentic; they just need to be capable and aligned enough to allow us to offload most work on alignment, safety, and related problems to them.

This seems to be a key part of OpenAI's alignment strategy. The best explanation of this strategy that I've seen is perhaps "A minimal viable product for alignment", which also has an excellent discussion in its comment section.

If this strategy is promising, it likely recommends fairly different prioritisation from what the alignment community is currently doing (see, e.g., Beth Barnes's ideas here, or my upcoming post on "non-scalable oversight", i.e. pragmatic improvements to oversight that would help with training alignment assistants but which cannot directly scale to oversight of superhuman systems).

I haven't seen any deep treatment of the viability of this strategy and its implications. I think such an analysis would be pretty useful.

I could potentially provide funding for such an analysis or help with obtaining funding.
