To find the most promising alignment research directions to pour resources into, we can go about it in three ways.
We can imagine the space of all possible research directions.
This space includes everything, even pouring resources into McDonald's HR department, but we can add constraints to focus on directions more likely to help advance alignment.
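As a rough illustration of this frame, here is a minimal sketch that treats each constraint as a predicate pruning the candidate space. The direction names and constraint flags are made up for illustration, not claims about real agendas:

```python
# Toy sketch of constraint-narrowing over research directions.
# All names and flags below are hypothetical; they only illustrate the frame.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Direction:
    name: str
    has_x_risk_story: bool           # can you tell a story about reducing x-risk?
    survives_more_constraints: bool  # placeholder for whatever constraints come next

# Each constraint is a predicate that prunes the space.
constraints: List[Callable[[Direction], bool]] = [
    lambda d: d.has_x_risk_story,
    lambda d: d.survives_more_constraints,
]

candidates = [
    Direction("McDonald's HR department", False, False),
    Direction("hypothetical direction A", True, True),
    Direction("hypothetical direction B", True, False),
]

# The narrowed space is whatever satisfies every constraint added so far.
narrowed = [d for d in candidates if all(c(d) for c in constraints)]
print([d.name for d in narrowed])  # ['hypothetical direction A']
```

Each added constraint shrinks `narrowed`; the hope is that whatever survives is disproportionately likely to actually help.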
If you can tell a story about how your proposal reduces x-risk from AI, then I am slightly more excited about it. But we can continue to add more constraints, like:
By constraining more and more, we can narrow down the space to search and (hopefully) avoid dead ends in research. This frame opens up a few questions to ask:
We can also perform this constraint-narrowing on known research agendas (or known problems). A good example is this Arbital page on boxed AI, which clearly explains the difficulty of:
Most proposals for a boxed AI are doomed to fail, but if a proposal competently accounts for (1) & (2) above (which can be considered additional constraints), then I am more excited about that research direction.
Doing a similar process for, e.g., interpretability, learning from human feedback, or agent foundations research would be very useful. An example would be:
I expect to either convince them to change their research direction, be convinced myself, or find the cruxes and make bets/predictions if applicable.
This same process can be applied to alignment-adjacent fields like bias & fairness and task specification in robotics. The result should be something like "bias & fairness research, but it must handle these criticisms/constraints," which is an easier sell to those researchers than asking them to switch to an entirely different alignment field.
This is also highly scalable and parallelizable, since people can perform this process on their own research agendas or do separate deep dives into others' research.
The earlier section was mostly about what not to work on, but it doesn't tell you what specifically to work on (ignoring established research directions). Here is a somewhat constructive algorithm:
For example, Alex Turner's power-seeking work could be broken down into:
You could break it down into different components, but this is what I found interesting. How I formalized (3) can then be mixed & matched with other alignment concepts such as mesa-optimizers, deception, & interpretability, which are research directions I approve of.
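For concreteness, here is the central definition from Turner et al.'s "Optimal Policies Tend to Seek Power" as I remember it (it may differ in details from the exact formalization of (3) above):

$$\mathrm{POWER}_{\mathcal{D}}(s, \gamma) = \frac{1-\gamma}{\gamma}\,\mathbb{E}_{R \sim \mathcal{D}}\!\left[V^{*}_{R}(s,\gamma) - R(s)\right]$$

where $\mathcal{D}$ is a distribution over reward functions, $V^{*}_{R}$ is the optimal value function under reward $R$, and $\gamma$ is the discount rate. Intuitively, a state has high POWER when it yields high optimal value on average across many possible goals, which is the kind of quantity you can then combine with questions about mesa-optimizers, deception, or interpretability.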
For overall future work, we can:
I'd greatly appreciate any comments or posts that do any of these three.