I think a common failure mode when working on AI alignment^{[1]} is to not focus on the hard parts of the problem first. This is a problem when generating a research agenda, as well as when working on any specific research agenda. Given a research agenda, there are normally many problems that you know how to make progress on. But blindly working on what seems tractable is not a good idea.

Let's say we are working on a research agenda about solving problems A, B, and C. We know that if we find solutions to A, B, and C we will solve alignment. However, if we can't solve even one subproblem, the agenda would be doomed. If C seems like a very hard problem, that you are not sure you can solve, it would be a bad idea to flinch away from C and work on problem A instead, when A seems so much more manageable.

If solving A takes a lot of time and effort, all of that time and effort would be wasted, if you can't solve C in the end. It's especially worrisome when A has tight fightback loops, such that you constantly feel like you are making progress. Or when it is just generally fun to work on A.

Of course, it can make sense to work on A first if you expect this to help you solve C, or at least give you more information on its tractability. The general version of this is illustrated by considering that you have a large list of problems that you need to solve. In this case, focusing on problems that will provide you with information that will be helpful for solving many of the other problems can be very useful. But even then you should not lose sight of the hard problems that might block you down the road.

The takeaway is that these two things are very different:

Solving A as an instrumental subgoal in order to make progress on C, when C is a potential blocker.

Avoiding C, because it seems hard, and instead working on A because it seems tractable.

Though I expect this to be a general problem that comes up all over the place. ↩︎

Here is some obvious advice.

I think a common failure mode when working on AI alignment

^{[1]}is to not focus on the hard parts of the problem first. This is a problem when generating a research agenda, as well as when working on any specific research agenda. Given a research agenda, there are normally many problems that you know how to make progress on. But blindly working on what seems tractable is not a good idea.Let's say we are working on a research agenda about solving problems A, B, and C. We know that if we find solutions to A, B, and C we will solve alignment. However, if we can't solve even one subproblem, the agenda would be doomed. If C seems like a very hard problem, that you are not sure you can solve, it would be a bad idea to flinch away from C and work on problem A instead, when A seems so much more manageable.

If solving A takes a lot of time and effort, all of that time and effort would be wasted, if you can't solve C in the end. It's especially worrisome when A has tight fightback loops, such that you constantly feel like you are making progress. Or when it is just generally fun to work on A.

Of course, it can make sense to work on A first if you expect this to help you solve C, or at least give you more information on its tractability. The general version of this is illustrated by considering that you have a large list of problems that you need to solve. In this case, focusing on problems that will provide you with information that will be helpful for solving many of the other problems

canbe very useful. But even then you should not lose sight of the hard problems that might block you down the road.The takeaway is that these two things are very different:

Though I expect this to be a general problem that comes up all over the place. ↩︎