Working towards AI alignment is better

Johannes C. Mayer

Some people do work related to AI alignment without being motivated to solve AI alignment. E.g. people work on adversarial robustness, on understanding how to do science mechanically, or on advancing other paradigms that are more interpretable than modern ML. These people might be interested in the science, or work on it for some other personal reason. Even when they work on something that is useful for AI alignment, they will probably do a worse job than they would if they were trying to advance AI alignment.

This basic idea was mentioned by Buck in a talk.

The following is a list of reasons why somebody who tries to solve alignment directly would be better at solving alignment (though stated like this, the claim sounds obvious):

They are more likely to switch directions once they realize that they could be doing something better with their time to make progress on AI alignment.
- E.g. somebody who is interested in type theory and then learns that they can help AI alignment might be excited to help. But when there is lots of evidence that they should do something that does not involve type theory, they will keep doing things with type theory until the end, because their interest in type theory outweighs their desire to advance AI alignment.
The path that they take to solve the problem might look very different.
- It is less likely that they take unpromising but interesting sidetracks.
- Simplifications that they make to the problem decrease the value of a solution less, in expectation.
- In general, if there are multiple ways to solve the problem, the solution we end up with will likely be more relevant for alignment.
They can employ the full power of their consequentialist reasoning and be agentic about what to do, without starting to goodhart.
- E.g. if you just let them do whatever, they are likely to discover things that are useful that you did not think of before. By contrast, if somebody's main objective is not to solve AI alignment, they will likely follow whatever looks best to their real motivation, as long as they can find some plausible explanation for why it is useful for AI alignment, so that they have an excuse (in the case where they are paid to work on this to advance AI alignment).

There are probably many more points I have not thought of. How much you want to solve alignment compared to other things is a spectrum. What you care about might naturally drift. When you work on something for a long time, you get attached to your work. That's something to keep in mind.

It's interesting to think about the difference between trying to solve alignment and just doing related work. It can help us notice when we fall into this trap ourselves. Also, getting clear on this might help in doing the good things (e.g. the things in the list above) even more.

AI ALIGNMENT FORUM