One major question that heavily influences the choice of alignment research directions is the degree to which incremental improvements are necessary for major paradigm shifts. Because alignment is largely preparadigmatic, merely incremental improvements may not be enough: we may need a paradigm shift before we can make substantial progress towards aligning superhuman AI systems. The answer to this question determines whether the best approach to alignment is to choose metrics and try to make incremental progress on alignment research questions, to mostly fund long shots, or something else entirely. It also informs what policy we should adopt towards capabilities acceleration as an externality of alignment work: the more that incremental improvements in capabilities lead to paradigm shifts in capabilities, the more we should worry about incremental capabilities improvements as a byproduct.

Some possible ways-the-world-could-be include:

  • Incremental improvements have negligible impact on when paradigm shifts happen and could be eliminated entirely without delaying them. All or the vast majority of incremental work is visible from the start as low-risk, low-reward, and work that could cause a paradigm shift is visible from the start as high-risk, high-reward.
  • Incremental improvements draw attention to the field and thus increase funding for the field as a whole, which proportionally increases the absolute number of people working on potentially paradigm-shifting directions. However, directly funding those working on potential paradigm shifts would yield the same paradigm shifts at the same time.
  • Incremental improvements are necessary to convince risk-averse funding sources to continue funding something, since putting money into something for years with no visible output is unpopular with many funders. This forces researchers to divert some fraction of their time to funder-legible incremental improvements.
  • Most paradigm shifts arise from attempts to make incremental improvements that accidentally uncover something deeper in the process. It is difficult to tell before embarking on a project whether it will only yield an incremental improvement, no improvement at all, or a paradigm shift.
  • Most paradigm shifts cannot occur until incremental improvements lay the foundation for the paradigm shift to happen, no matter how much effort is put into trying to recognize paradigm shifts.
  • Something else?

I think figuring out which of these universes we're in would be enormously valuable and seems especially high-leverage. In particular, I expect there to be many different perspectives on this and a lot of productive disagreement between them, so I don't think multiple people working on this would step on each other's toes.

