Nate and Eliezer (Lethality 21) claim that capabilities generalise further than alignment once capabilities start generalising far at all. However, they have not articulated particularly detailed arguments for why this is the case. In this post I collect the arguments for and against the position I have been able to find or generate, and develop them (with a few hours’ effort). I invite you to join me in better understanding this claim and its veracity by contributing your own arguments and improving mine.
Thanks to these people for their help with writing and/or contributing arguments: Vikrant Varma, Vika Krakovna, Mary Phuong, Rory Grieg, Tim Genewein, Rohin Shah.
1. Capabilities have much shorter description length than alignment.
There are simple “laws of intelligence” that underwrite highly general and competent cognitive abilities, but no such simple laws of corrigibility or laws of “doing what the principal means” – or at least, any specification of these latter things will have a higher description length than the laws of intelligence. As a result, most R&D pathways optimising for capabilities and alignment with anything like a simplicity prior (for example) will encounter good approximations of general intelligence earlier than good approximations of corrigibility or alignment.
2. Feedback on capabilities is more consistent and reliable than on alignment.
Reality hits back on cognitive strategies implementing capabilities – such as forming and maintaining accurate beliefs, or making good predictions – more consistently and reliably than any training process hits back on motivational systems orienting around incorrect optimisation targets. Therefore there’s stronger outer optimisation pressure towards good (robust) capabilities than alignment, so we see strong and general capabilities first.
3. There’s essentially only one way to get general capabilities and it has a free parameter for the optimisation target.
There are many paths but only one destination when it comes to designing (via optimisation) a system with strong capabilities. But what those capabilities end up being directed at is path- and prior-dependent in a way we currently do not understand nor have much control over.
4. Corrigibility is conceptually in tension with capability, so corrigibility will fail to generalise when capability generalises well.
Plans that actually work in difficult domains need to preempt or adapt to obstacles. Attempts to steer or correct the target of actually-working planning are a form of obstacle, so we would expect capable planning to resist correction, limiting the extent to which alignment can generalise when capability starts to generalise.
5. Empirical evidence: human intelligence generalised far without staying aligned with its optimisation target.
There is empirical/historical support for capabilities generalising further than alignment to the extent that the analogy of AI development to the evolution of intelligence holds up.
6. Empirical evidence: goal misgeneralisation happens.
There is weak empirical support for capabilities generalising further than alignment in the fact that it is possible to create demos of goal misgeneralisation (e.g., https://arxiv.org/abs/2105.14111).
7. The world is simple whereas the target is not.
There are relatively simple laws governing how the world works, for the purposes of predicting and controlling it, compared to the principles underlying what humans value or the processes by which we figure out what is good. (This is similar to For#1 but focused on knowledge instead of cognitive abilities.) (This is in direct opposition to Against#3.)
8. Much more effort will be poured into capabilities (and d(progress)/d(effort) for alignment is not so much higher than for capabilities to counteract this).
We’ll assume alignment is harder based on the other arguments. For why more effort will be put into capabilities, there are two economic arguments: (a) at lower capability levels there is more profitability in advancing capabilities than alignment specifically, and (b) data about reality in general is cheaper and more abundant than data about any particular alignment target (e.g., human-preference data).
This argument is similar to For#2 but focused more on the incentives faced by R&D organisations and efforts: paths to developing capabilities are more salient and attractive.
9. Alignment techniques will be shallow and won’t withstand the transition to strong capabilities.
There are two reasons: (a) we don’t have a principled understanding of alignment and (b) we won’t have a chance to refine our techniques in the strong capabilities regime.
If advances in a core of general reasoning cause performance on specific domains like bioengineering or psychology to look "jumpy", this will likely happen at the same time as a jump in the ability to understand and deceive the training process, and evade the shallow alignment techniques.
1. Optimal capabilities are computationally intractable; tractable capabilities are more alignable.
For example, it may be that the structure of the cognition of tractable capabilities does not look like optimal planning - there’s no obvious factorisation into goals and capabilities. Convergent instrumental subgoals may not apply strongly to the intelligences we actually find.
2. Reality hits back on the models we train via loss functions based on reality-generated data. But alignment also hits back on models we train, because we also use loss functions (based on preference data). These seem to be symmetrically powerful forces.
In fact we care a lot about models that are deceptive or harmful in non-x-risky ways, and spend massive effort curating datasets that describe safe behaviour. As models get more powerful, we will be able to automate the process of generating better datasets, including through AI assistance. Eventually we will effectively be able to constrain the behaviour of superhuman systems with the sheer quantity and diversity of training data.
3. Alignment only requires building a pointer, whereas capability requires lots of knowledge. Thus the overhead of alignment is small, and can ride increasing capabilities.
(Example of a similar structure, which gives some empirical evidence: millions of dollars to train GPT-3 but only thousands of dollars to finetune on summarisation.)
4. We may have schemes for directing capabilities at the problem of oversight, thus piggy-backing on capability generalisation.
E.g. debate and recursive reward modelling. Furthermore, overseers are asymmetrically advantaged (e.g. because of white-box access or the ability to test in simulation on hypotheticals).
5. Empirical evidence: some capabilities improvements have included corresponding improvements in alignment.
It has proved possible, for example fine-tuning language models on human instructions, to build on capabilities to advance alignment. Extrapolating from this, we might expect alignment to generalise alongside capabilities. For example, billions of tokens are required for decent language capabilities but then only thousands of human feedback points are required to point them at a task.
6. Capabilities might be possible without goal-directedness.
Humans are arguably not strongly goal-directed. We seem to care about lots of different things, and mostly don't end up with a desire to strongly optimise the world towards a simple objective.
Also, we can build tool AIs (such as a physics simulator or a chip designer) which are targeted at such a narrow domain that goal-directedness is not relevant since they aren't strategically located in our world. These AIs are valuable enough to produce economic bounties while coordinating against goal-directed AI development.
7. You don't actually get sharp capability jumps in relevant domains
The AI industry will optimise hard on all economically relevant domains (like bioengineering, psychology, or AI research), which will eliminate capability overhangs and cause progress on these domains to look smooth. This means we get to test our alignment techniques on slightly weaker AIs before we have to rely on them for slightly stronger AIs. This will give us time to refine them into deep alignment techniques rather than shallow ones, which generalise enough.