Automating alignment may be harder than automating capabilities, because of ‘unsafe to verify’ tasks
This was a note I wrote for my colleagues on UK AISI's Alignment Team. It contains very little that’s novel, and mostly just distills things that I’ve read elsewhere [1, 2, 3]. Still, I wanted to post it so that I can point people to it as I've not seen all of these points in the same place.
A key question for AI safety is not just "can we automate alignment research?" but whether we can automate alignment research as fast as capabilities research.[1][2] I think a key factor that could make automating alignment research harder is the potential absence of tasks with safe-ish outcome-based feedback. This might be one component of what makes a "fuzzy task".[3]
Tasks with safe outcome-based feedback let you try something, observe the outcome, and update. But in many cases there's a trade-off between the directness and the safety of the feedback. At one extreme, you roll out the full effect and measure the outcome directly. This is very informative, but potentially catastrophic (the patient is already dead if the drug fails). At the other extreme, you rely entirely on upstream proxies that are safe, but may not track what you care about (e.g. petri dish experiments). Both 'alignment' research and 'capabilities' research will have tasks lying along such a directness-safety frontier, but the shapes of the two frontiers might be different.
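To make the trade-off concrete, here's a toy sketch (the channels and all the numbers below are invented for illustration, not estimates): treat each feedback channel as a pair of "how much you learn" and "how likely collecting that feedback is to go badly", and compare how much signal each field can gather within a given risk budget.

```python
# Toy model (invented numbers): each feedback channel trades off how much you
# learn against how risky it is to collect that feedback.
from dataclasses import dataclass


@dataclass
class FeedbackChannel:
    name: str
    bits_of_signal: float  # how informative the outcome is about what you care about
    p_catastrophe: float   # chance that collecting this feedback goes badly


# Hypothetical frontiers: capabilities has a smooth ramp of intermediate options,
# while alignment jumps from weak proxies straight to an unacceptably risky experiment.
capabilities = [
    FeedbackChannel("offline benchmark", 1.0, 0.00),
    FeedbackChannel("limited pilot deployment", 3.0, 0.01),
    FeedbackChannel("full commercial deployment", 6.0, 0.05),
]
alignment = [
    FeedbackChannel("looks good to human evaluators", 0.5, 0.00),
    FeedbackChannel("deploy a possibly-misaligned ASI and observe", 10.0, 0.50),
]


def safe_signal(channels: list[FeedbackChannel], risk_budget: float) -> float:
    """Total signal available from channels whose risk fits within the budget."""
    return sum(c.bits_of_signal for c in channels if c.p_catastrophe <= risk_budget)


for budget in (0.0, 0.01, 0.05):
    print(f"risk budget {budget:.2f}: "
          f"capabilities gets {safe_signal(capabilities, budget):.1f} bits, "
          f"alignment gets {safe_signal(alignment, budget):.1f} bits")
```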
Capabilities for productivity have some safe-ish feedback. Consider three potential training methods for economically productive AI:
There is a trade-off here: increasing directness also brings increasing financial and legal risk. But the gradient is potentially smooth, and even in the last case, an irresponsible company might in practice be able to get several bits of feedback.
Sometimes it's hard to get feedback on "fuzzy tasks" (e.g. "write a good company culture doc"). However, such tasks can often be evaluated via their downstream effect on something you can measure directly (e.g. profit). The same applies to capabilities R&D: "is this a good research idea?" is hard to evaluate in isolation, but it eventually cashes out in improved performance on benchmarks.
It seems likely you can get safe-ish, direct-ish feedback on automating AI R&D for profit maximisation (even if rewards might be sparse).
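As a toy sketch of what "cashing out in benchmarks" could look like (the ideas and the run_experiment stub below are invented placeholders, not any real pipeline): you never have to judge an idea directly; you implement it and measure the downstream metric.

```python
import random

random.seed(0)


def run_experiment(idea: str) -> float:
    # Placeholder: in reality this means implementing the idea and re-running a
    # benchmark suite, which is slow and expensive but safe-ish to do.
    return random.gauss(0.0, 1.0)  # stub: pretend benchmark delta


baseline = 62.0
ideas = ["wider context window", "better data filtering", "new optimiser schedule"]

# Outcome-based feedback: we never answer "is this a good research idea?" in
# isolation; we just measure the downstream benchmark delta each idea produces.
deltas = {idea: run_experiment(idea) for idea in ideas}
best = max(deltas, key=deltas.get)
print(f"best idea by measured outcome: {best!r} "
      f"({baseline + deltas[best]:.2f} vs baseline {baseline})")
```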
(Note that I'm not claiming that automating capabilities improvements is safe; you may still get a misaligned AI system. I'm only claiming that it is not the feedback itself that is dangerous.)
Alignment might not have any safe-ish feedback. We can't run the end-to-end experiment by deploying a misaligned ASI, observing the catastrophe, then updating the weights. However, less direct proxies are also pretty suspect: e.g. it might not be wise to optimise for "looks like good alignment research to human evaluators".
I think this vignette from "The Case Against AI Control Research" illustrates this well, but the TL;DR is: your early transformative AI produces alignment research that looks good, but you can't safely verify that it is good (since verifying it would mean actually building the superintelligence).
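One way to see why the evaluator proxy is suspect is a standard Goodhart-style selection toy model (nothing below is specific to alignment research, and the distributions are invented): the harder you select on a noisy "looks good to evaluators" score, the more that score overstates true quality, and without a safe direct measurement you never find out by how much.

```python
import random

random.seed(0)


def true_quality() -> float:
    """Unobservable: how good a piece of alignment research actually is."""
    return random.gauss(0.0, 1.0)


def evaluator_score(quality: float) -> float:
    """Observable proxy: how good the work looks to human evaluators (noisy)."""
    return quality + random.gauss(0.0, 1.0)


for n_candidates in (1, 10, 100, 1000):
    gaps = []
    for _ in range(200):  # repeat to average out noise
        candidates = [true_quality() for _ in range(n_candidates)]
        scored = [(evaluator_score(q), q) for q in candidates]
        looks_best, actually_is = max(scored)  # select what looks best to evaluators
        gaps.append(looks_best - actually_is)
    print(f"pick best of {n_candidates:4d} candidates: proxy overstates true "
          f"quality by {sum(gaps) / len(gaps):.2f} on average")
```

In the capabilities case, the downstream benchmark eventually corrects for this kind of gap; the worry above is that alignment may have no safe analogue of that correction.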
A core crux for whether automated alignment research has any hope of keeping up is whether we can find tasks with safe-ish, direct-ish feedback for alignment. This might reflect two other disagreements:
If a safe-feedback asymmetry between capabilities and alignment persists through recursive improvement, automated alignment research could fail even if alignment were solvable in principle.
Thanks Aleksandr, Marie, Jacob and Kola for feedback.
[1] The alignment/capabilities distinction is somewhat blurry in general, but here I think it points at the difference between getting what you can measure and getting what you want.
[2] If we're being precise, "as fast as" is not the right term here. There's no objective one-to-one mapping between measures of research progress in the two areas, and therefore no objective comparison of their speeds. It could be that we're making ~0 progress on automating alignment, or that 'more' or 'less' research is needed to automate alignment versus capabilities. In fact, the relevant thing is only whether we 'solve' alignment for a given capability level in time for that capability level to be manifest, which might be fairly discrete if takeoffs are fast.
[3] Another component of what makes tasks fuzzy might be reward sparsity, but this problem feels more symmetric between alignment and capabilities.