(Edit: I added text between "...was really about reductions." and "To use the mental move of robustness...", because comments showed me I hadn't made my meaning clear enough.)
This work was done while at Conjecture.
In Robustness To Scale, Scott Garrabrant presents three kinds of... well, robustness to scale:
- Robustness to scaling up, meaning that a solution to alignment keeps working as the AI gets better.
- Robustness to scaling down, meaning that a solution to alignment keeps working if the AI is not optimal or perfect.
- Robustness to relative scale, meaning that the solution to alignment doesn't rely on symmetrical capabilities between different subsystems.
And the thing is, until quite recently, I didn't get the value of robustness to scaling down. Scaling up was obviously important, and relative scale, while less intuitive, still made sense. But robustness to scaling down felt like a criticism for people with solutions that worked in the limit, when no one had anything close to that.
Then I realized robustness to scaling down was really about reductions.
I need to expand a little bit here. In Scott's original post, his example of a lack of robustness to scaling down is the alignment strategy of making the AI find the preferences and values of humans, and then pursue them. The problem is that if the AI does not do this perfectly or almost perfectly, problems might crop up.
This is thus an argument that the reduction of alignment to "point AIs to human values" is not so straightforward, because it requires a nearly exact implementation and does not tolerate partial solutions.
An important thing to note here is that the AI and the alignment scheme are inextricably linked in this example, because it's the AI itself that does the brunt of the work in the alignment scheme. Which leads to Scott's framing of robustness to scaling down in terms of scaling down the AI's capabilities. And it fits nicely with the more obvious robustness to scaling up.
Yet I have found that framing it instead in terms of scaling down the alignment scheme is more productive, as it can be applied to cases where the AI isn't the one running the alignment scheme.
With this new reframing, to use the mental move of robustness to scaling down, you don't need to be in front of a solution that works in the limit; you only need to see an argument that something is enough if it's done perfectly. Robustness to scaling down then directs your epistemological vigilance toward one question: does the reduction still hold if the subproblem or approach is solved less than perfectly?
For example: interpretability. Some researchers have argued that alignment just reduces to interpretability. After all, if we have full interpretability, or almost full interpretability, we can see and understand everything that the AI thinks and will do — thus catching every problem before it manifests. We could even hardcode some supervisor that runs constantly and automates the problem-catching process.
Traditionally, the response from conceptual researchers has been that interpretability is impossible, or will take too long, or misses some fundamental essence of the problem. Yet even if we allow the reduction, it fails the robustness to scaling down test. Because this reduction requires solving interpretability completely, or almost completely. If our interpretability is merely good or great, but doesn't capture everything relevant to alignment, the argument behind the reduction fails. And getting everything right is the sort of highly ambitious goal that we should definitely aim for, yet without putting all our bets on it.
Here the application of robustness to scaling down undermines the reduction to interpretability and justifies the need for parallel conceptual alignment work.
Yet this mental move also applies outside of ML or "winging it" type solutions. An example in the more conceptual realm is the classical reduction of multi-multi alignment (multiple AIs to multiple humans) to single-single alignment. Here too the reduction makes sense: if we fully solve single-single alignment, and we really care about multi-multi alignment, our aligned AGI can do it better than if we did it ourselves.
Once again there are direct lines of criticism, but let's just apply robustness to scaling down: this reduction requires solving a particularly ambitious form of the single-single alignment problem. Not "the AI does what I ask without killing me", but full-on "the AI learns my CEV and pursues it." And it suddenly seems far more plausible that we will solve the former long before the latter, and will only have the former at hand when we need multi-multi alignment.
To summarize, robustness to scaling down is a basic tool of epistemological vigilance that can be tremendously useful when looking at strong reductions of all alignment to one approach or subproblem.
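On "full interpretability, or almost full interpretability": this holds if we interpret interpretability generously enough to include the ability to search efficiently for issues, for example.
On the reduction to interpretability: obviously this doesn't argue against the value of interpretability research. It just breaks the reduction argument that interpretability is enough by itself.
On the two forms of single-single alignment: funnily enough, most people I've seen defending the reduction of multi-multi to single-single alignment have research agendas that focus far more on the former ("the AI does what I ask without killing me") than on the latter ("the AI learns my CEV and pursues it"), for sensible tractability reasons.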