(Edit: I added text between "...was really about reductions." and "To use the mental move of robustness...", because comments showed me I hadn't made my meaning clear enough.)
This post is part of the work done at Conjecture.
In Robustness To Scale, Scott Garrabrant presents three kinds of... well, robustness to scale:
- Robustness to scaling up, meaning that a solution to alignment keeps working as the AI gets better.
- Robustness to scaling down, meaning that a solution to alignment keeps working if the AI is not optimal or perfect.
- Robustness to relative scale, meaning that the solution to alignment doesn't rely on symmetrical capabilities between different subsystems.
And the thing is, until quite recently, I didn't get the value of robustness to scaling down. Scaling up was obviously important, and relative scale, while less intuitive, still made sense. But robustness to scaling down felt like a criticism for people with solutions that worked in the limit, when no one had anything close to that.
Then I realized robustness to scaling down was really about reductions.
I need to expand a little bit here. In Scott's original post, his example of a lack of robustness to scaling down is the alignment strategy of making the AI find the preferences and values of humans, and then pursuing that. The problem is that if the AI is not doing that perfectly or almost perfectly, problems might crop up.
This is thus an argument that the reduction of alignment to "point AIs to human values" is not so straightforward, because it requires a nearly exact implementation and is not tolerant to partial solutions.
An important thing to note here is that the AI and the alignment scheme are inextricably linked in this example, because it's the AI itself that does the brunt of the work in the alignment scheme. Which leads to Scott's framing of robustness to scaling down in terms of scaling down the AI's capabilities. And it fits nicely with the more obvious robustness to scaling up.
Yet I have found that framing it instead in terms of scaling down the alignment scheme is more productive, as it can be applied to cases where the AI isn't the one running the alignment scheme.
With this reframing, to use the mental move of robustness to scaling down, you don't need to be in front of a solution that works in the limit; you only need to see an argument that something is enough if it's done perfectly. Then robustness to scaling down directs your epistemological vigilance toward asking whether the reduction still holds when the subproblem or approach is done less than perfectly.
For example: interpretability. Some researchers have argued that alignment just reduces to interpretability. After all, if we have full interpretability, or almost full interpretability, we can see and understand everything that the AI thinks and will do — thus catching every problem before it manifests. We could even hardcode some supervisor that runs constantly and automates the problem-catching process.
Traditionally, the response from conceptual researchers has been that interpretability is impossible, or will take too long, or misses some fundamental essence of the problem. Yet even if we allow the reduction, it fails the robustness to scaling down test, because the reduction requires solving interpretability completely, or almost completely. If our interpretability is merely good or great, but doesn't capture everything relevant to alignment, the argument behind the reduction fails. And getting everything right is the sort of highly ambitious goal that we should definitely aim for, yet without putting all our bets on it.
Here the application of robustness to scaling down undermines the reduction to interpretability and justifies the need for parallel conceptual alignment work.
Yet this mental move also applies outside of ML/winging-it type solutions. An example in the more conceptual realm is the classical reduction of multi-multi alignment (multiple AIs to multiple humans) to single-single alignment. Here too the reduction makes sense: if we fully solve single-single alignment, and we really care about multi-multi alignment, our aligned AGI can solve it better than we could ourselves.
Once again there are direct lines of criticism, but let's just apply robustness to scaling down: this reduction requires solving a particularly ambitious form of the single-single alignment problem. Not "the AI does what I ask without killing me", but full-on "the AI learns my CEV and pursues it." And it suddenly seems far more plausible that we will solve the former long before the latter, and will only have the former at hand when we need multi-multi alignment.
To summarize, robustness to scaling down is a basic tool of epistemological vigilance that can be tremendously useful when looking at strong reductions of all alignment to one approach or subproblem.
By interpreting interpretability generously enough to include the ability to search efficiently for issues, for example.
Obviously this doesn't argue against the value of interpretability research. It just breaks the reduction argument that interpretability is enough by itself.
Funnily enough, most people I've seen defending this reduction have research agendas that focus far more on the former than the latter, for sensible tractability reasons.
You define robustness to scaling down as "a solution to alignment keeps working if the AI is not optimal or perfect", but for interpretability you talk about "our interpretability is merely good or great, but doesn't capture everything relevant to alignment", which seems to be about the alignment approach or our understanding being flawed, not the AI. I can imagine techniques being robust to imperfect AI, but I find it harder to imagine how any alignment approach could be robust if the approach itself, or our implementation of it, is flawed. Do you have any example of this?
That's a great point!
There's definitely one big difference between how Scott defined it and how I'm using it, which you highlighted well. I think a better way of explaining my change is that in Scott's original example, the AI being flawed results, in some sense, in the alignment scheme (predict human values and do that) being flawed too.
I hadn't made the explicit claim in my head or in the post, but thanks to your comment, I think I'm claiming that the version I'm proposing generalizes one of the interesting parts of the original definition, and lets it be applied to more settings.
As for your question, there is a difference between flawed and not the strongest version. What I'm saying about interpretability and single-single is not that a flawed implementation of them would not work (which is obviously trivial), but that for the reductions to function, you need to solve a particularly ambitious form of the problem. And that we don't currently have a good reason to expect to solve this ambitious problem with enough probability to warrant trusting the reduction and not working on anything else.
So an example of a plausible solution (of course I don't have a good solution at hand) would be to create interpretability techniques good enough that, when combined with conceptual and mathematical characterizations of problematic behaviours like deception, they let us see whether a model will end up having these problematic behaviours. Notice that this possible solution requires working on conceptual alignment, which the reduction to interpretability would strongly discourage.
To summarize, I'm not claiming that interpretability (or single-single) won't be enough if it's flawed, just that reducing the alignment problem (or multi-multi) to them is actually a reduction to an incredibly strong and ambitious version of the problem, that no one is currently tackling this strong version, and that we have no reason to expect to solve the strong version with such high probability that we should shun alternatives and other approaches.
Does that clarify your confusion with my model?
Yep, that clarifies.
Here is another interpretation of what can cause a lack of robustness to scaling down:
(Maybe this is what you have in mind when you talk about single-single alignment not (necessarily) scaling to multi-multi alignment, but I am not sure that is the case. Even if it is, I feel pulled to state it again, as I don't think it comes out as clearly as I would want it to in the original post.)
Taking the example of an "alignment strategy [that makes] the AI find the preferences and values of humans, and then pursu[e] that", robustness to scaling down can break if "human values" (as invoked in the example) don't "survive" reductionism; i.e. if, when we try to apply reductionism to "human values", we are left with "less" than what we hoped for.
This is the inverse of saying that there is an important non-linearity when trying to scale (up) from single-single alignment to multi-multi alignment.
I think this interpretation locates the reason for the lack of robustness in neither capabilities nor the alignment regime, which is why I wanted to raise it. It's a claim about the nature or structural properties of "human values", or a hint that we are deeply confused about human values (e.g. that the term currently refers to an incoherent cluster or an "unnatural" abstraction).
What you say about CEV might capture this fully, but whether it does, I think, is an empirical claim of sorts; a proposed solution to the more general diagnosis that I am trying to propose, namely that (the way we currently use the term) "human values" may itself not be robust to scaling down.
This is a useful idea; it complements the omnipotence test, where you ask if the AI as a whole still does the right thing if it's scaled up to an absurd degree (but civilization outside the AI, which is like a part of it for alignment purposes, isn't scaled up). In particular, any reflectively stable maximizer that's not aimed exactly and with no approximations at CEV fails this test because of Goodhart. The traditional answer is to aim it exactly, while the more recent answer is to prevent maximization at the decision theory level, so that acting very well is still not maximization.
Robustness to scaling down instead makes some parts of the system ineffectual, even if that shouldn't plausibly happen, and asks what happens then: would the other parts take advantage and cause trouble? What if civilization, seen as a part of the AI for purposes of alignment and holding its values, doesn't work very well? Would AI-except-civilization cause trouble?
I imagine the next step should have some part compromised by a capable adversary, acting purposefully to undermine the system. Robustness to catastrophic failure in a part of the design rather than to scaling down. This seems related to inner alignment and corrigibility: making sure parts don't lose their purposes, while having the parts themselves cooperate with fixing their purposes and not acting outside their purposes.