The current default view seems to roughly be:
- Inner alignment is more important than outer alignment (or, alternatively, this distinction is bad/sub-optimal, but basically it's all about generalizing correctly)
- Scalable oversight is the only useful form of outer alignment research remaining.
- We don't need to worry about sample efficiency in RLHP -- in the limit we just pay everyone to provide feedback, and in practice even a few thousand samples (or a "constitution") seems ~good enough.
- But maybe it's not good? Because it's more like capabilities research?
A common example used for motivating scalable oversight is the "AI CEO".
My views are:
- We should not be aiming to build AI CEOs
- We should be aiming to robustly align AIs to perform "simpler" behaviors that unaided humans (or humans aided with more conventional tools, not, e.g. AI systems trained with RL to do highly interpretive work) feel they can competently judge.
- We should aim for a situation where there is broad agreement against building AIs with more ambitious alignment targets (e.g. AI CEOs).
- From this PoV, scalable oversight does in fact look mostly like capabilities research.
- However, scalable oversight research can still be justified because "If we don't, someone else will". But this type of replaceability argument should always be treated with extreme caution. The reality is more complex: 1) there will be tipping points where it suddenly ceases to apply, and your individual actions actually have a large impact on norms. 2) The details matter, and the tipping points are in different places for different types of research/applications, etc.
- It may also make sense to work on scalable oversight in order to increase robustness of AI performance on tasks humans feel they can competently judge ("robustness amplification"). For instance, we could use unaided human judgments and AI-assisted human judgments as safety filters, and not deploy a system unless both processes conclude it is safe.
- Getting AI systems to safely perform simpler behaviors remains an important research topic, and will likely require improving sample efficiency; the sum total of available human labor will be insufficient for robust alignment, and we probably need to use different architectures / hybrid systems of some form as well.
- EtA: the main issue I have with scalable oversight is less that it is advancing capabilities, per se, and more that it seems to raise a "chicken-and-egg" problem, i.e. the arguments for safety/alignment end up being somewhat circular: "this system is safe because the system we used as an assistant was safe" (but I don't think we've solved the "build a safe assistant" part yet, i.e. we don't have the base case for the induction).
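To make the "robustness amplification" idea above a bit more concrete, here is a minimal sketch of the two-filter deployment gate. Everything in it is hypothetical (the `Judge` type, the function name, and the idea that a judgment reduces to a single boolean are illustrative assumptions, not a real evaluation procedure); the only point is that the AI-assisted judgment can veto a deployment but can never override an unaided human veto.

```python
from typing import Callable

# Hypothetical type: a judge inspects a candidate system (represented here
# by a plain string identifier) and returns True if it judges it safe.
Judge = Callable[[str], bool]


def approve_for_deployment(
    candidate: str,
    unaided_human: Judge,
    ai_assisted_human: Judge,
) -> bool:
    """Approve only if BOTH processes conclude the candidate is safe.

    This is a plain logical AND: adding the AI-assisted filter can only
    make the gate stricter than unaided human judgment alone.
    """
    return unaided_human(candidate) and ai_assisted_human(candidate)
```

Under this framing, the assistant is used to catch problems the unaided human missed, not to expand the set of tasks considered judgeable.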
If you don't like AI systems doing tasks that humans can't evaluate, I think you should be concerned about the fact that people keep building larger models and fine-tuning them in ways that elicit intelligent behavior.
Indeed, I think current scaling up of language models is likely net negative (given our current level of preparedness) and will become more clearly net negative over time as risks grow. I'm very excited about efforts to monitor and build consensus about these risks, or to convince or pressure AI labs to slow down development as further scaling becomes more risky.
But I think something has gone quite wrong if our collective strategy ends up being "We keep training smarter systems, but fortunately we are only able to fine-tune them to do tasks with external feedback signals from the world." You're going to get increasingly smart systems that can predict and manipulate observable features of the world; this is just asking for a catastrophic failure from reward hacking or deceptive alignment.
Even the world where you keep building larger and larger language models, but avoid ever training them to act in the world, seems like a recipe for trouble. It creates an unstable situation where anyone who fine-tunes a model can cause a catastrophe, or where a tiny behavioral quirk of a model could lead it to take over the world.
If we want to avoid AI systems taking over the world then I think we should strongly prefer to do it by stopping the creation of systems smart enough to do so. That seems like a much better place for norms.
I think this entire picture becomes even more compelling if you are mostly worried about deceptive alignment. I think training bigger models seems like the overwhelmingly dominant risk factor, and making models of a fixed size more useful seems like an extremely plausible intervention for reducing the risk of deceptive alignment.
Perhaps the big difference is that I think it matters a lot whether AI systems are smart enough to be an AI CEO, and not so much whether people are actually trying to employ their AI as a CEO. The risk comes from a deceptively aligned AI (or reward-hacking AI) having that level of competence. If you are in that unfortunate world, then you probably do unfortunately want a bunch of aligned AIs doing complex stuff to help you survive.
If there is an attempt to build consensus in the ML community around "Hey, you can train language models, just try not to have them do complicated stuff" then I wouldn't push back on it. But I think saying "Don't try to align AI systems that do complex tasks because it interferes with norms against using AIs to do complex tasks" is actively counterproductive. And I think this is a less reasonable ask than "just stop building such big language models," and so I'll argue against it being the main ask the safety community focuses on.
I understand your point of view and think it is reasonable.
However, I don't think "don't build bigger models" and "don't train models to do complicated things" need to be at odds with each other. I see the argument you are making, but I think success on these asks is likely highly correlated via the underlying causal factor of humanity being concerned enough about AI x-risk and coordinated enough to ensure responsible AI development.
I also think the training procedure matters a lot (and you seem to be suggesting otherwise?): if you don't do RL or other training schemes that seem designed to induce agentyness, and you don't train on tasks that use an agentic supervision signal, then you probably don't get agents for a long time (if ever).
When ML models get more competent, ML capabilities researchers will have strong incentives to build superhuman models. Finding superhuman training techniques would be the main thing they'd work on. Consequently, when the problem is more tractable, I don't see why it'd be neglected by the capabilities community--it'd be unreasonable for profit maximizers not to have it as a top priority when it becomes tractable. I don't see why alignment researchers have to work in this area with high externalities now and ignore other safe alignment research areas (in practice, the alignment teams with compute are mostly just working on this area). I'd be in favor of figuring out how to get superhuman supervision for specific things related to normative factors/human values (e.g., superhuman wellbeing supervision), but researching superhuman supervision simpliciter will be the aim of the capabilities community.
Don't worry, the capabilities community will relentlessly maximize vanilla accuracy, and we don't need to help them.
I think I disagree with lots of things in this post, sometimes in ways that partly cancel each other out.
(A very quick response):
Agree with (1) and (2).
I am ambivalent RE (3) and the replaceability arguments.
RE (4): I largely agree, but I think the norm should be "let's try to do less ambitious stuff properly" rather than "let's try to do the most ambitious stuff we can, and then try and figure out how to do it as safely as possible as a secondary objective".