The current default view seems to roughly be:

  • Inner alignment is more important than outer alignment (or, alternatively, this distinction is bad/sub-optimal, but basically it's all about generalizing correctly)
  • Scalable oversight is the only useful form of outer alignment research remaining.
  • We don't need to worry about sample efficiency in RLHP -- in the limit we just pay everyone to provide feedback, and in practice even a few thousand samples (or a "constitution") seems ~good enough.
  • But maybe scalable oversight isn't actually good, because it's more like capabilities research?

A common example used for motivating scalable oversight is the "AI CEO".

My views are:

  • We should not be aiming to build AI CEOs
  • We should be aiming to robustly align AIs to perform "simpler" behaviors that unaided humans (or humans aided with more conventional tools, not, e.g. AI systems trained with RL to do highly interpretive work) feel they can competently judge.
  • We should aim for a situation where there is broad agreement against building AIs with more ambitious alignment targets (e.g. AI CEOs).
  • From this PoV, scalable oversight does in fact look mostly like capabilities research.
  • However, scalable oversight research can still be justified because "If we don't, someone else will".  But this type of replaceability argument should always be treated with extreme caution.  The reality is more complex: 1) there will be tipping points where it suddenly ceases to apply, and your individual actions actually have a large impact on norms.  2) The details matter, and the tipping points are in different places for different types of research/applications, etc.
  • It may also make sense to work on scalable oversight in order to increase robustness of AI performance on tasks humans feel they can competently judge ("robustness amplification").  For instance, we could use unaided human judgments and AI-assisted human judgments as safety filters, and not deploy a system unless both processes conclude it is safe.
  • Getting AI systems to perform simpler behaviors safely remains an important research topic, and will likely require improving sample efficiency; the sum total of available human labor will be insufficient for robust alignment, and we probably need to use different architectures / hybrid systems of some form as well.
  • EtA: the main issue I have with scalable oversight is less that it is advancing capabilities, per se, and more that it seems to raise a "chicken-and-egg" problem, i.e. the arguments for safety/alignment end up being somewhat circular: "this system is safe because the system we used as an assistant was safe" (but I don't think we've solved the "build a safe assistant" part yet, i.e. we don't have the base case for the induction).



If you don't like AI systems doing tasks that humans can't evaluate, I think you should be concerned about the fact that people keep building larger models and fine-tuning them in ways that elicit intelligent behavior.

Indeed, I think accelerating progress in scaling up language models is likely net negative (given our current level of preparedness) and will become more clearly net negative over time as risks grow. I'm very excited about efforts to monitor and build consensus about these risks, or to convince or pressure AI labs to slow down development as further scaling becomes more risky.

But I think something has gone quite wrong if our collective strategy ends up being "We keep training smarter systems, but fortunately we are only able to fine-tune them to do tasks with external feedback signals from the world." You're going to get increasingly smart systems that can predict and manipulate observable features of the world; this is just asking for a catastrophic failure from reward hacking or deceptive alignment.

Even the world where you keep building larger and larger language models, but avoid ever training them to act in the world, seems like a recipe for trouble. It creates an unstable situation where anyone who fine-tunes a model can cause a catastrophe, or where a tiny behavioral quirk of a model could lead it to take over the world.

If we want to avoid AI systems taking over the world then I think we should strongly prefer to do it by stopping the creation of systems smart enough to do so. That seems like a much better place for norms.

I think this entire picture becomes even more compelling if you are mostly worried about deceptive alignment. I think training bigger models seems like the overwhelmingly dominant risk factor, and making models of a fixed size more useful seems like an extremely plausible intervention for reducing the risk of deceptive alignment.

Perhaps the big difference is that I think it matters a lot whether AI systems are smart enough to be an AI CEO; and not so much whether people are actually trying to employ their AI as a CEO. The risk comes from a deceptive aligned AI (or reward-hacking AI) having that level of competence. If you are in that unfortunate world, then you probably do unfortunately want a bunch of aligned AIs doing complex stuff to help you survive.

If there is an attempt to build consensus in the ML community around "Hey you can train language models just try not to have them do complicated stuff" then I wouldn't push back on it. But I think saying "Don't try to align AI systems that do complex tasks because it interferes with norms against using AIs to do complex tasks" is actively counterproductive. And I think this is a less reasonable ask than "just stop building such big language models," and so I'll argue against it being the main ask the safety community focuses on.

I understand your point of view and think it is reasonable.

However, I don't think "don't build bigger models" and "don't train models to do complicated things" need to be at odds with each other.  I see the argument you are making, but I think success on these asks is likely highly correlated, via the underlying causal factor of humanity being concerned enough about AI x-risk and coordinated enough to ensure responsible AI development.

I also think the training procedure matters a lot (and you seem to be suggesting otherwise?), since if you don't do RL or other training schemes that seem designed to induce agentyness and you don't do tasks that use an agentic supervision signal, then you probably don't get agents for a long time (if ever).


if you don't do RL or other training schemes that seem designed to induce agentyness and you don't do tasks that use an agentic supervision signal, then you probably don't get agents for a long time


Is this really the case? If you imagine a perfect Oracle AI, which is certainly not agenty, it seems to me that with some simple scaffolding, one could construct a highly agentic system. It would go something along the lines of 

  1. Set up API access to 'things' which can interact with the real world.
  2. Ask the oracle 'What would be the optimal action if you want to do <insert-goal> via <insert-api-functions>?'
  3. Execute the actions it outputs.
  4. Add some kind of looping mechanism to gain feedback from the world and account for it.

This is my line of reasoning why AIS matters for language models in general.
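The scaffolding recipe above can be sketched in a few lines. This is a toy illustration, not a claim about any real system: the "oracle" here is a stub greedy heuristic standing in for a powerful question-answering model, and the "world" is just an integer state, but the plan-act-observe loop around it is what makes the overall system agentic even though the oracle itself only answers queries.

```python
def oracle(goal, state, actions):
    """Stub oracle (hypothetical): answers the question 'which available
    action moves `state` closest to `goal`?' without taking any action
    itself. A trivial greedy heuristic stands in for a capable model."""
    return min(actions, key=lambda a: abs((state + a) - goal))

def agent_loop(goal, state, actions, max_steps=10):
    """Agentic wrapper: repeatedly query the oracle (step 2), execute
    the chosen action in the 'world' (step 3), and feed the resulting
    observation back into the next query (step 4)."""
    for _ in range(max_steps):
        if state == goal:
            break
        action = oracle(goal, state, actions)  # ask the oracle
        state += action                        # act on the world
        # the updated `state` is the feedback used on the next iteration
    return state

print(agent_loop(goal=7, state=0, actions=[-1, 1, 3]))  # prints 7
```

The point of the sketch is that none of the agency lives in `oracle`; it is supplied entirely by the loop, which is why a purely question-answering system plus simple scaffolding can behave like an agent.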

I meant "other training schemes" to encompass things like scaffolding that deliberately engineers agents using LLMs as components, although I acknowledge these are not literally "training" and are more like "engineering".

The thing that we care about is how long it takes to get to agents. If we put lots of effort into making powerful Oracle systems or other non-agentic systems, we must assume that agentic systems will follow shortly. Someone will make them, even if you do not.

I don't disagree... in this case you don't get agents for a long time; someone else does though.

When ML models get more competent, ML capabilities researchers will have strong incentives to build superhuman models. Finding superhuman training techniques would be the main thing they'd work on. Consequently, when the problem is more tractable, I don't see why it'd be neglected by the capabilities community--it'd be unreasonable for profit maximizers not to have it as a top priority when it becomes tractable. I don't see why alignment researchers have to work in this area with high externalities now and ignore other safe alignment research areas (in practice, the alignment teams with compute are mostly just working on this area). I'd be in favor of figuring out how to get superhuman supervision for specific things related to normative factors/human values (e.g., superhuman wellbeing supervision), but researching superhuman supervision simpliciter will be the aim of the capabilities community.

Don't worry, the capabilities community will relentlessly maximize vanilla accuracy, and we don't need to help them.

I think I disagree with lots of things in this post, sometimes in ways that partly cancel each other out.

  • Parts of generalizing correctly involve outer alignment. I.e. building objective functions that have "something to say" about how humans want the AI to generalize.
  • Relatedly, outer alignment research is not done, and RLHF/P is not the be-all-end-all.
  • I think we should be aiming to build AI CEOs (or more generally, working on safety technology with an eye towards how it could be used in AGI that skillfully navigates the real world). Yes, the reality of the game we're playing with gung-ho orgs is more complicated, but sometimes, if you don't, someone else really will.
  • Getting AI systems to perform simpler behaviors safely also looks like capabilities research. When you say "this will likely require improving sample efficiency," a bright light should flash. This isn't a fatal problem - some amount of advancing capabilities is just a cost of doing business. There exists safety research that doesn't advance capabilities, but that subset has a lot of restrictions on it (little connection to ML being the big one). Rather than avoiding ever advancing AI capabilities, we should acknowledge that fact in advance and try to make plans that account for it.

(A very quick response):

Agree with (1) and (2).  
I am ambivalent RE (3) and the replaceability arguments.
RE (4): I largely agree, but I think the norm should be "let's try to do less ambitious stuff properly" rather than "let's try to do the most ambitious stuff we can, and then try and figure out how to do it as safely as possible as a secondary objective".