Nate Soares argues that one of the core problems with AI alignment is that an AI system's capabilities will likely generalize to new domains much faster than its alignment properties. He thinks this is likely to happen in a sudden, discontinuous way (a "sharp left turn"), and that this transition will break most alignment approaches. He also argues that this problem isn't getting enough attention from the field.
Question for the alignment crowd:
The new Absolute Zero Reasoner paper proposes a "self‑play RL with zero external data" paradigm, where the same model both invents tasks and learns to solve them. It reaches SOTA coding + math scores without ever touching a human‑curated dataset. But the authors also flag a few "uh‑oh moments": e.g. their 8‑B Llama variant generated a chain‑of‑thought urging it to "outsmart … intelligent machines and less intelligent humans," and they explicitly list "lingering safety concerns" as an open problem, noting that the system "still necessitates oversight."
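For readers who haven't seen the paper, here is a minimal toy sketch of what "the same model both invents tasks and learns to solve them" amounts to. The function names (propose_task, attempt_solution, verify) and the arithmetic-task stand-in are my own illustrative assumptions, not the paper's implementation; in the paper a single LLM plays both roles and a code executor acts as the verifier.

```python
# Toy sketch of a self-play proposer/solver loop with no external data.
# All names and the arithmetic task family are illustrative assumptions.
import random

def propose_task(history):
    # Proposer role: invent a new task, with difficulty growing over time.
    difficulty = 1 + len(history) // 10
    a, b = random.randint(1, 10 ** difficulty), random.randint(1, 10 ** difficulty)
    return {"prompt": f"{a} + {b}", "answer": a + b}

def attempt_solution(task):
    # Solver role: try to answer the proposed task (stubbed as an imperfect solver
    # so the reward signal is informative).
    noise = 0 if random.random() < 0.9 else random.randint(-3, 3)
    return task["answer"] + noise

def verify(task, solution):
    # Programmatic verifier: no human-curated labels are needed because the
    # proposer generated the ground truth alongside the task.
    return 1.0 if solution == task["answer"] else 0.0

def self_play(steps=100):
    history = []
    for _ in range(steps):
        task = propose_task(history)
        solution = attempt_solution(task)
        reward = verify(task, solution)
        # In the real setting both roles would be updated with RL at this point;
        # this sketch only records the trajectory.
        history.append((task["prompt"], solution, reward))
    return sum(r for _, _, r in history) / len(history)

if __name__ == "__main__":
    print("mean solver reward:", self_play())
```

The point of the sketch is that the task distribution itself is produced by the system being trained, which is exactly what makes the oversight question below harder than in the fixed-environment case.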
My question: How should alignment researchers think about a learner that autonomously expands its own task distribution? Traditional RLHF/RLAIF assumes a fixed environment and lets us shape rewards; here the environment (the task generator) is part of the agent. Do existing alignment proposals—e.g. approval‑based amplification, debate, verifier‑game setups—scale to this recursive setting, or do we need new machinery (e.g. meta‑level corrigibility constraints on the proposer)? I’d love pointers to prior work or fresh ideas on how to keep a self‑improving task‑designer and task‑solver pointed in the right direction before capabilities sprint ahead of oversight.
From a 'real alignment' perspective (how to get the AI to want to do good things and not bad things), I think there are some obvious implications for the future of RLAIF.
You might think of the label 'RLAIF' as standing in for the general strategy of leveraging unsupervised data about human behavior to point the AI towards human preferences, using a scaffold that solicits the AI's predictions (or more general generative output, if the training isn't for pure prediction) about human preference-laden behaviors, and then transforms those predictions into some ...
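The sentence above is cut off, but assuming it continues along the lines of "transforms those predictions into a reward signal", a minimal sketch of that scaffold might look like the following. The function names and the pairwise-comparison format are my assumptions for illustration, not any specific RLAIF implementation.

```python
# Minimal sketch of an RLAIF-style scaffold: ask the model which response a
# human would prefer, then turn that prediction into a scalar training signal.
# All names here are hypothetical placeholders.

def predict_human_preference(model, prompt, response_a, response_b):
    # Solicit the model's prediction of human preference between two responses.
    query = (
        f"Prompt: {prompt}\n"
        f"Response A: {response_a}\nResponse B: {response_b}\n"
        "Which response would a typical human prefer?"
    )
    return model(query)  # assumed to return P(A preferred) in [0, 1]

def pairwise_reward(model, prompt, response_a, response_b):
    # Transform the preference prediction into a reward for the policy
    # that produced response_a.
    p_a = predict_human_preference(model, prompt, response_a, response_b)
    return 2.0 * p_a - 1.0  # reward in [-1, 1]

if __name__ == "__main__":
    dummy_model = lambda query: 0.8  # stand-in for a preference-predicting LLM
    print(pairwise_reward(dummy_model, "Summarize this email.", "short summary", "rude reply"))
```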
Full version on arXiv | X
Executive summary
AI risk scenarios usually portray a relatively sudden loss of human control to AIs, outmaneuvering individual humans and human institutions, due to a sudden increase in AI capabilities, or a coordinated betrayal. However, we argue that even an incremental increase in AI capabilities, without any coordinated power-seeking, poses a substantial risk of eventual human disempowerment. This loss of human influence will be centrally driven by having more competitive machine alternatives to humans in almost all societal functions, such as economic labor, decision making, artistic creation, and even companionship.
A gradual loss of control of our own civilization might sound implausible. Hasn't technological disruption usually improved aggregate human welfare? We argue that the alignment of societal systems with human interests has been stable...
While I concur that power concentration is a highly probable outcome, I believe complete disempowerment warrants deeper consideration, even under the assumptions you've laid out. Here are some thoughts on your specific points:
This post presents a mildly edited form of a new paper by UK AISI's alignment team (the abstract, introduction and related work section are replaced with an executive summary). Read the full paper here.
AI safety via debate is a promising method for solving part of the alignment problem for ASI (artificial superintelligence).
TL;DR Debate + exploration guarantees + solution to obfuscated arguments + good human input solves outer alignment. Outer alignment + online training solves inner alignment to a sufficient extent in low-stakes contexts.
This post sets out:
These gaps form the basis for one of the research agendas of UK AISI’s new alignment team: we aim to dramatically scale up ASI-relevant research on debate. We’ll...
The Dodging systematic human errors in scalable oversight post is out; as you saw, we can mostly take the conversation over there. But briefly, I think I'm mostly just more bullish on the margin than you about (1) the probability that we can in fact make purchase on the hard philosophy, should that be necessary, and (2) the utility we can get out of solving other problems should the hard philosophy problems remain unsolved. The goal with the dodging human errors post would be that if we fail at case (1), we're more likely to recognise it and try to get util...
Summary: Both our (UK AISI's) debate safety case sketch and Anthropic’s research agenda point at systematic human error as a weak point for debate. This post talks through how one might strengthen a debate protocol to partially mitigate this.
The complexity theory models of debate assume some expensive verifier machine $M$ with access to a human oracle $H$, such that
Typically, $M$ is some recursive tree computation, where for simplicity we can think of human oracle queries as occurring at the leaves of the tree. Key design elements encapsulated by the verifier machine (rather than by the interactive proof protocol) are those related...
How do you decide what to set ε to? You mention "we want assumptions about humans that are sensible a priori, verifiable via experiment", but I don't see how ε can be verified via experiment: for many of the questions we'd want the human oracle to answer, there is no source of ground-truth answers against which to compare the human answers.
With unbounded Alice and Bob, this results in an equilibrium where Alice can win if and only if there is an argument that is robust to an ε-fraction of errors.
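Spelling the claim out a little (this is my paraphrase of the equilibrium condition, reusing the $M$ and $H$ notation above):

\[
\text{Alice wins} \;\iff\; \exists\ \text{an argument tree } T \text{ such that } M \text{ accepts } T \text{ under every oracle } H' \text{ that disagrees with } H \text{ on at most an } \varepsilon\text{-fraction of the queries } T \text{ makes.}
\]

On this reading, the protocol can only certify answers whose supporting arguments don't hinge on any small set of potentially erroneous human judgements, which is the sense in which it partially mitigates systematic human error.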
How should I think about, or build up some intuitions...
tl;dr it seems that you can get basic tiling to work by proving that there will be safety proofs in the future, rather than trying to prove safety directly.
"Tiling" here roughly refers to a state of affairs in which we have a program that is able to prove itself safe to run. I'll use a simple problem to keep this post self-contained, but here are some links to some relevant discussion from the past.
The idea I work through below is not new; here is Giles saying it 13 years ago. It's going to be brittle as well, but it seems to me like it's relevant to a general answer for tiling. I'd appreciate engagement, pointers as to why this isn't a great solution, literature references, discussion, etc.
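For reference, the Löbian machinery this leans on is just the standard theorem (my summary, writing $\Box P$ for "$P$ is provable"):

\[
\text{If } \vdash \Box P \rightarrow P \text{ then } \vdash P;
\qquad \text{internally,} \qquad
\vdash \Box\bigl(\Box P \rightarrow P\bigr) \rightarrow \Box P.
\]

As I read the tl;dr above, the trick is to take $P$ to be the safety property and have the current program prove only that a proof of $P$ will exist before the successor acts, rather than proving $P$ directly; Löb-style reasoning is what lets that weaker claim do the work.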
I...
Sorry, I was wrong. By Löb's theorem, all versions of the program are provably equivalent, so they will trust each other.
I think the most dangerous version of 3 is a sort of Chesterton's fence, where people get rid of seemingly unjustified social norms without realizing that they were socially beneficial. (The decline in high-g birthrates might be an example.) Though social norms are instrumental values, not beliefs, and when a norm was originally motivated by a mistaken belief, it can still be motivated by recognizing that the norm is useful, which doesn't require holding on to the mistaken belief.
I think that makes sense, but sometimes you can't necessarily motivate a usef...