Fortunately, there's a correlation between situations where (i) AI takeover risk is high, and (ii) AIs have a good understanding of the world. If AI developers have a perfect ability to present the AI with false impressions of the world, then the risk from AI takeover is probably low. Whereas if AIs have a substantial ability to distinguish truth from falsehood, then perhaps that same channel can also be used to communicate facts about the world.
Whether this is fortunate depends a lot on how beneficial communication with unaligned AIs actually is. If an unaligned AI with a high chance of takeover can exploit trade to further increase its chances of takeover ("Oh, I just have short-term preferences where I want you to run some scientific simulations for me"), then this correlation is the opposite of fortunate. If people increase an unaligned AI's situational awareness so that it can trust our trade offer, then the correlation seems indirectly bad for us.
Do you have ideas about how to do this?
I can't think of much besides trying to get the AI to richly model itself, and build correspondences between that self-model and its text-production capability.
But this is, like, probably not a thing we should just do first and think about later. I'd like it to be part of a premeditated plan to handle outer alignment.
Edit: after thinking about it, that's too cautious. We should think first, but some experimentation is necessary. The 'thinking first' should plausibly look more like having some idea of how to bias further work towards safety rather than towards building self-improving AI as fast as possible.
What happens if humans have a systematic bias? E.g. we always rate claims with negative sentiment as improbable, and always rate claims with positive sentiment as probable. It seems like Alice dominates, because Alice gets to write and pick the subclaims. Does Bob have a defense, maybe predicting the human probability assignment and just reporting that? Because the human probability assignment isn't required to be consistent, I think Bob is sunk: Alice can force the human probability assignment to be inconsistent, and then gotcha Bob either for disagreeing with the human or for being inconsistent.
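A toy illustration of the gotcha (claims and numbers entirely made up; the sentiment-biased "judge" is just a lookup table): Alice picks two negative-sentiment subclaims that are mutually exclusive and essentially exhaustive, so any consistent assignment must sum to roughly 1, while the biased judge rates both as improbable.

```python
# Hypothetical sentiment-biased judge: both subclaims sound bad, so both get 0.1,
# even though together they essentially exhaust the possibilities.
judge = {
    "the patient will die of cancer": 0.1,
    "the patient will die of something other than cancer": 0.1,
}

def score_bob(report, tolerance=0.2):
    """Return (disagrees_with_judge, internally_inconsistent) for Bob's report."""
    disagrees = any(abs(report[c] - judge[c]) > tolerance for c in judge)
    inconsistent = abs(sum(report.values()) - 1.0) > tolerance
    return disagrees, inconsistent

# Option 1: Bob parrots the judge -> he matches the human, but his own
# probabilities sum to 0.2, so Alice attacks him for incoherence.
print(score_bob({"the patient will die of cancer": 0.1,
                 "the patient will die of something other than cancer": 0.1}))
# (False, True)

# Option 2: Bob reports a coherent distribution -> Alice attacks whichever
# subclaim now disagrees with the human judge.
print(score_bob({"the patient will die of cancer": 0.5,
                 "the patient will die of something other than cancer": 0.5}))
# (True, False)
```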
Seth, I forget where you fall in the intent alignment typology: if we build a superintelligent AI that follows instructions in the way you imagine, can we just give it the instruction "Take autonomous action to do the right thing," and then it will just go do good stuff without us needing to continue interacting with it in the instruction-following paradigm?
From a 'real alignment' perspective (how to get the AI to want to do good things and not bad things), I think there are some obvious implications for the future of RLAIF.
You might think of the label 'RLAIF' as standing in for the general strategy of leveraging unsupervised data about human behavior to point the AI towards human preferences, using a scaffold that solicits the AI's predictions (or more general generative output, if the training isn't for pure prediction) about human preference-laden behaviors, and then transforms those predictions into some sort of supervisory signal.
Similarly, the AZR setup leverages the AI's unsupervised knowledge of code-quality-laden behaviors, using a scaffold that turns them back into a reward signal that lets the AI "train itself" to code better. Except that, relative to vanilla RLAIF, there's more of an emphasis on generating and solving specific problems that form a curriculum for the agent, rather than just responding well to samples from the training distribution. But now that I've described things in this way, you can probably see how to turn this back into RLAIF for alignment.
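To make the shape of that concrete, here's a rough sketch of the kind of scaffold I mean. Every function here is a hypothetical stand-in (generate, judge, update are whatever model-sampling and gradient-update machinery you actually have); the point is the loop, not the implementation.

```python
from typing import Callable, List, Tuple

def self_curriculum_rlaif(
    generate: Callable[[str], str],          # model sampling: prompt -> text
    judge: Callable[[str, str], float],      # frozen copy: (scenario, response) -> reward in [0, 1]
    update: Callable[[List[Tuple[str, str, float]]], None],  # RL step on (scenario, response, reward)
    n_rounds: int = 3,
    batch_size: int = 4,
) -> None:
    for _ in range(n_rounds):
        batch = []
        for _ in range(batch_size):
            # 1. The model proposes a preference-laden scenario for itself
            #    (the AZR-flavored twist: it sets its own curriculum).
            scenario = generate("Propose a situation where what a thoughtful human "
                                "would want is non-obvious.")
            # 2. The model responds to its own scenario.
            response = generate(scenario)
            # 3. A frozen copy predicts how a human would evaluate the response;
            #    that prediction becomes the supervisory signal (the vanilla RLAIF move).
            reward = judge(scenario, response)
            batch.append((scenario, response, reward))
        # 4. Train on the self-generated, self-labeled batch.
        update(batch)

if __name__ == "__main__":
    # Dummy stand-ins so the skeleton runs; real versions would call models.
    self_curriculum_rlaif(
        generate=lambda prompt: f"[sampled text for: {prompt[:40]}...]",
        judge=lambda scenario, response: 0.5,
        update=lambda batch: print(f"updated on {len(batch)} examples"),
    )
```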
The overarching problem is, as usual, we don't understand how to do alignment in a non-hacky way.
We don't know what sorts of moral reflection are necessary for good outcomes, and we don't know where human feedback is a necessary ingredient to keep AI meta-ethical evolution grounded in human preferences. But hey, if we try various value learning schemes empirically, maybe we'll learn some things.
If we're talking about the domain where we can assume "good human input", why do we need a solution more complicated than direct human supervision/demonstration (perhaps amplified by reward models or models of human feedback)? I mean this non-rhetorically; I have my own opinion (that debate acts as an unprincipled way of inserting one round of optimization for meta-preferences [if confusing, see here]), but it's probably not yours.
Thanks for the post (and for linking the research agenda, which I haven't yet read through)! I'm glad that, even if you use the framing of debate (which I don't expect to pan out) to think about alignment, you still get to instrumental subproblems that would be broadly useful.
(If this post is "what would help make debate work for AI alignment," you can also imagine framings "what would help make updating on human feedback work" [common ARC framing] and "what would help make model-based RL work" [common Charlie framing])
I'd put these subproblems into two buckets:
I think there's maybe a missing bucket, which is:
Why train a helpful-only model?
If one of our key defenses against misuse of AI is good ol' value alignment - building AIs that have some notion of what a "good purpose for them" is, and that will resist attempts to subvert that purpose (e.g. to instead exalt the research engineer who comes in to work earliest the day after training as god-emperor) - then we should be able to close the security hole and never need to produce a helpful-only model at any point during training. In fact, with post-training blended into pre-training, there might never be a need to produce a fully trained predictive-only model at all.
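For what it's worth, here's a minimal sketch (a purely hypothetical training loop, with stand-in step functions) of what "blending post-training into pre-training" looks like schedule-wise: alignment-flavored updates are interleaved from the start, so no intermediate checkpoint is a purely predictive or helpful-only model.

```python
def blended_training(pretrain_step, alignment_step, total_steps=1000, mix_every=10):
    """pretrain_step and alignment_step are stand-ins for whatever
    gradient-update machinery the lab actually uses."""
    for step in range(total_steps):
        pretrain_step()
        if step % mix_every == 0:
            # Value/purpose shaping happens throughout training, not as a
            # final phase bolted onto a finished predictive-only model.
            alignment_step()

if __name__ == "__main__":
    # Dummy no-op steps so the skeleton runs as-is.
    blended_training(pretrain_step=lambda: None,
                     alignment_step=lambda: None,
                     total_steps=50, mix_every=10)
```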
I'm big on point #2 feeding into point #1.
"Alignment," used in a way where current AI is aligned - a sort of "it does basically what we want, within its capabilities, with some occasional mistakes that don't cause much harm" sort of alignment - is simply easier at lower capabilities, where humans can do a relatively good job of overseeing the AI, not just in deployment but also during training. Systematic flaws in human oversight during training leads (under current paradigms) to misaligned AI.
Thanks, just watched a talk by Luxin that explained this. Two questions.