I think a potentially promising and undertheorized approach to AI safety, especially in short timelines, is natural language alignment (NLA), a form of AI-assisted alignment in which we leverage the model’s rich understanding of human language to help it develop and pursue a safe notion of social value by bootstrapping up from a natural language expression of alignment, such as giving the nascent AGI a corpus of alignment research and a simple command like, “There's this thing called alignment that we’ve tried to develop in this corpus. Be aligned in the sense we’re trying to get at.” In a world of complex alignment theory, NLA might sound silly, but I think we have a lot of raw power in the self-supervised learning core (the “shoggoth”) of large language models (LLMs) that I think we can leverage to address the ambiguity of “alignment” and “values” alongside other approaches.
A key objection to the NLA paradigm is that the initial command would not contain enough bits of information for alignment, but the hope is that we can leverage an near-AGI LLM+'s rich linguistic understanding to address this. Almost every conventional AI safety researcher I’ve discussed this with says something like, “How would those commands contain enough information to align the AGI, especially given pervasive issues like deception and collusion?” and I think every paradigm needs to answer a version of this.
Very roughly, that could mean: agent foundations gets information from new theorems in math and decision theory; mechanistic interpretability gets information by probing and tuning models piece-by-piece; iterated amplification gets information by decomposing problems into more easily solved subproblems; and NLA gets information from natural language by combining a simple command with the massive self-supervised learner in LLMs that builds an apparently very detailed model of the world from next-word prediction and beam search. In this sense, NLA does not require the natural language command to somehow contain all the bits of a well-developed alignment theory or for the AI to create information out of nothing; the information is gleaned from the seed of that command, the corpuses it was trained on, and its own emergent reasoning ability.
NLA seems like one of the most straightforward pathways to safe AGI from the current capabilities frontier (e.g., GPT-4, AutoGPT-style architectures in which subsystems communicate with each other in natural language), and arguably it’s already the target of OpenAI and Anthropic, but I think it has received relatively little explicit theorization and development in the alignment community. The main work I would put in or near this paradigm is reinforcement learning from AI feedback (RLAIF, like RLHF), such as Anthropic's constitutional AI (Bai et al. 2022), such as by having models that iteratively build better and better constitutions. That process could involve human input, but having “the client” too involved with the process can lead to superficial and deceptive actions so I am primarily thinking of RLAIF, i.e., without the (iterative) H. This could be through empirical approaches, such as building RLAIF alignment benchmarks, or theoretical approaches, such as understanding what constitutes linguistic meaning and how semantic embeddings can align over the course of training or reasoning.
NLA can dovetail complementary approaches like RLHF and scalable oversight. Maybe NLA isn’t anything new relative to those paradigms, and there are certainly many other objections and limitations, but I think it's a useful paradigm to keep in mind.
Thanks for quick pre-publication feedback from Siméon Campos and Benjamin Sturgeon.
“Safe” is intentionally vague here, since the idea is not to fully specify our goals in advance (e.g., corrigibility, non-deception), but to have the model build those through extraction and refinement.
Lawrence Chan calls Anthropic’s research agenda “just try to get the large model to do what you want,” which may be gesturing in a similar theoretical direction.