"Aligned" foundation models don't imply aligned systems

Max H

[The basic idea in this post is probably not original to me, since it's somewhat obvious when stated directly. But it seems particularly relevant and worth distilling in light of recent developments with LLM-based systems, and because I keep seeing arguments which seem confused about it.]

Alignment is a property of agents, or even more generally, of systems. An "aligned model" is usually^[1] a type error.

Often when a new state-of-the-art ML model is developed, the first thing people ask is what it can do when instantiated in the most obvious way possible: given an input, execute the function represented by the model on that input and return the output. I'll refer to this embodiment as the "trivial system" for a given model.

For an LLM, the trivial system generates a single token for a given prompt. There's an obvious extension to this system, which is to feed the predicted token and the initial prompt back into the model again, and repeat until you hit a stop sequence or a maximum length. This is the system you get when you make a single call to the OpenAI text or chat completion API. I'll name this embodiment the "trivial++ system".

You can take this much further, by building a chatbot interface around this API, hooking it up to the internet, running it in an agentic loop, or even more exotic arrangements of your own design. These systems have suddenly started working much better with the release of GPT-4, and AI capabilities researchers are just getting started. The capabilities of any particular system will depend on both the underlying model and the ways it is embodied: concretely, you can improve Auto-GPT by: (a) backing it with a better foundation model, (b) giving it access to more APIs and tools, and (c) improving the code, prompting, and infrastructure that it runs on.

Perhaps models to which techniques like RLHF have been applied will make it easy to build aligned systems and hard or impossible to build unaligned systems, but this is far from given.

Things like RLHF and instruction prompt tuning definitely make it easier to build powerful systems out of the foundation models to which they are applied. Does an RLHF'd model make it easier to build Order-GPT and harder to build Chaos-GPT? (I'm not sure; no Auto-GPT applications seem to really be working particularly well yet, but I wouldn't count on that trend continuing.)

So while RLHF is definitely a capabilities technique, it remains to be seen whether it can be called an alignment technique when applied to models embedded in non-trivial systems (though it is quite effective at getting the trivial++ system for GPT-4 not to say impolite things.)

Some examples where others seem confused or not careful about this distinction

From Evolution provides no evidence for the sharp left turn:

Let’s consider eight specific alignment techniques:
Reinforcement learning from human feedback
Constitutional AI
Instruction prompt tuning
Discovering Language Model Behaviors with Model-Written Evaluations
Pretraining Language Models with Human Preferences
Discovering Latent Knowledge in Language Models Without Supervision
More scalable methods of process based supervision
Using language models to write their own instruction finetuning data

Most of these seem like useful techniques for building or understanding larger foundation models. None of them self-evidently help with the alignment of nontrivial systems which use executions of those foundation models as a component, though it wouldn't surprise me if at least some of them were relevant.

From a different section of the same post:

In my frame, we've already figured out and applied the sharp left turn to our AI systems, in that we don't waste our compute on massive amounts of incredibly inefficient neural architecture search, hyperparameter tuning, or meta optimization. For a given compute budget, the best (known) way to buy capabilities is to train a single big model in accordance with empirical scaling laws such as those discovered in the Chinchilla paper, not to split the compute budget across millions of different training runs for vastly tinier models with slightly different architectures and training processes. In fact, we can be even more clever and use small models to tune the training process, before scaling up to a single large run, as OpenAI did with GPT-4.

"AI systems" in the first sentence should probably be "AI models". And then, I'd say that the best way to advance capabilities is to train a single big model and then wrap it up in a useful system with a slick UX and lots of access to APIs and other functionality.

From Deconfusing Direct vs Amortised Optimization:

Amortized optimization, on the other hand, is not directly applied to any specific problem or state. Instead, an agent is given a dataset of input data and successful solutions, and then learns a function approximator that maps directly from the input data to the correct solution. Once this function approximator is learnt, solving a novel problem then looks like using the function approximator to generalize across solution space rather than directly solving the problem.

The use of the word "agent" seems like a type error here. A function approximator is trained on a dataset of inputs and outputs. An agent may then use that function approximator as one tool in its toolbox when presented with a problem. If the problem is similar enough to the ones the function approximator was designed to approximate, it may be that the trivial system that executes the function approximator on the test input is sufficient, and you don't need to introduce an agent into the situation at all.

From My Objections to "We’re All Gonna Die with Eliezer Yudkowsky":

(emphasis mine, added to highlight the relevant parts in a longer quote without dropping the context.)

Yudkowsky brings up strawberry alignment
I mean, I wouldn't say that it's difficult to align an AI with our basic notions of morality. I'd say that it's difficult to align an AI on a task like 'take this strawberry, and make me another strawberry that's identical to this strawberry down to the cellular level, but not necessarily the atomic level'. So it looks the same under like a standard optical microscope, but maybe not a scanning electron microscope. Do that. Don't destroy the world as a side effect."

My first objection is: human value formation doesn't work like this. There's no way to raise a human such that their value system cleanly revolves around the one single goal of duplicating a strawberry, and nothing else. By asking for a method of forming values which would permit such a narrow specification of end goals, you're asking for a value formation process that's fundamentally different from the one humans use. There's no guarantee that such a thing even exists, and implicitly aiming to avoid the one value formation process we know is compatible with our own values seems like a terrible idea.
It also assumes that the orthogonality thesis should hold in respect to alignment techniques - that such techniques should be equally capable of aligning models to any possible objective.
This seems clearly false in the case of deep learning, where progress on instilling any particular behavioral tendencies in models roughly follows the amount of available data that demonstrate said behavioral tendency. It's thus vastly easier to align models to goals where we have many examples of people executing said goals. As it so happens, we have roughly zero examples of people performing the "duplicate this strawberry" task, but many more examples of e.g., humans acting in accordance with human values, ML / alignment research papers, chatbots acting as helpful, honest and harmless assistants, people providing oversight to AI models, etc. See also: this discussion.

Again, "aligning a model", "behavioral tendencies in models" and "providing oversight to AI models" are type errors in my ontology. This type error is important, because it hides an implicit assumption that either any systems in which "aligned" models are embedded will also be aligned, or that people will build only trivial systems out of these models. Both of these assumptions have already been falsified in rather dramatic fashion.

My mental model of the authors I've cited understand and mostly agree with the basic distinction I've made in this post. However, I think they were being not very careful about tracking it in some of the arguments and explanations above, and that this is more than just a pedantic point or argument over definitions.

This distinction is important, because AI system capabilities are currently advancing rapidly, independent of any DL paradigm-based improvements in foundation models. The very first paragraph of one of the posts quoted above summarizes the "sharp left turn" argument as factoring through SGD, but SGD is not the only way of pushing the capabilities frontier, and may not be the main one for much longer, as GOFAI approaches come back into vogue. (A possibility which, I note, the original authors of the sharp left turn argument foresaw.)

^{^}
Maybe if the models and training runs get large enough, an inner optimizer develops somewhere inside and even a single output from the trivial system embodied by a model becomes dangerous.
Or, perhaps the inner optimizer develops and "breaks out" entirely during the training process itself.
Regardless of how likely those failure modes are, we'll probably encounter less exotic-looking failure modes of systems which are built on top of non-dangerous models first, which is the topic of this post. Mesaoptmizers are just another possible way we might fail later, if we manage to solve some easier problems first.

14

"Aligned" foundation models don't imply aligned systems

14

Some examples where others seem confused or not careful about this distinction

Yudkowsky brings up strawberry alignment