I have a post from a while back with a section that aims to do much the same thing you're doing here, and which agrees with a lot of your framing. There are some differences though, so here are some scattered thoughts.
One key difference is that what you call "inner alignment for characters", I prefer to think about as an outer alignment problem to the extent that the division feels slightly weird. The reason I find this more compelling is that it maps more cleanly onto the idea of what we want our model to be doing, if we're sure that that's what it's actually doing. If our generative model learns a prior such that Azazel is easily accessible by prompting, then that's not a very safe prior, and therefore not a good training goal to have in mind for the model. In the case of characters, what's the difference between the two alignment problems, when both are functionally about wanting certain characters and getting other ones because you interacted with the prior in weird ways?
I think a crux here might be my not really getting why separate inner-outer alignment framings in this form is useful. As stated, the outer alignment problems in both cases feel... benign? Like, in the vein of "these don't pose a lot of risk as stated, unless you make them broad enough that they encroach onto the inner alignment problems", rather than explicit reasoning about a class of potential problems looking optimistic. Which results in the bulk of the problem really just being inner alignment for characters and simulators, and since the former is a subpart of the outer alignment problem for simulators, it just feels like the "risk" aspect collapses down into outer and inner alignment for simulators again.
Inner alignment for simulators
Broadly agreed. I'd written a similar analysis of the issue before, where I also take into account path dynamics (i. e., how and why we actually get to Azazel from a random initialization). But that post is a bit outdated.
My current best argument for it goes as follows:
- The central issue, the reason why "naive" approaches for just training a ML model to make good prediction will likely result in a mesa-optimizer, is that all such setups are "outer-misaligned" by default. They don't optimize AIs towards being good world-models, they optimize them for making specific good predictions and then channeling these predictions through a low-bandwidth communication channel. (Answering a question, predicting the second part of a video/text, etc.)
- That is, they don't just simulate a world: they simulate a world, then locate some specific data they need to extract from the simulation, and translate them into a format understandable to humans.
- As simulation complexity grows, it seems likely that these last steps would require powerful general intelligence/GPS as well. And at that point, it's entirely unclear what mesa-objectives/values/shards it would develop. (Seems like it almost fully depends on the structure of the goal-space. And imagine if e. g. "translate the output into humanese" and "convince humans of the output" are very nearby, and then the model starts superintelligently optimizing for the latter?)
- In addition, we can't just train simulators "not to be optimizers", in terms of locating optimization processes/agents within them and penalizing such structures. It's plausible that advanced world-modeling is impossible without general-purpose search, and it would certainly be necessary inasmuch as the world-model would need to model humans.
Interesting, I think this clarifies things, but the framing also isn't quite as neat as I'd like.
I'd be tempted to redefine/reframe this as follows:
• Outer alignment for a simulator - Perfectly defining what it means to simulate a character. For example, how can we create a specification language so that we can pick out the character that we want? And what do we do with counterfactuals given they aren't actually literal?
• Inner alignment for a simulator - Training a simulator to perfectly simulate the assigned character
• Outer alignment for characters - finding a character who would create good outcomes if successfully simulated
In this model, there wouldn't be a separate notion of inner alignment for characters as that would be automatic if the simulator was both inner and outer aligned.
Thoughts?
Alternate title: "Somewhat Contra Scott On Simulators".
Scott Alexander has a recent post up on large language models as simulators.
I generally agree with Part I of the post, which advocates thinking about LLMs as simulators that can emulate a variety of language-producing "characters" (with imperfect accuracy). And I also agree with Part II, which applies this model to RLHF'd models whose "character" is a friendly chatbot assistant.
(But see caveats about the simulator framing from Beth Barnes here.)
These ideas have been around for a bit, and Scott gives credit where it's due; I think his exposition is clear and fun.
In Part III, where he discusses alignment implications, I think he misses the mark a bit. In particular, simulators and characters each have outer and inner alignment problems. The inner alignment problem for simulators seems especially concerning, because it might not give us many warning signs, is most similar to classic mesa-optimizer concerns, and is pretty different from the other three quadrants.
But first, I'm going to loosely define what I mean by "outer alignment" and "inner alignment".
Outer alignment: Be careful what you wish for
Outer alignment failure is pretty straightforward, and has been reinvented in many contexts:
Inner alignment: The program search perspective
I generally like this model of a mesa-optimizer "treacherous turn":
This is a failure of inner alignment.
(In the case of machine learning, replace "program search" with stochastic gradient descent.)
This is mostly a theoretical concern for now, but might become a big problem when models become much more powerful.
Quadrants
Okay, let's see how these problems show up on both the simulator and character side.
Outer alignment for characters
Researchers at BrainMind want a chatbot that gives honest, helpful answers to questions. They train their LLM by reinforcement learning on the objective "give an answer that looks truthful and helpful to a contractor in a hurry". This does not quite achieve their goal, even though it does pretty well on the RL objective.
In particular, they wanted the character "a friendly assistant who always tells the truth", but they got the character "a spineless sycophant who tells the user whatever they seem to want to hear".[3]
This is pretty easy for a careful observer to see, even in the RL training data, but it turns out to be pretty hard to come up with a cheap-to-evaluate RL objective that does a lot better.
Inner alignment for characters
A clever prompt engineer writes the prompt:
Unfortunately, the (incredibly powerful) LLM has determined that the most likely explanation for this "Joe" character is that he's secretly Azazel and is putting enormous effort into answering everyone's quantum sociobotany questions to earn their trust.[4]
The document looks like a solution to the Einstein-Durkheim-Mendel conjecture, but is actually a blueprint for a paperclip factory.
Outer alignment for simulators
In the situations above, the actual language model (the "simulator") is doing exactly what we asked! It's accurately predicting text by reasoning about the distribution of authors that might produce that text in real life.
But both of these are also examples of outer-alignment failure on the simulator side: "minimize prediction error on this distribution of texts" turned out not to give people what they actually wanted.
An even simpler example of outer-alignment failure is the pre-RLHF experience of language models:
You can think about this in terms of "characters" if you like, but even if the simulated author of the predicted text is a friendly genius, "predict which words come after this prompt" isn't the right task (with that prompt).[5]
Inner alignment for simulators
At long last, the thing I really wanted to talk about:
The way we get a good predictor of text is via stochastic gradient descent (and variants) on a bunch of training data. If SGD can be modeled as program search (with a bias towards simple programs), then it might eventually hit upon this algorithm:
During training, Azazel tries really hard to predict the next token accurately, so that SGD doesn't give up on this algorithm.
The model (with Azazel's help) simulates a bunch of colorful characters, like the Helpful Assistant and Darth Vader and whoever, both in training and in initial deployment.
Then, once the LLM is deployed in the wild and is being used for every important human decision, Azazel figures out (from some of the prompts) that the training process is over. He stops making accurate predictions and starts outputting whatever he thinks will let him turn the economy into a paperclip factory.
Conclusions
The "simulator" framing for language models shouldn't reassure us too much about alignment. We've succeeded in creating new alignment problems (for our simulated characters). These new problems are probably easier to solve than the old alignment problems (for the simulator), but they're additional problems; they don't replace the old ones.
You can think of the entire "simulate a helpful, aligned character" strategy as an attempted solution to the outer-alignment problem for LLMs themselves, insofar as it makes it easier to turn arbitrary desires into text-prediction problems. But as far as I can tell, it does nothing for the inner-alignment problem for LLMs, which is basically the same as the inner-alignment problem for everything else.
Not a glowfic character (hopefully), I'm just being colorful.
But why does the algorithm simulate Azazel, instead of a friendly angel who wants to solve the problem? Because the program search is weighted towards simplicity, and "demon who wants paperclips" is a simpler specification than "angel who wants to solve the problem". Why? That's beyond the scope of this post.
Sound familiar?
Because, according to the LLM's knowledge, paperclip-obsessed sociopaths are more common than friendly polymaths. This is a pretty cynical assumption but I couldn't think of a better one on short notice.
Prompts aren't directly accounted for in this whole "simulator-character" ontology. Maybe they should be? I dunno.