I have a post from a while back with a section that aims to do much the same thing you're doing here, and which agrees with a lot of your framing. There are some differences though, so here are some scattered thoughts.
One key difference is that what you call "inner alignment for characters", I prefer to think about as an outer alignment problem to the extent that the division feels slightly weird. The reason I find this more compelling is that it maps more cleanly onto the idea of what we want our model to be doing, if we're sure that that's what it's actually doing. If our generative model learns a prior such that Azazel is easily accessible by prompting, then that's not a very safe prior, and therefore not a good training goal to have in mind for the model. In the case of characters, what's the difference between the two alignment problems, when both are functionally about wanting certain characters and getting other ones because you interacted with the prior in weird ways?
I think a crux here might be my not really getting why separating the inner and outer alignment framings in this form is useful. As stated, the outer alignment problems in both cases feel... benign? Like, in the vein of "these don't pose a lot of risk as stated, unless you make them broad enough that they encroach onto the inner alignment problems", rather than explicit reasoning about a class of potential problems looking optimistic. Which results in the bulk of the problem really just being inner alignment for characters and simulators, and since the former is a subpart of the outer alignment problem for simulators, it just feels like the "risk" aspect collapses down into outer and inner alignment for simulators again.
My take on the scaled-up models exhibiting the same behaviours feels more banal - larger models are better at simulating agentic processes and their connection to self-preservation desires etc, so the effect is more pronounced. Same cause, different routes getting there with RLHF and scale.
I wasn't really focusing on the RL part of RLHF in making the claim that it makes the "agentic personas" problem worse, if that's what you meant. I'm pretty on board with the idea that the actual effects of using RL as opposed to supervised fine-tuning won't be apparent until we use stronger RL or something. Then I expect we'll get even weirder effects, like separate agentic heads or the model itself becoming something other than a simulator (which I discuss in a section of the linked post).
My claim is pretty similar to how you put it - in RLHF as in fine-tuning of the kind relevant here, we're focusing the model onto outputs that are generated by better agentic personas. But I think that the effect is particularly salient with RLHF because it's likely to be scaled up more in the future, where I expect said effect to be exacerbated. I agree with the rest of it, that prompt engineering is unlikely to produce the same effect, and definitely not the same qualitative shift of the world prior.
Thanks for this post! I wanted to write a post about my disagreements with RLHF in a couple weeks, but your treatment is much more comprehensive than what I had in mind, and from a more informed standpoint.
I want to explain my position on a couple points in particular though - they would've been a central focus of what I imagined my post to be, points around which I've been thinking a lot recently. I haven't talked to a lot of people about this explicitly so I don't have high credence in my take, but it seems at least worth clarifying.
RLHF is less safe than imitation or conditioning generative models.
My picture of why taking ordinary generative models and conditioning them to various ends (like accelerating alignment, for example) is useful relies on a key crux that the intelligence we're wielding is weighted by our world prior. We can expect it to be safe insofar as things normally sampled from the distribution underlying our universe are, modulo arbitrarily powerful conditionals (which degrade performance to an extent anyway) while moving far away from the default world state.
So here's one of my main reasons for not liking RLHF: it removes this very satisfying property. Models that have been RLHF'd (so to speak) have different world priors in ways that aren't really all that intuitive (see Janus' work on mode collapse, or my own prior work which addresses this effect in these terms more directly since you've probably read the former). We get a posterior that doesn't have the nice properties we want of a prior based directly on our world, because RLHF is (as I view it) a surface-level instrument we're using to interface with a high-dimensional ontology. Making toxic interactions less likely (for example) leads to weird downstream effects in the model's simulations because it'll ripple through its various abstractions in ways specific to how they're structured inside the model, which are probably pretty different from how we structure our abstractions and how we make predictions about how changes ripple out.
So, using these models now comes with the risk that when we really need them to work for pretty hard tasks, we don't have the useful safety measures implied by being weighted by a true approximation of our world.
Another reason for not liking RLHF that's somewhat related to the Anthropic paper you linked: because most contexts in which RLHF is used involve agentic simulacra, RLHF focuses the model's computation on agency in some sense. My guess is that this explains to an extent the results in that paper - RLHF'd models are better at focusing on simulating agency, agency is correlated with self-preservation desires, and so on. This also seems dangerous to me because we're making agency more accessible to and powerful from ordinary prompting, more powerful agency is inherently tied to properties we don't really want in simulacra, and said agency of a sort is sampled from a not-so-familiar ontology to boot.
(Only skimmed the post for now because I'm technically on break, so it's possible I missed something crucial.)
Do you think the default is that we'll end up with a bunch of separate things that look like internalized objectives so that the one used for planning can't really be identified mechanistically as such, or that only processes where they're really useful would learn them and that there would be multiple of them (or a third thing)? In the latter case I think the same underlying idea still applies - figuring out all of them seems pretty useful.
Yeah, this is definitely something I consider plausible. But I don't have a strong stance because RL mechanics could lead to there being an internal search process for toy models (unless this is just my lack of awareness of some work that proves otherwise). That said, I definitely think that work on slightly larger models would be pretty useful and plausibly alleviates this, and is one of the things I'm planning on working on.
This is cool! Ways to practically implement something like RAT felt like a roadblock in how tractable those approaches were.
I think I'm missing something here: Even if the model isn't actively deceptive, why wouldn't this kind of training provide optimization pressure toward making the Agent's internals more encrypted? That seems like a way to be robust against this kind of attack without a convenient early circuit to target.
I think OpenAI's approach to "use AI to aid AI alignment" is pretty bad, but not for the broader reason you give here.
I think of most of the value from that strategy as downweighting probability for some bad properties - in the conditioning LLMs to accelerate alignment approach, we have to deal with preserving myopia under RL, deceptive simulacra, human feedback fucking up our prior, etc, but there's less probability of adversarial dynamics from the simulator because of myopia, there are potentially easier channels to elicit the model's ontology, we can trivially get some amount of acceleration even in worst-case scenarios, etc.
I don't think of these as solutions to alignment as much as reducing the space of problems to worry about. I disagree with OpenAI's approach because it views these as solutions in themselves, instead of as simplified problems.
I like this post! It clarifies a few things I was confused on about your agenda and the progress you describe sounds pretty damn promising, although I only have intuitions here about how everything ties together.
In the interest of making my abstract intuition here more precise, a few weird questions:
Put all that together, extrapolate, and my 40% confidence guess is that over the next 1-2 years the field of alignment will converge toward primarily working on decoding the internal language of neural nets. That will naturally solidify into a paradigm involving interpretability work on the experiment side, plus some kind of theory work figuring out what kinds of meaningful data structures to map the internals of neural networks to.
What does your picture of (realistically) ideal outcomes from theory work look like? Is it more giving interpretability researchers a better frame to reason under (like a more mathematical notion of optimization that we have to figure out how to detect in large nets against adversaries) or something even more ambitious that designs theoretical interpretability processes that Just Work, leaving technical legwork (what ELK seems like to me)?
While they definitely share core ideas of ontology mismatch, it feels like the approaches are pretty different in that you prioritize mathematical definitions a lot and ARC's is more heuristic. Do you think the mathematical stuff is necessary for sufficient deconfusion, or just a pretty tractable way to arrive at the answers we want?
We can imagine, e.g., the AI imagining itself building a sub-AI while being prone to various sorts of errors, asking how it (the AI) would want the sub-AI to behave in those cases, and learning heuristics that would generalize well to how we would want the AI to behave if it suddenly gained a lot of capability or was considering deceiving its programmers and so on.
I'm not really convinced that even if corrigibility is A Thing (I agree that it's plausible it is, but I think it could also just be trivially part of another Thing given more clarity), it's as good as other medium-term targets. Corrigibility as stated doesn't feel like it covers a large chunk of the likely threat models, and a broader definition seems like it's just rephrasing a bunch of the stuff from Do What I Mean or inner alignment. What am I missing about why it might be as good a target?
generate greentexts from the perspective of the attorney hired by LaMDA through Blake Lemoine
The complete generated story here is glorious, and I think might deserve explicit inclusion in another post or something. Though I think that of the other stories you've generated as well, so maybe my take here is just to have more deranged meta GPT posting.
it seems to point at an algorithmic difference between self-supervised pretrained models and the same models after a comparatively small amount optimization from the RLHF training process which significantly changes out-of-distribution generalization.
text-davinci-002 is not an engine for rendering consistent worlds anymore. Often, it will assign infinitesimal probability to the vast majority of continuations that are perfectly consistent by our standards, and even which conform to the values OpenAI has attempted to instill in it like accuracy and harmlessness, instead concentrating almost all its probability mass on some highly specific outcome. What is it instead, then? For instance, does it even still make sense to think of its outputs as “probabilities”?
It was impossible not to note that the type signature of text-davinci-002’s behavior, in response to prompts that elicit mode collapse, resembles that of a coherent goal-directed agent more than a simulator.
I feel like I'm missing something here, because in my model most of the observations in this post seem like they can be explained under the same paradigm through which we view the base davinci model. Specifically, that the reward model RLHF is using "represents" in an information-theoretic sense a signal for the worlds represented by the fine-tuning data. So what RLHF seems to be doing to me is shifting the world prior that GPT learned during pre-training, to one where whatever the reward signal represents is just much more common than in our world - like if GPT's pre-training data inherently contained a hugely disproportionate amount of equivocation and plausible deniability statements, it would just simulate worlds where that's much more likely to occur.
(To be clear, I agree that RLHF can probably induce agency in some form in GPTs, I just don't think that's what's happening here).
The attractor states seem like they're highly likely properties of these resultant worlds, like adversarial/unhinged/whatever interactions are just unlikely (because they were downweighted in the reward model) and so you get anon leaving as soon as he can because that's more likely on the high prior conditional of low adversarial content than the conversation suddenly becoming placid, and some questions actually are just shallowly matching to controversial and the likely response in those worlds is just to equivocate. In that latter example in particular, I don't see the results being that different from what we would expect if GPT's training data was from a world slightly different to ours - injecting input that's pretty unlikely for that world should still lead back to states that are likely for that world. In my view, that's like if we introduced a random segue in the middle of a wedding toast prompt of the form "you are a murderer", and it still bounces back to being wholesome (this worked when I tested it).
Regarding ending a story to start a new one - I can see the case for why this is framed as the simulator dynamics becoming more agentic, but it doesn't feel all that qualitatively different from what happens in current models - the interesting part seems to be the stronger tendency toward the new worlds the RLHF'd model finds likely, which seems like it's just expected behaviour as a simulator becomes more sure of the world it's in / has a more restricted worldspace. I would definitely expect that if we could come up with a story that was sufficiently OOD of our world (although I think this is pretty hard by definition), it would figure out some similar mechanism to oscillate back to ours as soon as possible (although this would also be much harder with base GPT because it has less confidence of the world it's in) - that is, that the story ending is just one of many levers a simulator can pull, like a slow transition, only here the story was such that ending it was the easiest way to get into its "right" worldspace. I think that this is slight evidence for how malign worlds might arise from strong RLHF (like with superintelligent simulacra), but it doesn't feel like it's that surprising from within the simulator framing.
The RNGs seem like the hardest part of this to explain, but I think they can be seen as the outcome of making the model more confident about the world it's simulating, because of the worldspace restriction from the fine-tuning - it's plausible that the abstractions that build up RNG contexts in most of the instances we would try are affected by this (it not being universal seems like it can be explained under this - there's no reason why all potential abstractions would be affected).
Separate thought: this would explain why increasing the temperature doesn't affect it much, and why I think the space of plausible / consistent worlds has shrunk tremendously while still leaving the most likely continuations as being reasonable - it starts from the current world prior, and selectively amplifies the continuations that are more likely under the reward model's worlds. Its definition of "plausible" has shifted; and it doesn't really have cause to shift around any unamplified continuations all that much.
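To make the "selectively amplifies" picture a bit more concrete, here's a toy sketch. It assumes the idealized closed form for KL-regularized reward maximization, where the tuned distribution is proportional to prior(x) * exp(reward(x) / beta). That's my assumption for illustration, not a claim about what text-davinci-002 actually computes, and the five "continuations", their prior probabilities, the rewards and beta are all made-up numbers.

```python
import math

def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]

def reweight(prior, rewards, beta):
    """Idealized KL-regularized optimum: prior(x) * exp(reward(x) / beta), renormalized."""
    return normalize([p * math.exp(r / beta) for p, r in zip(prior, rewards)])

def apply_temperature(probs, temperature):
    """Sampling at temperature T is equivalent to raising probabilities to 1/T and renormalizing."""
    return normalize([p ** (1.0 / temperature) for p in probs])

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical toy numbers: five continuations, with the reward model favouring index 1.
prior = [0.30, 0.25, 0.20, 0.15, 0.10]
rewards = [0.0, 3.0, 0.0, 0.0, 0.0]
beta = 0.5  # assumed KL coefficient; smaller means stronger optimization

posterior = reweight(prior, rewards, beta)

# Raise the sampling temperature on both distributions for comparison.
hot_prior = apply_temperature(prior, 1.5)
hot_posterior = apply_temperature(posterior, 1.5)

print("prior     ", [round(p, 3) for p in prior], "entropy", round(entropy(prior), 3))
print("posterior ", [round(p, 3) for p in posterior], "entropy", round(entropy(posterior), 3))
print("hot prior ", [round(p, 3) for p in hot_prior])
print("hot post. ", [round(p, 3) for p in hot_posterior])
```

In this toy setup the amplified continuation takes almost all of the mass, raising the temperature flattens the prior noticeably but barely dents the reweighted distribution, and the unamplified continuations keep exactly the same ratios to each other that they had under the prior. That's the shape of behaviour I'd expect from "same simulator, narrower world prior", though obviously a five-element list isn't evidence about the real model.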
Broadly, my take is that these results are interesting because they show how RLHF affects simulators, their reward signal shrinking the world prior / making the model more confident of the world it should be simulating, and how this affects what it does. A priori, I don't see why this framing doesn't hold, but it's definitely possible that it's just saying the same things you are and I'm reading too much into the algorithmic difference bit, or that it simply explains too much, in which case I'd love to hear what I'm missing.