All of Jozdien's Comments + Replies

I have a post from a while back with a section that aims to do much the same thing you're doing here, and which agrees with a lot of your framing. There are some differences though, so here are some scattered thoughts.

One key difference is that what you call "inner alignment for characters", I prefer to think about as an outer alignment problem to the extent that the division feels slightly weird. The reason I find this more compelling is that it maps more cleanly onto the idea of what we want our model to be doing, if we're sure that that's what it's actu... (read more)


My take on the scaled-up models exhibiting the same behaviours feels more banal - larger models are better at simulating agentic processes and their connection to self-preservation desires etc, so the effect is more pronounced. Same cause, different routes getting there with RLHF and scale.

1 Sam Marks 4mo
This, broadly-speaking, is also my best guess, but I'd rather phrase it as: larger LMs are better at making the personas they imitate "realistic" (in the sense of being more similar to the personas you encounter when reading webtext). So doing RLHF on a larger LM results in getting an imitation of a more realistic useful persona. And for the helpful chatbot persona that Anthropic's language model was imitating, one correlate of being more realistic was preferring not to be shut down. (This doesn't obviously explain the results on sycophancy. I think for that I need to propose a different mechanism, which is that larger LMs were better able to infer their interlocutor's preferences, so that sycophancy only became possible at larger scales. I realize that to the extent this story differs from other stories people tell to explain Anthropic's findings, that means this story gets a complexity penalty.)

I wasn't really focusing on the RL part of RLHF in making the claim that it makes the "agentic personas" problem worse, if that's what you meant. I'm pretty on board with the idea that the actual effects of using RL as opposed to supervised fine-tuning won't be apparent until we use stronger RL or something. Then I expect we'll get even weirder effects, like separate agentic heads or the model itself becoming something other than a simulator (which I discuss in a section of the linked post).

My claim is pretty similar to how you put it - in RLHF as in fine-... (read more)

Thanks for this post! I wanted to write a post about my disagreements with RLHF in a couple weeks, but your treatment is much more comprehensive than what I had in mind, and from a more informed standpoint.

I want to explain my position on a couple points in particular though - they would've been a central focus of what I imagined my post to be, points around which I've been thinking a lot recently. I haven't talked to a lot of people about this explicitly so I don't have high credence in my take, but it seems at least worth clarifying.

RLHF is less safe tha

... (read more)

I think Janus' post on mode collapse is basically just pointing out that models lose entropy across a wide range of domains. That's clearly true and intentional, and you can't get entropy back just by turning up temperature. The other implications about how RLHF changes behavior seem either to come from cherry-picked and misleading examples, or not to be backed by data or stated explicitly.

So, using these models now comes with the risk that when we really need them to work for pretty hard tasks, we don't have the useful safety measures imp

... (read more)
Janus' post on mode collapse is about text-davinci-002, which was trained using supervised fine-tuning on high-quality human-written examples (FeedME), not RLHF. It's evidence that supervised fine-tuning can lead to weird output, not evidence about what RLHF does. I haven't seen evidence that RLHF'd text-davinci-003 appears less safe compared to the imitation-based text-davinci-002.
3 Evan R. Murphy 4mo
Glad to see both the OP as well as the parent comment. I wanted to clarify something I disagreed with in the parent comment as well as in a sibling comment from Sam Marks about the Anthropic paper "Discovering Language Model Behaviors with Model-Written Evaluations": Both of these points seem to suggest that the main takeaway from the Anthropic paper was to uncover concerning behaviours in RLHF language models. That's true, but I think it's just as important that the paper also found pretty much the same concerning behaviours in plain pre-trained LLMs that did not undergo RLHF training, once those models were scaled up to a large enough size.
3 Sam Marks 4mo
Regarding your points on agentic simulacra (which I assume means "agentic personas the language model ends up imitating"): 1) My best guess about why Anthropic's model expressed self-preservation desires is the same as yours: the model was trying to imitate some relatively coherent persona, this persona was agentic, and so it was more likely to express self-preservation desires. 2) But I'm pretty skeptical about your intuition that RLHF makes the "imitating agentic personas" problem worse. When people I've spoken to talk about conditioning-based alternatives to RLHF that produce a chatbot like the one in Anthropic's paper, they usually mean either: (a) prompt engineering; or (b) having the model produce a bunch of outputs, annotating the outputs with how much we liked them, retraining the model on the annotated data, and conditioning the model to produce outputs like the ones we most liked. (For example, we could prefix all of the best outputs with the token "GOOD" and then ask the model to produce outputs which start with "GOOD".) Approach (b) really doesn't seem like it will result in less agentic personas, since I imagine that imitating the best outputs will result in imitating an agentic persona just as much as fine-tuning for good outputs with a policy gradient method would. (Main intuition here: the best outputs you get from the pretrained model will already look like they were written by an agentic persona, because those outputs were produced by the pretrained model getting lucky and imitating a useful persona on that rollout, and the usefulness of a persona is correlated with its agency.) I am mostly skeptical that approach (a) will be able to produce anything as useful as Anthropic's chatbot. But to the extent that it can, I imagine that it will do so by eliciting a particular useful persona, which I have no reason to think will be more or less agentic than the one we got via RLHF. Interested to hear if you have other intuitions here.
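To make approach (b) above concrete, here is a minimal sketch of the data-preparation step, assuming outputs have already been sampled and rated. The `BAD` token, the numeric ratings, the threshold, and the function names are all hypothetical choices for illustration; only the "GOOD" prefix comes from the comment, and the retraining step itself is omitted.

```python
from typing import List, Tuple

GOOD_TOKEN = "GOOD"  # control token from the comment's example
BAD_TOKEN = "BAD"    # hypothetical counterpart, not in the original comment


def annotate(samples: List[Tuple[str, float]], threshold: float) -> List[str]:
    """Prefix each sampled output with a control token based on its rating."""
    annotated = []
    for text, rating in samples:
        token = GOOD_TOKEN if rating >= threshold else BAD_TOKEN
        annotated.append(f"{token} {text}")
    return annotated


def conditioning_prompt(user_prompt: str) -> str:
    """Build an inference-time prompt asking for an output that starts with GOOD."""
    return f"{user_prompt}\n{GOOD_TOKEN} "


# Made-up (output, rating) pairs standing in for rated model samples.
samples = [("Sure, here is a summary.", 0.9), ("I refuse.", 0.2)]
training_texts = annotate(samples, threshold=0.5)
```

The model would then be retrained on `training_texts` and queried with `conditioning_prompt(...)`, so that it continues from the "GOOD" prefix rather than being pushed toward good outputs by a policy gradient.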
One consequence downstream of this that seems important to me in the limit: 1. Nonconditioning fine-tuned predictor models make biased predictions. If those biases happen to take the form of a misaligned agent, the model itself is fighting you. 2. Conditioned predictor models make unbiased predictions. The conditioned sequence could still represent a misaligned agent, but the model itself is not fighting you. I think having that one extra layer of buffer provided by 2 is actually very valuable. A goal agnostic model (absent strong gradient hacking) seems more amenable to honest and authentic intermediate reporting and to direct mechanistic interpretation.

Do you think the default is that we'll end up with a bunch of separate things that look like internalized objectives so that the one used for planning can't really be identified mechanistically as such, or that only processes where they're really useful would learn them and that there would be multiple of them (or a third thing)? In the latter case I think the same underlying idea still applies - figuring out all of them seems pretty useful.

Yeah, this is definitely something I consider plausible. But I don't have a strong stance because RL mechanics could lead to there being an internal search process for toy models (unless this is just my lack of awareness of some work that proves otherwise). That said, I definitely think that work on slightly larger models would be pretty useful and plausibly alleviates this, and is one of the things I'm planning on working on.

3 Alex Turner 5mo
Yeah, IMO the "RL at scale trains search-based mesa optimizers" hypothesis predicts "solving randomly generated mazes via a roughly unitary mesa objective and heuristic search" with reasonable probability, and that seems like a toy domain to me.

This is cool! Ways to practically implement something like RAT felt like a roadblock in how tractable those approaches were.

I think I'm missing something here: Even if the model isn't actively deceptive, why wouldn't this kind of training provide optimization pressure toward making the Agent's internals more encrypted? That seems like a way to be robust against this kind of attack without a convenient early circuit to target.

1 Stephen Casper 2mo
In general, I think not. The agent could only make this actively happen to the extent that their internal activations were known to them and able to be actively manipulated by them. This is not impossible, but gradient hacking is a significant challenge. In most learning formalisms such as ERM or solving MDPs, the model's internals are not modeled as a part of the actual algorithm. They're just implementational substrate.

I think OpenAI's approach to "use AI to aid AI alignment" is pretty bad, but not for the broader reason you give here.

I think of most of the value from that strategy as downweighting probability for some bad properties - in the conditioning LLMs to accelerate alignment approach, we have to deal with preserving myopia under RL, deceptive simulacra, human feedback fucking up our prior, etc, but there's less probability of adversarial dynamics from the simulator because of myopia, there are potentially easier channels to elicit the model's ontology, we can tr... (read more)

I like this post! It clarifies a few things I was confused on about your agenda and the progress you describe sounds pretty damn promising, although I only have intuitions here about how everything ties together.

In the interest of making my abstract intuition here more precise, a few weird questions:

Put all that together, extrapolate, and my 40% confidence guess is that over the next 1-2 years the field of alignment will converge toward primarily working on decoding the internal language of neural nets. That will naturally solidify into a paradigm involvin

... (read more)

generate greentexts from the perspective of the attorney hired by LaMDA through Blake Lemoine

The complete generated story here is glorious, and I think might deserve explicit inclusion in another post or something.  Though I think that of the other stories you've generated as well, so maybe my take here is just to have more deranged meta GPT posting.

it seems to point at an algorithmic difference between self-supervised pretrained models and the same models after a comparatively small amount of optimization from the RLHF training process which s

... (read more)

Sorry for the (very) late reply!

I'm not very familiar with the phrasing of that kind of conditioning - are you describing finetuning, with the divide mentioned here?  If so, I have a comment there about why I think it might not really be qualitatively different.

I think my picture is slightly different for how self-fulfilling prophecies could occur.  For one, I'm not using "inner alignment failure" here to refer to a mesa-optimizer in the traditional sense of the AI trying to achieve optimal loss (I agree that in that case it'd probably be the out... (read more)

Sorry for the (very) late reply!

Do you have a link to the ELK proposal you're referring to here?

Yep, here.  I linked to it in a footnote, didn't want redundancy in links, but probably should have anyway.

"Realistically this would result in a mesa-optimizer" seems like an overly confident statement? It might result in a mesa-optimizer, but unless I've missed something then most of our expectation of emergent mesa-optimizers is theoretical at this point.

Hmm, I was thinking of that under the frame of the future point where we'd worry about mesa-optimizers... (read more)

Sorry for the (very) late reply!

I think (to the extent there is a problem) the problem is alleviated by training on "predict tomorrow's headline given today's" and related tasks (e.g. "predict the next frame of video from the last"). That forces the model to engage more directly with the relationship between events separated in time by known amounts.

Hmm, I was thinking more of a problem with text available in the training datasets not being representative of the real world we live in (either because it isn't enough information to pick out our world from a ... (read more)

Thanks for the feedback!

I agree that there's lots of room for more detail - originally I'd planned for this to be even longer, but it started to get too bloated. Some of the claims I make here unfortunately do lean on some of that shared context yeah, although I'm definitely not ruling out the possibility that I just made mistakes at certain points.

  • I think when I talk about conditioning in the post I'm referring to prompting, unless I'm misunderstanding what you mean by conditioning on latent states for language models (which is entirely possible).
  • That's a ver
... (read more)
2 Charlie Steiner 1y
Re: prompting: So when you talk about "simulating a world," or "describing some property of a world," I interpreted that as conditionalizing on a feature of the AI's latent model of the world, rather than just giving it a prompt like "You are a very smart and human-aligned researcher." This latter deviates from the former in some pretty important ways, which should probably be considered when evaluating the safety of outputs from generative models. Re: prophecies: I mean that your training procedure doesn't give an AI an incentive to make self-fulfilling prophecies. I think you have a picture where an AI with inner alignment failure might choose outputs that are optimal according to the loss function but lead to bad real-world consequences, and that these outputs would look like self-fulfilling prophecies because that's a way to be accurate while still having degrees of freedom about how to affect the world. I'm saying that the training loss just cares about next-word accuracy, not long term accuracy according to the latent model of the world, and so AI with inner alignment failure might choose outputs that are highly probable according to next word accuracy but lead to bad real-world consequences, and that these outputs would not look like self-fulfilling prophecies.

While reading through the report I made a lot of notes about stuff that wasn't clear to me, so I'm copying here the ones that weren't resolved after finishing it.  Since they were written while reading, a lot of these may be either obvious or nitpick-y.

Footnote 14, page 15:

Though we do believe that messiness may quantitatively change when problems occur. As a caricature, if we had a method that worked as long as the predictor's Bayes net had fewer than 10^9 parameters, it might end up working for a realistic messy AI until it had 10^12 parameters, since

... (read more)

I think I'm missing something with the Löb's theorem example.

If  can be proved under the theorem, then can't  also be proved?  What's the cause of the asymmetry that privileges taking $5 in all scenarios where you're allowed to search for proofs for a long time?

2 Abram Demski 2y
Agreed. The asymmetry needs to come from the source code for the agent. In the simple version I gave, the asymmetry comes from the fact that the agent checks for a proof that x>y before checking for a proof that y>x. If this were reversed, then as you said, the Löbian reasoning would make the agent take the 10, instead of the 5. In a less simple version, this could be implicit in the proof search procedure. For example, the agent could wait for any proof of the conclusion x>y or y>x, and make a decision based on whichever happened first. Then there would not be an obvious asymmetry. Yet, the proof search has to go in some order. So the agent design will introduce an asymmetry in one direction or the other. And when building theorem provers, you're not usually thinking about what influence the proof order might have on which theorems are actually true; you usually think of the proofs as this static thing which you're searching through. So it would be easy to mistakenly use a theorem prover which just so happens to favor 5 over 10 in the proof search.
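The role of check order can be illustrated with a toy sketch. This is not a real proof search: `lobian_prover` is a hypothetical stub that crudely models the self-fulfilling character of the Löbian argument by treating whichever conclusion the agent asks about as provable. The mapping of "x>y" to taking the 5 and "y>x" to taking the 10 follows the reply above; all names are made up for illustration.

```python
from typing import Callable, List, Optional, Tuple


def lobian_prover(statement: str) -> bool:
    # Toy assumption: either conclusion the agent would act on counts as
    # "provable", standing in for the self-fulfilling Löbian proof.
    return statement in {"x>y", "y>x"}


def agent(checks: List[Tuple[str, int]],
          prove: Callable[[str], bool]) -> Optional[int]:
    """Act on the first statement in `checks` for which a proof is found."""
    for statement, action in checks:
        if prove(statement):
            return action
    return None  # no proof found for any candidate statement


# Same prover, two agents that differ only in the order of their checks.
take_5_first = [("x>y", 5), ("y>x", 10)]
take_10_first = [("y>x", 10), ("x>y", 5)]

choice_a = agent(take_5_first, lobian_prover)
choice_b = agent(take_10_first, lobian_prover)
```

Under this stub, the two agents take different actions purely because of the order in which they look for proofs, which is the asymmetry the reply attributes to the agent's source code.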