My take on the scaled-up models exhibiting the same behaviours feels more banal: larger models are better at simulating agentic processes and their connection to things like self-preservation desires, so the effect is more pronounced. Same cause, with RLHF and scale as different routes to getting there.
Thanks for this post! I'd been planning to write a post about my disagreements with RLHF in a couple of weeks, but your treatment is much more comprehensive than what I had in mind, and comes from a more informed standpoint.
I want to explain my position on a couple of points in particular, though - they would've been a central focus of the post I had imagined, and they're points I've been thinking about a lot recently. I haven't talked to many people about this explicitly, so I don't have high credence in my take, but it seems at least worth clarifying.
> RLHF is less safe tha…
Do you think the default is that we'll end up with a bunch of separate things that look like internalized objectives, so that the one actually used for planning can't really be identified mechanistically as such? Or that only processes where they're really useful would learn them, and that there would be multiple of them (or some third thing)? In the latter case I think the same underlying idea still applies - figuring out all of them seems pretty useful.
Yeah, this is definitely something I consider plausible. But I don't have a strong stance, because RL mechanics could lead to an internal search process even in toy models (unless this is just my lack of awareness of some work proving otherwise). That said, I definitely think that work on slightly larger models would be pretty useful and would plausibly alleviate this; it's one of the things I'm planning to work on.
This is cool! The lack of ways to practically implement something like RAT felt like a roadblock to how tractable those approaches were.
I think I'm missing something here: Even if the model isn't actively deceptive, why wouldn't this kind of training provide optimization pressure toward making the Agent's internals more encrypted? That seems like a way to be robust against this kind of attack without a convenient early circuit to target.
I think OpenAI's approach to "use AI to aid AI alignment" is pretty bad, but not for the broader reason you give here.
I think of most of the value from that strategy as downweighting the probability of some bad properties - in the conditioning-LLMs-to-accelerate-alignment approach, we have to deal with preserving myopia under RL, deceptive simulacra, human feedback fucking up our prior, etc., but there's less probability of adversarial dynamics from the simulator because of myopia, there are potentially easier channels for eliciting the model's ontology, we can tr…
I like this post! It clarifies a few things I was confused about regarding your agenda, and the progress you describe sounds pretty damn promising, although I only have intuitions here about how everything ties together.
In the interest of making my abstract intuition here more precise, a few weird questions:
> Put all that together, extrapolate, and my 40% confidence guess is that over the next 1-2 years the field of alignment will converge toward primarily working on decoding the internal language of neural nets. That will naturally solidify into a paradigm involvin…
> generate greentexts from the perspective of the attorney hired by LaMDA through Blake Lemoine
The complete generated story here is glorious, and I think it might deserve explicit inclusion in another post or something. Though I think that of the other stories you've generated as well, so maybe my real take here is just that there should be more deranged meta-GPT posting.
> it seems to point at an algorithmic difference between self-supervised pretrained models and the same models after a comparatively small amount of optimization from the RLHF training process which s…
Sorry for the (very) late reply!
I'm not very familiar with the phrasing of that kind of conditioning - are you describing finetuning, with the divide mentioned here? If so, I have a comment there about why I think it might not really be qualitatively different.
I think my picture is slightly different for how self-fulfilling prophecies could occur. For one, I'm not using "inner alignment failure" here to refer to a mesa-optimizer in the traditional sense of the AI trying to achieve optimal loss (I agree that in that case it'd probably be the out…
Do you have a link to the ELK proposal you're referring to here?
Yep, here. I linked to it in a footnote and didn't want redundant links, but probably should have included it here anyway.
"Realistically this would result in a mesa-optimizer" seems like an overly confident statement? It might result in a mesa-optimizer, but unless I've missed something then most of our expectation of emergent mesa-optimizers is theoretical at this point.
Hmm, I was thinking of that under the frame of the future point where we'd worry about mesa-optimizers…
I think (to the extent there is a problem) the problem is alleviated by training on "predict tomorrow's headline given today's" and related tasks (e.g. "predict the next frame of video from the last"). That forces the model to engage more directly with the relationship between events separated in time by known amounts.
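Concretely, the kind of task construction described here could look something like the following (a minimal sketch; the function name, data format, and toy examples are all hypothetical, just for illustration):

```python
from datetime import date, timedelta

def make_next_day_pairs(dated_headlines):
    """Pair each day's headlines with the following day's, producing
    "predict tomorrow's headline given today's" training examples."""
    # Group headlines by date.
    by_date = {}
    for d, headline in dated_headlines:
        by_date.setdefault(d, []).append(headline)

    # Emit (today, tomorrow) pairs wherever consecutive days exist.
    examples = []
    for d, todays in by_date.items():
        tomorrows = by_date.get(d + timedelta(days=1))
        if not tomorrows:
            continue  # no data for the following day; skip
        for today in todays:
            for tomorrow in tomorrows:
                examples.append({"prompt": today, "completion": tomorrow})
    return examples

# Toy usage:
pairs = make_next_day_pairs([
    (date(2022, 9, 1), "Markets rally on jobs report"),
    (date(2022, 9, 2), "Rally fades as rate fears return"),
])
```

The same pairing trick extends to other modalities (e.g. consecutive video frames) whenever the data carries timestamps with known separation.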
Hmm, I was thinking more of a problem with text available in the training datasets not being representative of the real world we live in (either because it isn't enough information to pick out our world from a…
Thanks for the feedback!
I agree that there's lots of room for more detail - originally I'd planned for this to be even longer, but it started to get too bloated. Some of the claims I make here do unfortunately lean on that shared context, yeah, although I'm definitely not ruling out the possibility that I just made mistakes at certain points.
While reading through the report I made a lot of notes about stuff that wasn't clear to me, so I'm copying here the ones that weren't resolved by the time I finished it. Since they were written while reading, a lot of these may be either obvious or nitpicky.
Footnote 14, page 15:
> Though we do believe that messiness may quantitatively change when problems occur. As a caricature, if we had a method that worked as long as the predictor's Bayes net had fewer than 10^9 parameters, it might end up working for a realistic messy AI until it had 10^12 parameters, since…
I think I'm missing something with the Löb's theorem example.
If [A() = 5 ⇒ U() = 5] ∧ [A() = 10 ⇒ U() = 0] can be proved under the theorem, then can't [A() = 10 ⇒ U() = 10] ∧ [A() = 5 ⇒ U() = 0] also be proved? What's the cause of the asymmetry that privileges taking the $5 in all scenarios where you're allowed to search for proofs for a long time?
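For concreteness, here's the setup as I understand it (my reconstruction of the standard 5-and-10 presentation, so the exact formulation in the post may differ):

```latex
% The environment pays $10 for taking the $10 and $5 for taking the $5,
% and the agent takes the $5 exactly when its proof search finds a proof of
%   S := [A() = 5 ⇒ U() = 5] ∧ [A() = 10 ⇒ U() = 0]
\[
U() = \begin{cases} 10 & \text{if } A() = 10 \\ 5 & \text{if } A() = 5 \end{cases}
\qquad
A() = \begin{cases} 5 & \text{if a proof of } S \text{ is found} \\ 10 & \text{otherwise} \end{cases}
\]
% The Löbian step: the theory proves □S → S, because if S is provable the
% agent finds the proof (assuming the search is exhaustive enough) and
% returns 5, so U() = 5; the first conjunct is then true and the second
% holds vacuously. Löb's theorem converts this into an outright proof:
\[
\vdash \Box S \rightarrow S
\quad \text{and hence, by Löb's theorem,} \quad
\vdash S,
\]
% so the agent actually takes the $5.
```

My question is then why the mirrored sentence S′ := [A() = 10 ⇒ U() = 10] ∧ [A() = 5 ⇒ U() = 0] doesn't support the same argument.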