Untrusted smart models and trusted dumb models

[-]Zach Stein-Perlman1y54Review for 2023 Review

This early control post introduced super important ideas: trusted monitoring plus the general point

if you think about approaches to safely using AIs that are robust to deceptive alignment and which can be used right now, a core dynamic is trying to use your dumb trusted models together with your limited access to trusted humans to make it hard for your untrusted smart models to cause problems.

[-]Dan Braun2y44

Nice post.

Pushing back a little on this part of the appendix:

Also, unlike many other capabilities which we might want to evaluate, we don’t need to worry about the possibility that even though the model can’t do this unassisted, it can do it with improved scaffolding--the central deceptive alignment threat model requires the model to think its strategy through in a forward pass, and so if our model can’t be fine-tuned to answer these questions, we’re probably safe.

I'm a bit concerned about people assuming this is true for models going forward. A sufficiently intelligent RL-trained model can learn to distribute its planning across multiple forward passes. I think your claim is true for models trained purely on next-token prediction, and for GPT4-level models which, even though they have an RL component in their training, their outputs are all human-understandable (and incorporated into the oversight process).

But even 12 months from now I’m unsure how confident you could be in this claim for frontier models. Hopefully, labs are dissuaded from producing models which can use uninterpretable scratch pads given how much more dangerous they would be and harder to evaluate.

[-]Buck2y10

Yep, I agree that we need to avoid secret scratchpads for this to hold; good point. If we don't, we get back to the position we are usually in for capability evaluations.

[-]Adam Scholl1y10

I'm curious if "trusted" in this sense basically just means "aligned"—or like, the superset of that which also includes "unaligned yet too dumb to cause harm" and "unaligned yet prevented from causing harm"—or whether you mean something more specific? E.g., are you imagining that some powerful unconstrained systems are trusted yet unaligned, or vice versa?

[-]Buck1y30

I mostly mean "we are sure that it isn't egregiously unaligned and thus treating us adversarially". So models can be aligned but untrusted (if they're capable enough that we believe they could be schemers, but they aren't actually schemers). There shouldn't be models that are trusted but unaligned.

Everywhere I wrote "unaligned" here, I meant the fairly specific thing of "trying to defeat our safety measures so as to grab power", which is not the only way the word "aligned" is used.

[-]RogerDearnaley2y*10

E.g. perhaps pretrained LLMs have no chance of being deceptively aligned

Pretrained LLMs are trained to simulate generation of human-derived text from the internet. Humans are frequently deceptively aligned. For example, at work, I make a point of seeming (mildly) more aligned with my employer's goals than I actually am (just like everyone else working for the company). So sufficiently capable pretrained LLMs will inevitably have already picked up deceptive alignment behavior from learning to simulate humans. So they don't need to be sufficiently capable to figure out deceptive unaligned behavior for themselves during a forward pass.

This is a specific example of a general problem with LLMs: they don't need to be capable enough to discover unaligned convergent goals if they can have these pretrained into them by learning to simulate us. To give another worrying example: O(2%) of humans are sociopaths, so there is going to be a nontrivial admixture of writing by sociopaths (mostly ones actively concealing this fact) in the training set of any pretrained LLM. [The moral is: always read the ingredients label for your shoggoth :-)]

AI ALIGNMENT FORUM
AF

AI ALIGNMENT FORUM
AF

49

Untrusted smart models and trusted dumb models

49