My name is Alex Turner. I'm a research scientist at Google DeepMind on the Scalable Alignment team. My views are strictly my own; I do not represent Google. Reach me at alex[at]turntrout.com
I didn't leave it as a "simple" to-do, but rather an offer to collaboratively hash something out.
That said: If people don't even know what it would look like when they see it, how can one update on evidence? What is Nate looking at which tells him that GPT doesn't "want things in a behavioralist sense"? (I bet he's looking at something real to him, and I bet he could figure it out if he tried!)
I claim we are many scientific insights away from being able to talk about these questions at the level of precision necessary to make predictions like this.
To be clear, I'm not talking about formalizing the boundary. I'm talking about a bet between people, adjudicated by people.
(EDIT: I'm fine with a low sensitivity, high specificity outcome -- we leave it unresolved if it's ambiguous / not totally obvious relative to the loose criteria we settled on. Also, the criterion could include randomly polling n alignment / AI people and asking them how "behaviorally-wanting" the system seemed on a Likert scale. I don't think you need fundamental insights for that to work.)
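(To make that resolution rule concrete, here's a minimal sketch of what the Likert-poll adjudication could look like. The 1-7 scale, the thresholds, and the function names are all mine -- one possible operationalization, not the actual terms of any bet.)

```python
# Minimal sketch of a "low sensitivity, high specificity" resolution rule:
# poll n people on a 1-7 Likert scale and only resolve the bet when the
# median answer is clearly on one side. Thresholds here are illustrative.
import statistics

def resolve_bet(likert_ratings, yes_threshold=6, no_threshold=2):
    """likert_ratings: answers to 'how behaviorally-wanting did the system seem?' (1-7)."""
    median = statistics.median(likert_ratings)
    if median >= yes_threshold:
        return "resolves YES"
    if median <= no_threshold:
        return "resolves NO"
    return "unresolved (ambiguous relative to the loose criteria)"

print(resolve_bet([6, 7, 6, 5, 7]))  # resolves YES
print(resolve_bet([3, 4, 5, 4, 3]))  # unresolved
```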
Closest to the third, but I'd put it somewhere between .1% and 5%. I think 15% is way too high for some loose speculation about inductive biases, relative to the specificity of the predictions themselves.
(I think you're still playing into an incorrect frame by talking about "simplicity" or "speed biases.")
I mean, certainly there is a strong pressure to do well in training—that's the whole point of training.
I disagree. This claim seems to be precisely what my original comment was critiquing:
It seems to me like you're positing some "need to do well in training", which is... a kinda weird frame. In a weak correlational sense, it's true that loss tends to decrease over training-time and research-time.
But I disagree with an assumption of "in the future, the trained model will need to achieve loss scores so perfectly low that the model has to e.g. 'try to do well' or reason about its own training process in order to get a low enough loss, otherwise the model gets 'selected against'."
And then you wrote, as some things you believe:[1]
The model needs to figure out how to somehow output a distribution that does well in training...
Doing that is quite hard for the distributions that we care about and requires a ton of cognition and reasoning in any situation where you don't just get complete memorization (which is highly unlikely under the inductive biases)...
Both the deceptive and sycophantic models involve directly reasoning about the training process internally to figure out how to do well on it. The aligned model likely also requires some reasoning about the training process, but only indirectly due to understanding the world being important and the training process being a part of the world.
This is the kind of claim I was critiquing in my original comment!
but you shouldn't equate them into one group and say it's a motte-and-bailey. Different people just think different things.
My model of you makes both claims, and I indeed see signs of both kinds of claims in your comments here.
Thanks for writing out a bunch of things you believe, by the way! That was helpful.
We know that SGD is selecting for models based on some combination of loss and inductive biases, but we don't know the exact tradeoff.
Actually, I've thought more, and I don't think that this dual-optimization perspective makes it better. I deny that we "know" that SGD is selecting on that combination, in the sense which seems to be required for your arguments to go through.
It sounds to me like I said "here's why you can't think about 'what gets low loss'" and then you said[1] "but what if I also think about certain inductive biases too?" and then you also said "we know that it's OK to think about it this way." No, I contend that we don't know that. That was a big part of what I was critiquing.
As an alert: it feels like your response here isn't engaging with the points I raised in my original comment. I expect I talked past you, and that, accordingly, you haven't seen the thing I'm trying to point at.
this isn't a quote, this is just how your comment parsed to me
Deceptive alignment seems to only be supported by flimsy arguments. I recently realized that I don't have good reason to believe that continuing to scale up LLMs will lead to inner consequentialist cognition which pursues a goal that is roughly consistent across situations. That is: a model which not only does what you ask it to do (including coming up with agentic plans), but also thinks about how to make more paperclips even while you're just asking about math homework.
Aside: This was kinda a "holy shit" moment, and I'll try to do it justice here. I encourage the reader to do a serious dependency check on their beliefs. What do you think you know about deceptive alignment being plausible, and why do you think you know it? Where did your beliefs truly come from, and do those observations truly provide evidence for deceptive alignment?
I agree that conditional on entraining consequentialist cognition which has a "different goal" (as thought of by MIRI; this isn't a frame I use), the AI will probably instrumentally reason about whether and how to deceptively pursue its own goals, to our detriment.
I contest that there's very little reason to expect "undesired, covert, and consistent-across-situations inner goals" to crop up in LLMs to begin with. An example alternative prediction is:
LLMs will continue doing what they're told. They learn contextual goal-directed capabilities, but only apply them narrowly, in certain contexts, for a range of goals (e.g. thinking about how to win a strategy game). They also memorize a lot of random data (instead of deriving some theory which simply explains their historical training data a la Solomonoff Induction).
Not only is this performant, it seems to be what we actually observe today. The AI can pursue goals when prompted to do so, but it isn't pursuing them on its own. It basically follows instructions in a reasonable way, just like GPT-4 usually does.
Why should we believe the "consistent-across-situations inner goals -> deceptive alignment" mechanistic claim about how SGD works? Here are the main arguments I'm aware of:
Without deceptive alignment/agentic AI opposition, a lot of alignment threat models ring hollow. No more adversarial steganography or adversarial pressure on your grading scheme or worst-case analysis or unobservable, nearly unfalsifiable inner homunculi whose goals have to be perfected.
Instead, we enter the realm of tool AI[2] which basically does what you say.[3] I think that world's a lot friendlier, even though there are still some challenges I'm worried about -- like an AI being scaffolded into pursuing consistent goals. (I think that's a substantially different risk regime, though.)
(Even though this predicted mechanistic structure doesn't have any apparent manifestation in current reality.)
Tool AI which can be purposefully scaffolded into agentic systems, which somewhat handles objections from Amdahl's law.
This is what we actually have today, in reality. In these setups, the agency comes from the system of subroutine calls to the LLM during e.g. a plan/critique/execute/evaluate loop a la AutoGPT.
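To gesture at what I mean by the agency living in the scaffolding rather than in the model, here's a minimal sketch of a plan/critique/execute/evaluate loop. `call_llm`, the prompts, and the stopping rule are placeholders I made up, not any particular framework's API; the point is that the persistent goal sits in the outer loop, not in the network's weights.

```python
# Minimal sketch of an AutoGPT-style plan/critique/execute/evaluate loop.
# `call_llm` is a stand-in for whatever completion API you use; the prompts
# and step budget are illustrative, not taken from any real framework.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model API here")

def run_agent_loop(goal: str, max_steps: int = 10) -> str:
    history = []
    for _ in range(max_steps):
        # 1. Plan: the model proposes a next action toward the goal.
        plan = call_llm(f"Goal: {goal}\nHistory: {history}\nPropose the next action.")
        # 2. Critique: a second call reviews the proposed action.
        critique = call_llm(f"Critique this proposed action: {plan}")
        # 3. Execute: here "execution" is just another model call; a real
        #    system would dispatch to tools (browser, code runner, etc.).
        result = call_llm(
            f"Carry out this action, taking the critique into account.\n"
            f"Action: {plan}\nCritique: {critique}"
        )
        history.append((plan, critique, result))
        # 4. Evaluate: decide whether the goal has been reached.
        done = call_llm(f"Goal: {goal}\nLatest result: {result}\nIs the goal complete? yes/no")
        if done.strip().lower().startswith("yes"):
            break
    return history[-1][2] if history else ""
```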
It could learn to generalize based on color or it could learn to generalize based on shape. And which one we get is just a question of which one is simpler and easier for gradient descent to implement and which one is preferred by inductive biases, they both do equivalently well in training, but you know, one of them consistently is always the one that gradient descent finds, which in this situation is the color detector.
As an aside, I think this is more about the data than about "how easy is it to implement." Specifically, ANNs generalize based on texture because of the random crop augmentations. The crops are generally so small that there isn't a persistent shape during training, but there is a persistent texture for each class, so of course the model has to use the texture. Furthermore, a vision system modeled after primate vision also generalized based on texture, which is additional evidence against ANN-specific architectural biases (like conv layers) explaining the discrepancy.
However, if the crops are made more "natural" (leaving more of the image intact, I think), then classes do tend to have persistent shapes during training. Accordingly, networks reliably learn to generalize based on shapes (just like people do!).
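For concreteness, here's roughly what that difference in augmentation looks like in code. This is a sketch using torchvision; the scale ranges are illustrative guesses at "aggressive" versus "more natural" cropping, not the exact settings from the papers in question.

```python
# Sketch of the two augmentation regimes discussed above, using torchvision.
# Scale ranges are illustrative, not the exact settings from the papers.
from torchvision import transforms

# Aggressive random crops: a crop may cover as little as ~8% of the image,
# so the object's global shape is rarely preserved across training views,
# while local texture is -- pushing the network toward texture features.
aggressive_crops = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.08, 1.0)),
    transforms.ToTensor(),
])

# "More natural" crops: most of each image is kept, so shape remains a
# reliable cue during training and shape-based generalization can win out.
natural_crops = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.ToTensor(),
])
```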
models where the reason it was aligned is because it’s trying to game the training signal.
Isn't the reason it acts aligned supposed to be so that it can later pursue its ulterior motive? If it looks unaligned, the developers won't like that and they'll shut it down. So why do you think the AI is trying to game the training signal directly, instead of just managing the developers' perceptions of it?
Insofar as you're arguing with me for posting this, I... never claimed that that wasn't true?