When we have a good understanding of abstraction, it should also be straightforward to recognize when a distribution shift violates the abstraction. In particular, insofar as abstractions are basically deterministic constraints, we can see when the constraint is violated. And as long as we can detect it, it should be straightforward (though not necessarily easy) to handle it.
If we define deceptive alignment relatively broadly as any situation where “the reason the model looks aligned is because it is actively trying to game the training signal for the purpose of achieving some ulterior goal”...
That's "relatively broad"??? What notion of "deceptive alignment" is narrower than that? Roughly that definition is usually my stock example of a notion of deception which is way too narrow to focus on and misses a bunch of the interesting/probable/less-correlated failure modes (like e.g. the sort of stuff in Worlds Where Iterative Design Fails).
Having as concrete as possible a failure mode to work with is, in my opinion, a really important part of being able to do good research ... Even for transparency and interpretability, perhaps the most obvious “work on the unknown unknowns directly” sort of research, I think it's pretty important to have some idea of what we might want to use those sorts of tools for when developing, and working on concrete failure modes is extremely important to that.
This I agree with, but I think it doesn't go far enough. In my software engineering days, one of the main heuristics I recommended was: when building a library, you should have a minimum of three use cases in mind. And make them as different as possible, because the library will inevitably end up being shit for any use case way out of the distribution your three use cases covered.
Same applies to research: minimum of three use cases, and make them as different as possible.
AGI safety researchers should focus (only/mostly) on deceptive alignment
I consider this advice actively harmful, and strongly advise to do the opposite.
Let's start at the top:
By deceptive alignment, I mean an AI system that seems aligned to human observers and passes all relevant checks but is, in fact, not aligned...
So far so good. There's obvious reasons why it would make sense to focus exclusively on cases where an AI seems aligned to human observers and passes all relevant checks while still not actually being aligned. After all, in the cases where we can see the problem, we either fix it or at least iterate until we can't see a problem any more (at which point we have "deception" by this definition).
But then the post immediately jumps to a far narrower definition of "deceptive alignment":
In Evan’s post, this means that the NN has actively made an incomplete proxy of the true goal a terminal goal. Note, that the AI is aware of the fact that we wanted it to achieve a different goal and therefore actively acts in ways that humans will perceive as aligned.... In other words, for deception to be at play, I assume that the AI is actively adversarial but pretends not to be.
In Evan’s post, this means that the NN has actively made an incomplete proxy of the true goal a terminal goal. Note, that the AI is aware of the fact that we wanted it to achieve a different goal and therefore actively acts in ways that humans will perceive as aligned.
... In other words, for deception to be at play, I assume that the AI is actively adversarial but pretends not to be.
If we look at e.g. the cases in Worlds Where Iterative Design Fails, most of them fit the more-general definition, yet none necessarily involve an AI which is "actively adversarial but pretends not to be". And that's exactly the sort of mistake people make when they focus exclusively on a single failure mode: they end up picturing a much narrower set of possibilities than the argument for focusing on that failure mode actually assumes.
Now, an oversight like that undermines the case for "focus only/mostly on deceptive alignment", but doesn't make it very actively harmful. The reason it's actively harmful is unknown unknowns.
Claim: the single most confident prediction we can make about AGI is that there will be surprises. There will be unknown unknowns. There will be problems we do not currently see coming.
The thing which determines humanity's survival will not be whether we solve alignment in whatever very specific world, or handful of worlds, we imagine to be most probable. What determines humanity's survival will be whether our solutions generalize widely enough to handle the things the world actually throws at us, some of which will definitely be surprises.
How do we build solutions which generalize to handle surprises? Two main ways. First, understanding things deeply and thoroughly enough to enumerate every single assumption we've made within some subcomponent (i.e. mathematical proofs). That's great when we can do it, but it will not cover everything or even most of the attack surface in practice. So, the second way to build solutions which generalize to handle surprises: plan for a wide variety of scenarios. Planning for a single scenario - like e.g. an inner agent emerging during training which is actively adversarial but pretends not to be - is a recipe for generalization failure once the worlds starts throwing surprises at us.
At this point I'm going to speculate a bit about your own thought process which led to this post; obviously such speculation can easily miss completely and you should feel free to tell me I'm way off.
First, I notice that nowhere in this post do you actually compare deceptive alignment to anything else I'd consider an important-for-research alignment failure mode (like fast takeoff, capability gain in deployment, getting what we measure, etc). You just argue that (a rather narrow version of) deception is important, not that it's more important than any of the other failure modes I actually think about. I also notice in the "implications" section:
I should ask “how does this help with deceptive alignment?” before starting a new project. In retrospect, I haven’t done that a lot and I think most of my projects, therefore, have not contributed a lot to this question.
What this sounds like to me is that you did not previously have any realistic model of how/why AI would be dangerous. I'm guessing that you were previously only thinking about problems which could be fixed by iterative design - i.e. seeing what goes wrong and then updating the design accordingly. Probably (a narrow version of) deception is the first scenario where you've realized that doesn't work, and you haven't yet thought of other ways for an iterative design cycle to fail to produce aligned AI.
So my advice would be to brainstorm other things "deception" (in the most general sense) could look like, or other ways the iterative design cycle could fail to produce aligned AI, and try to aim your brainstorming at scenarios which are as different as possible from the things you've already thought of.
Yeah, I'm familiar with privileged bases. Once we generalize to a whole privileged coordinate system, the RELUs are no longer enough.
Isotropy of the initialization distribution still applies, but the key is that we only get to pick one rotation for the parameters, and that same rotation has to be used for all data points. That constraint is baked in to the framing when thinking about privileged bases, but it has to be derived when thinking about privileged coordinate systems.
I chose the "train a shoulder advisor" framing specifically to keep my/Eliezer's models separate from the participants' own models. And I do think this worked pretty well - I've had multiple conversations with a participant where they say something, I disagree with it, and then they say "yup, that's what my John model said" - implying that they did in fact disagree with their John model. (That's not quite direct evidence of maintaining a separate ontology, but it's adjacent.)
I don't trust my memory to be very reliable here, but here's the path of adjacent ideas which I remember.
I was thinking about a CIRL-style setup. At a high level, the AI receives some messages, it has a prior that the messages were chosen by an agent (i.e. a human) to optimize for some objective, and then the AI uses that info to back out the objective. And I was thinking about how to reconcile this with embeddedness - e.g. if the "agent" is a human, the AI could model it as a system of atoms, and then how does it assign an "objective" to that system of atoms? It might think the system is optimizing for physical action or physical entropy - after all, the system's messages definitely locally maximize those things! Or maybe the AI ends up identifying the entire process of evolution as an "agent", and thinks the messages are chosen (by an imperfect evolutionary optimizer) to maximize fitness. So there's this problem where we somehow need to tell the AI which level of abstraction to use for thinking of the physical system as an "agent", because it can recognize different optimization objectives at different levels.
That was the first time I remember thinking of entropy maximization as sort-of-like an outer optimization objective. And I was already thinking about things like bacteria as agents (even before thinking about alignment), so naturally the idea carried back over to that setting: to separate objective-of-bacteria from objective-of-entropy-maximization or objective-of-evolution or whatever, we need to talk about levels of abstraction and different abstract models of the same underlying system.
After that, I connected the idea to other places. For instance, when thinking about inner misalignment, there's an intuition that embedded inner agents are selected to actively optimize against the outer objective in some sense, because performance-on-the-outer-objective is a scarce resource which the inner agent wants to conserve. And that intuition comes right out of thinking about a bacteria as an embedded inner optimizer in an environment which maximizes physical entropy.
I was assuming manual setup. I don't expect these things to show up spontaneously in the wild, so I'm not actually that interested in doing it on a normal task as a demonstration of a risk mode. Though of course if someone did find a gradient hacker on a normal task, that would be a big update.
Betting markets on these questions would be nice. I'd bid pretty strongly on "nope, basically no path dependence" for most current architectures; replicability already gives us a ton of bits on the question.
This was a cool post, I found the core point interesting. Very similar to gradient hacker design.
As a general approach to avoiding value drift, it does have a couple very big issues (which I'm guessing TurnTrout already understands, but which I'll point out for others). First very big issue: it requires the agent basically decouple its cognition from reality when the relevant reward is applied. That's only useful if the value-drift-inducing events only occur once in a while and are very predictable. If value drift just occurs continuously due to everyday interactions, or if it occurs unpredictably, then the strategy probably can't be implemented without making the agent useless.
Second big issue: it only applies to reward-induced value drift within an RL system. That's not the only setting in which value drift is an issue - for instance, MIRI's work on value drift focused mainly on parent-child value drift in chains of successor AIs. Value drift induced by gradual ontology shifts is another example.
Do you already have a cached notion of why being able to build X involves qualitatively deeper understanding than having a bunch of examples of X on hand? Also, do you already have cached the idea that we currently do not know how to build embedded agents except in simple cases where one or more of the key problems don't apply?