One subtlety which I'd expect is relevant here: when two singular vectors have approximately the same singular value, the two vectors are very numerically unstable (within their span).
Suppose that two singular vectors have the same singular value. Then in the SVD, we have two terms of the form
(where the ui's and vi's are column vectors). That middle part is just the shared singular value s1 times a 2x2 identity matrix:
But the 2x2 identity matrix can be rewritten as a 2x2 rotation R times its inverse RT:
... and then we can group R and RT with U and V, respectively, to rotate the singular vectors:
Since UR and RTV are still orthogonal, the end result is another valid singular vector decomposition of the same matrix.
Upshot: when a singular value is repeated, the singular vectors are defined only up to a rotation (where the dimension of the rotation is the number of repeats of the singular value).
What this means practically/conceptually is that, if two singular vectors have very close singular values, then a small amount of noise in the matrix will typically "mix them together". So for instance, the post shows a plot of singular vectors for the OV matrix, and a whole bunch of the singular values are very close together. Conceptually, that means the corresponding singular vectors are all probably "mixed together" to a large extent. Insofar as they all have roughly-the-same singular value, the singular vectors themselves are underdefined/unstable; what's fully specified is the span of singular vectors with the same singular value.
(In fact, for the singular value distribution shown for the OV matrix in the post, nearly all the singular values are either approximately 10, or approximately 0. So that particular matrix is approximately a projection matrix, and the span of the singular vectors on either side gives the space projected from/to.)
One can argue that the goal-aligned model has an incentive to preserve its goals, which would result in an aligned model after SLT. Since preserving alignment during SLT is largely outsourced to the model itself, arguments for alignment techniques failing during an SLT don't imply that the plan fails...
I think this misses the main failure mode of a sharp left turn. The problem is not that the system abandons its old goals and adopts new goals during a sharp left turn. The problem is that the old goals do not generalize in the way we humans would prefer, as capabilities ramp up. The model keeps pursuing those same old goals, but stops doing what we want because the things we wanted were never optimal for the old goals in the first place. Outsourcing goal-preservation to the model should be fine once capabilities are reasonably strong, but goal-preservation isn't actually the main problem which needs to be solved here.
(Or perhaps you're intentionally ignoring that problem by assuming "goal-alignment"?)
I maybe want to stop saying "explicitly thinking about it" (which brings up associations of conscious vs subconscious thought, and makes it sound like I only mean that "conscious thoughts" have deception in them) and instead say that "the AI system at some point computes some form of 'reason' that the deceptive action would be better than the non-deceptive action, and this then leads further computation to take the deceptive action instead of the non-deceptive action".
I don't quite agree with that as literally stated; a huge part of intelligence is finding heuristics which allow a system to avoid computing anything about worse actions in the first place (i.e. just ruling worse actions out of the search space). So it may not actually compute anything about a non-deceptive action.
But unless that distinction is central to what you're trying to point to here, I think I basically agree with what you're gesturing at.
Two probable cruxes here...
First probable crux: at this point, I think one of my biggest cruxes with a lot of people is that I expect the capability level required to wipe out humanity, or at least permanently de-facto disempower humanity, is not that high. I expect that an AI which is to a +3sd intelligence human as a +3sd intelligence human is to a -2sd intelligence human would probably suffice, assuming copying the AI is much cheaper than building it. (Note: I'm using "intelligence" here to point to something including ability to "actually try" as opposed to symbolically "try", effective mental habits, etc, not just IQ.) If copying is sufficiently cheap relative to building, I wouldn't be surprised if something within the human distribution would suffice.
Central intuition driver there: imagine the difference in effectiveness between someone who responds to a law they don't like by organizing a small protest at their university, vs someone who responds to a law they don't like by figuring out which exact bureaucrat is responsible for implementing that law and making a case directly to that person, or by finding some relevant case law and setting up a lawsuit to limit the disliked law. (That's not even my mental picture of -2sd vs +3sd; I'd think that's more like +1sd vs +3sd. A -2sd usually just reposts a few memes complaining about the law on social media, if they manage to do anything at all.) Now imagine an intelligence which is as much more effective than the "find the right bureaucrat/case law" person, as the "find the right bureaucrat/case law" person is compared to the "protest" person.
Second probable crux: there's two importantly-different notions of "thinking about humans" or "thinking about deceiving humans" here. In the prototypical picture of a misaligned mesaoptimizer deceiving humans for strategic reasons, the AI explicitly backchains from its goal, concludes that humans will shut it down if it doesn't hide its intentions, and therefore explicitly acts to conceal its true intentions. But when the training process contains direct selection pressure for deception (as in RLHF), we should expect to see a different phenomenon: an intelligence with hard-coded, not-necessarily-"explicit" habits/drives/etc which de-facto deceive humans. For example, think about how humans most often deceive other humans: we do it mainly by deceiving ourselves, reframing our experiences and actions in ways which make us look good and then presenting that picture to others. Or, we instinctively behave in more prosocial ways when people are watching than when not, even without explicitly thinking about it. That's the sort of thing I expect to happen in AI, especially if we explicitly train with something like RLHF (and even moreso if we pass a gradient back through deception-detecting interpretability tools).
Is that "explicit deception"? I dunno, it seems like "explicit deception" is drawing the wrong boundary. But when that sort of deception happens, I wouldn't necessarily expect to be able to see deception in an AI's internal thoughts. It's not that it's "thinking about deceiving humans", so much as "thinking in ways which are selected for deceiving humans".
(Note that this is a different picture from the post you linked; I consider this picture more probable to be a problem sooner, though both are possibilities I keep in mind.)
For that part, the weaker assumption I usually use is that AI will end up making lots of big and fast (relative to our ability to meaningfully react) changes to the world, running lots of large real-world systems, etc, simply because it's economically profitable to build AI which does those things. (That's kinda the point of AI, after all.)
In a world where most stuff is run by AI (because it's economically profitable to do so), and there's RLHF-style direct incentives for those AIs to deceive humans... well, that's the starting point to the Getting What You Measure scenario.
Insofar as power-seeking incentives enter the picture, it seems to me like the "minimal assumptions" entry point is not consequentialist reasoning within the AI, but rather economic selection pressures. If we're using lots of AIs to do economically-profitable things, well, AIs which deceive us in power-seeking ways (whether "intentional" or not) will tend to make more profit, and therefore there will be selection pressure for those AIs in the same way that there's selection pressure for profitable companies. Dial up the capabilities and widespread AI use, and that again looks like Getting What We Measure. (Related: the distinction here is basically the AI version of the distinction made in Unconscious Economics.)
I continue to be surprised that people think a misaligned consequentialist intentionally trying to deceive human operators (as a power-seeking instrumental goal specifically) is the most probable failure mode.
To me, Christiano's Get What You Measure scenario looks much more plausible a priori to be "what happens by default". For instance: why expect that we need a multi-step story about consequentialism and power-seeking in order to deceive humans, when RLHF already directly selects for deceptive actions? Why additionally assume that we need consequentialist reasoning, or that power-seeking has to kick in and incentivize deception over and above whatever competing incentives might be present? Why assume all that, when RLHF already selects for actions which deceive humans in practice even in the absence of consequentialism?
Or, another angle: the diagram in this post starts from "specification gaming" and "goal misgeneralization". If we just start from prototypical examples of those failure modes, don't assume anything more than that, and ask what kind of AGI failure the most prototypical versions of those failure modes lead to... it seems to me that they lead to Getting What You Measure. This story about consequentialism and power-seeking has a bunch of extra pieces in it, which aren't particularly necessary for an AGI disaster.
(To be clear, I'm not saying the consequentialist power-seeking deception story is implausible; it's certainly plausible enough that I wouldn't want to build an AGI without being pretty damn certain that it won't happen! Nor am I saying that I buy all the aspects of Get What You Measure - in particular, I definitely expect a much foomier future than Paul does, and I do in fact expect consequentialism to be convergent. The key thing I'm pointing to here is that the consequentialist power-seeking deception story has a bunch of extra assumptions in it, and we still get a disaster with those assumptions relaxed, so naively it seems like we should assign more probability to a story with fewer assumptions.)
That counterargument does at least typecheck, so we're not talking past each other. Yay!
In the context of neurosymbolic methods, I'd phrase my argument like this: in order for the symbols in the symbolic-reasoning parts to robustly mean what we intended them to mean (e.g. standard semantics in the case of natural language), we need to pick the right neural structures to "hook them up to". We can't just train a net to spit out certain symbols given certain inputs and then use those symbols as though they actually correspond to the intended meaning, because <all the usual reasons why maximizing a training objective does not do the thing we intended>.
Now, I'm totally on board with the general idea of using neural nets for symbol grounding and then building interpretable logic-style stuff on top of that. (Retargeting the Search is an instance of that general strategy, especially if we use a human-coded search algorithm.) But interpretability is a necessary step to do that, if we want the symbols to be robustly correctly grounded.
On to the specifics:
A lot of interpretability is about discovering how concepts are used in a higher-level algorithm, and the argument doesn't apply there.
I partially buy that. It does seem to me that a lot of people doing interpretability don't really seem have a particular goal in mind, and are just generally trying to understand what's going on. Which is not necessarily bad; understanding basically anything in neural nets (including higher-level algorithms) will probably help us narrow in on the answers to the key questions. But it means that a lot of work is not narrowly focused on the key hard parts (i.e. how to assign external meaning to internal structures).
One point of using such methods is to enforce or encourage certain high-level algorithmic properties, e.g. modularity.
Insofar as the things passing between modules are symbols whose meaning we don't robustly know, the same problem comes up. The usefulness of structural/algorithmic properties is pretty limited, if we don't have a way to robustly assign meaning to the things passing between the parts.
I think the argument in Deep Learning Systems Are Not Less Interpretable Than Logic/Probability/Etc applies here.
Carrying it over to the car/elephant analogy: we do not have a broken car. Instead, we have two toddlers wearing a car costume and making "vroom" noises. [Edit-To-Add: Actually, a better analogy would be a Flintstones car. It only looks like a car if we hide the humans' legs running underneath.] We have not ever actually built a car or anything even remotely similar to a car; we do not understand the principles of mechanics, thermodynamics or chemistry required to build an engine. We study the elephant not primarily in hopes of making the elephant itself more safe and predictable, but in hopes of learning those principles of mechanics, thermodynamics and chemistry which we currently lack.
Upvoted for content format - I would like to see more people do walkthroughs with their takes on a paper (especially their own), talking about what's under-appreciated, a waste of time, replication expectations, etc.
I do want to evoke BFS/DFS/MCTS/A*/etc here, because I want to make the point that those search algorithms themselves do not look like (what I believe to be most peoples' conception of) babble and prune, and I expect the human search algorithm to differ from babble and prune in many similar ways to those algorithms. (Which makes sense - the way people come up with things like A*, after all, is to think about how a human would solve the problem better and then write an algorithm which does something more like a human.)