The frequently accompanying, action-relevant claim -- that substantially easier-to-interpret alternatives exist -- is probably false and distracts people with fake options. That's my main thesis.
I agree with this claim (anything inherently interpretable in the conventional sense seems totally doomed). I do want to push back on an implicit vibe of "these models are hard to interpret because of the domain, not because of the structure" though - interpretability is really fucking hard! It's possible, but these models are weird and cursed and rife with bullshit like superposition, it's hard to figure out how to break them down into interpretable units, and they're full of illusions and confused, janky representations, etc.
I don't really have a coherent and rigorous argument here, I just want to push back on the vibe I got from your "interpretability has many wins" section - it's really hard!
I like this post overall. I'd guess that connectionism is the real-world way to do things, but I don't think I'm quite as confident that there aren't alternatives. I basically agree with the main points, though.
to mind control a [maze-solving agent] to pursue whatever you want with just a single activation
I want to flag that this isn't quite what we demonstrated. It's overselling our results a bit. Specifically, we can't get the maze-solving agent to go "wherever we want" via the demonstrated method. Rather, in many mazes, I'd estimate we can retarget the agent to end up in about half of the maze locations. Probably a slight edit would fix this, like changing "whatever you want" to "a huge range of goals."
It has become common on LW to refer to "giant inscrutable matrices" as a problem with modern deep-learning systems.
To clarify: deep learning models are trained by creating giant blocks of random numbers -- blocks with dimensions like 4096 x 512 x 1024 -- and incrementally adjusting the values of these numbers with stochastic gradient descent (or some variant thereof). In raw form, these giant blocks of numbers are of course completely unintelligible. Many hold that the use of such giant SGD-trained blocks is why it is hard to understand or to control deep learning models, and therefore we should seek to make ML systems from other components.
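To make that concrete, here is a minimal sketch (plain NumPy, made-up dimensions, a synthetic target) of what "incrementally adjusting the values with stochastic gradient descent" means. Real models stack many such blocks with nonlinearities between them, but the core loop is the same: compute a loss, compute its gradient, and nudge every number a little.

```python
import numpy as np

rng = np.random.default_rng(0)

# A (much smaller) stand-in for one such block: a 512 x 1024 matrix of random numbers.
W = rng.normal(scale=0.02, size=(512, 1024))
W_true = rng.normal(scale=0.02, size=(512, 1024))  # hypothetical "true" mapping to learn

lr = 0.1
for step in range(1_000):
    x = rng.normal(size=(32, 512))   # a batch of synthetic inputs
    y = x @ W_true                   # synthetic targets
    pred = x @ W                     # forward pass through the block
    err = pred - y
    loss = 0.5 * np.mean(err ** 2)   # how wrong the block currently is
    grad = x.T @ err / err.size      # gradient of the loss w.r.t. every entry of W
    W -= lr * grad                   # one SGD step: adjust all ~500k numbers slightly
```

After training, W does something useful, but nothing in this loop ever produces a human-legible account of why any particular entry has the value it has.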
There are several places where Yudkowsky or others state or strongly imply that because SGD-trained models with huge matrices are unintelligible, we should seek some more easily-interpretable paradigm.
I'm going to argue against that. I think that a better alternative is probably not possible; that the apparent inscrutability of these models actually has little-to-nothing to do with deep learning; and finally that this language -- particularly to the non-rationalist -- suggests unwarranted mystery.
0: It Is Probable That Generally Intelligent Systems Must Be Connectionist
Imagine a universe in which it is impossible to build a generally intelligent system that is not massively connectionist. That is, imagine a world where the only way to get intelligence from atoms is to have a massive number of simple, uniform units connected to each other -- or something that is a functional equivalent of the same.
In such a world, all smart animals would have become smart by scaling up the number of such units that they have. The dominant evolutionary species might become intelligent by scaling up its head size, despite paying the evolutionary cost of making childbirth dangerous and painful by doing so. Flying species that could not afford the extra weight of a scaled-up skull might take another approach, shrinking their neurons to pack more of them into a given volume. Even animals far distant from the dominant species on the phylogenetic tree, animals in which high intelligence evolved entirely separately, would become intelligent by scaling up their brains.
The dominant species, once it could make information-processing equipment, might try for many years to make some generally intelligent system without massive connectionism. They might scorn connectionism as brute force, or as lacking insight; thousands of PhDs and software engineers would spend time devising specialized systems for image classification, voice transcription, language translation, video analysis, natural language processing, and so on. But once they coded up connectionist software -- then, within a handful of years, the prior systems built through hundreds of thousands of hours of effort would fall to simple systems that an undergrad could put together in his spare time. And connectionist systems would quickly vault past the realm of such prior systems, building things entirely out of reach of non-connectionist approaches.
Of course, such a world would be indistinguishable from our world.
Is this proof that intelligence must be connectionist? Of course not. We still await a Newton who might build a detailed causal model of intelligence, one which would confirm or refute the above.
But if the universal failure of nature and man to find non-connectionist forms of general intelligence -- after millions of years and millions of man-hours of searching -- does not move you, well, you could be right, but I'd really like to see the predictions an alternative hypothesis makes.
1.a: Among Connectionist Systems That We Know To Be Possible, Synchronous Matrix Operations Are the Most Interpretable
Given that general intelligence may well require connectionist systems, what are the most interpretable connectionist systems that we can imagine? What alternatives to matrix multiplications do we know are out there?
Well, our current systems could be more biologically inspired! They could work through spike-timing-dependent plasticity (STDP) neurons. We know these are possible, because biological brains exist. But such systems would be a nightmare to interpret, because they work asynchronously, in time-separated bundles of neuronal firing. Interpreting asynchronous systems is almost always far more difficult than interpreting synchronous systems.
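For a sense of why, here is a toy pair-based STDP update for a single synapse (the constants and spike trains below are made up for illustration). The learned weight depends on the full relative timing history of pre- and post-synaptic spikes, not on any single static quantity you could simply read off.

```python
import numpy as np

rng = np.random.default_rng(0)

A_plus, A_minus = 0.01, 0.012   # potentiation / depression magnitudes (made-up values)
tau = 20.0                      # STDP time constant in milliseconds (made-up value)
w = 0.5                         # starting synapse strength

pre_spikes = np.sort(rng.uniform(0, 1000, size=50))    # pre-synaptic spike times (ms)
post_spikes = np.sort(rng.uniform(0, 1000, size=50))   # post-synaptic spike times (ms)

# Classic pair-based rule: pre-before-post strengthens, post-before-pre weakens,
# with the size of the change decaying exponentially in the timing gap.
for t_post in post_spikes:
    for t_pre in pre_spikes:
        dt = t_post - t_pre
        if dt > 0:
            w += A_plus * np.exp(-dt / tau)
        else:
            w -= A_minus * np.exp(dt / tau)

print(f"final synapse strength: {w:.3f}")
```

To explain what such a network has learned, you have to reason about when things fired relative to each other, on top of everything a synchronous matrix already makes you reason about.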
Or the calculations of our connectionist system could take place in non-digital hardware! Rather than as arbitrarily transportable digital files, the weights could be stored in actual physical, artificial neurons that implement STDP or backpropagation on an analog device. Or you could use something even more biologically inspired -- something like the cloned-neocortex-in-a-pan that Peter Watts imagined. But in such a neuron-inspired substrate, it could be a massive undertaking to do something as simple as reading off a synapse strength. Once again, interpretability would be harder.
I don't want to claim too much. I don't think current systems are at the theoretical apex of interpretability, not least because people can suggest ways to make them more interpretable.
But -- of all the ways we know general intelligence can be built, synchronous matrix operations are by far the easiest to understand.
1.b: And the Hard-To-Interpret Part of Matrices Comes From the Domain They Train On, And Not Their Structure
(Even in worlds where the above two points are false, I think this one is still probably true, although it is less likely.)
There are many clear interpretability successes for deep learning.
Small cases of grokking have been successfully reverse engineered. The interpretability team at OpenAI could identify neurons as abstract as the "Pokemon" neuron or the "Catholicism" neuron two years ago -- the same people, now at Anthropic, work on transformer circuits. It is possible to modify an LLM so it thinks that the Eiffel Tower is in Rome, or to mind control a maze-solving agent to pursue a wide range of goals with just a single activation -- which, to my mind, reaches for the summit of interpretability, because understanding should enable control.
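(The maze-agent result is, roughly, an activation-addition intervention: add a fixed vector to one layer's activations at inference time, and the behavior shifts. Here is a minimal sketch of that idea in PyTorch; the toy network, the choice of layer, and the random steering vector are placeholders for the quantities the actual papers compute from the model itself.)

```python
import torch
import torch.nn as nn

# Toy stand-in for a policy or language model's layer stack.
model = nn.Sequential(
    nn.Linear(16, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),
    nn.Linear(32, 4),
)

steering_vector = torch.randn(32)  # in practice, derived from the model's own activations

def add_steering(module, inputs, output):
    # Shift this layer's activations by a fixed vector on every forward pass.
    return output + steering_vector

handle = model[2].register_forward_hook(add_steering)  # hook a chosen intermediate layer

x = torch.randn(1, 16)
steered_logits = model(x)  # outputs now reflect the injected activation
handle.remove()            # detach the hook to restore the unmodified model
```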
All this and more is true, but still -- the vast majority of weights in larger models like GPT-4 have not been so reverse engineered. Doesn't that point to something fundamentally wrong about the gradient-descent-over-big-matrices-paradigm?
Well, I ask you -- do you think any other ML model, trained over the domain of all human text, with sufficient success to reach GPT-4 level perplexity, would turn out to be simpler?
I propose that the deeply obvious answer, once you ask the question, is that they would not.
ML models form representations suitable to their domain. Image models build up a hierarchy of feature detectors, moving from the simpler to the more complex -- line detectors, curve detectors, eye detectors, face detectors, and so on. But the space of language is larger than the space of images! We can discuss anything that exists, that might exist, that did exist, that could exist, and that could not exist. So no matter what form your predict-the-next-token language model takes, if it is trained over the entire corpus of the written word, the representations it forms will be pretty hard to understand, because they encode an understanding of the entire world.
So, I predict with high confidence that any ML model that can reach the perplexity levels of Transformers will also present great initial interpretive difficulty.
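(For concreteness: perplexity, the yardstick in the question above, is just the exponential of the model's average per-token negative log-likelihood on held-out text. A toy computation:)

```python
import math

# Hypothetical probabilities a language model assigned to five held-out tokens.
token_probs = [0.21, 0.05, 0.60, 0.33, 0.12]

avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(avg_nll)
print(f"perplexity = {perplexity:.2f}")  # lower means the model is less "surprised"
```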
2: Inscrutability is in Ourselves and Not the Stars
Imagine an astronomer in the year 1600 who frequently refers to the "giant inscrutable movements" of the stars. He looks at the vast tables of detailed astronomical data emerging from Tycho Brahe's observatory, and remarks that we might need to seek an entirely different method of understanding the stars, because this does not look promising.
Such an astronomer might be very knowledgeable, and might know the deep truths by heart: Our confusion about a thing is not a part of the thing; it is a feature of our minds and not of the world. Mystery and awe in our understanding of a thing are not in the thing itself. Nothing is inscrutable. Everything can be understood.
But in speaking of inscrutability to his students or to the less sophisticated, he would not be helping people towards knowledge. And of course, his advice would have pointed directly away from the knowledge that let Kepler discover his laws, because it was Tycho Brahe's plethora of tables that enabled Kepler.
Fin
So, I think talking about "giant inscrutable matrices" promotes unclear thought; it confuses map and territory.
The frequently accompanying, action-relevant claim -- that substantially easier-to-interpret alternatives exist -- is probably false and distracts people with fake options. That's my main thesis.
This could be very bad news. Particularly if you're pessimistic about interpretability and have short timelines. Not having easier alternatives to G.I.M. doesn't actually make matrices any easier to interpret. Enormous quantities of intellectual labor have been done, and remain yet to do.
Still. In some of my favorite research shared on LW, some shard theory researchers speculate about the future, based on their experience of how easy it was to wipe knowledge from a maze-solving agent:
Really simple stuff turned out to work for capabilities. It could also work out for interpretability.