tl;dr: ML models, like all software, and like the NAH would predict, must consist of several specialized "modules". Such modules would form interfaces between each other, and exchange consistently-formatted messages through these interfaces. Understanding the internal data formats of a given ML model should let us comprehend an outsized amount of its cognition, and allow to flexibly interfere in it as well.
Let's consider three scenarios. In each of them, you're given the source code of a set of unknown programs, and you're tasked with figuring out their exact functionality. Details vary:
In the first scenario, the task is all but trivial. You read the source code, make notes on it, run parts of it, and comprehend it. It may not be quick, but it's straightforward.
The second scenario is a nightmare. There must be some structure to every program's implementation — something like the natural abstraction hypothesis must still apply, there must be modules in this mess of a code that can be understood separately, etc. There is some high-level structure that you can parcel out into tiny pieces that can fit into a human mind. The task is not impossible.
But suppose you've painstakingly reverse-engineered one of the programs this way. You move on to the next one, and... Yup, you're essentially starting from zero. Well, you've probably figured out something about natural abstractions when working on your first reverse-engineering, so it's somewhat easier, but still a nightmare. Every program is structured in a completely different way, you have to start from the fundamentals every time.
The third scenario is a much milder nightmare. You don't focus on reverse-engineering the programs here — first, you reverse-engineer the programming language. It's a task that may be as complex as reverse-engineering one of the individual programs from (2), but once you've solved it, you're essentially facing the same problem as in (1) — a problem that's comparably trivial.
The difference between (2) and (3) is:
(2) is a parallel to the general problem of interpretability, different programs being different ML models. Is there some interpretability problem that's isomorphic to (3), however?
I argue there is.
Suppose that we have two separate entities with different specializations: they both can do some useful "work", but there are some types of work that only one of them can perform. Suppose that they want to collaborate: combine their specializations to do tasks neither entity can carry out alone. How can they do so?
For concreteness, imagine that the two entities are a Customer, which can provide any resource from the set of resources R, and an Artist, which can make any sculpture from some set S given some resource budget. They want to "trade": there's some sculpture sx the Customer wants made, and there are some resources rx the Artist needs to make it. How can they carry out such exchanges?
The Customer needs to send some message Msx to the Artist, such that it would uniquely identify sx among the members of S. The Artist, in turn, would need to respond with some message Mrx, which uniquely identifies rx in the set R.
That implies an interface: a data channel between the two entities, such that every message passed along this channel from one entity to another uniquely identifies some action the receiving entity must take in response.
This, in turn, implies the existence of interpreters: some protocols that take in a message received by an entity, and map it onto the actions available to that entity (in this case, making a particular sculpture or passing particular resources).
For example, if the sets of statues and resources are sufficiently small, both entities can agree on some orderings of these sets, and then just pass numbers. "1" would refer to "statue 1" and "resource package 1", and so on. These protocols would need to be obeyed consistently: they'd have to always use the same ordering, such that the Customer saying "1" always means the same thing. Otherwise, this system falls apart.
Now, suppose that a) the sets are really quite large, such that sorting them would take a very long time, yet b) members of sets are modular, in the sense that each member of a set is a composition of members of smaller sets. E. g., each sculpture can be broken down into the number of people to be depicted in it, the facial expressions each person has, what height each person is, etc.
In this case, we might see this modularity reflected in the messages. Instead of specifying statues/resources holistically, they'd specify modules: "number of people: N, expression on person 1: type 3...", et cetera.
To make use of this, the Artist's interpreter would need to know what "expression on person 1" translates to, in terms of specifications-over-S. And as we've already noted, this meaning would need to be stable across time: "expression on person 1" would need to always mean the same thing, from request to request.
That, in turn, would imply data formats. Messages would have some consistent structure, such that, if you knew the rules by which these structures are generated, you'd be able to understand any message exchanged between the Customer and the Artist at any point.
Suppose that we have a ML model with two different modules, each of them specialized for performing different computations. Inasmuch as they'll work together, we'll see the same dynamics between them as between the Customer and the Artist: they'll arrive at some consistent data formats for exchanging information. They'll form an internal interface.
Connecting this with the example from Section 1, I think this has interesting implications for interpretability. Namely: internal interfaces are the most important interpretability targets.
Why? Two reasons:
Cost-efficiency. Understanding a data format would allow us to understand everything that can be specified using this data format, which may shed light on entire swathes of a model's cognition. And as per the third example in Section 1, it should be significantly easier than understanding an arbitrary feature, since every message would follow the same high-level structure.
Editability. Changing a ML model's weights while causing the effects you want and only the effects you want is notoriously difficult — often, a given feature is only responsible for a part of the effect you're ascribing to it, or it has completely unexpected side-effects. An interface, however, is a channel specifically built for communication between two components. It should have minimal side-effects, and intervening on it should have predictable results.
As far as controlling the AI's cognition goes, internal interfaces are the most high-impact points.
This question is basically synonymous with "what modules can we expect?". There are currently good arguments for the following:
Here's a breakdown of the potential interfaces as I see them:
"Cracking" any of these interfaces should give us a lot of control over the ML model: the ability to identify specific shards, spoof or inhibit their activations, literally read off the AI's plans, et cetera.
I'm particularly excited about (7) and (5), though: if the world-model is itself an interface, it must be consistently-formatted in its entirety, and getting the ability to arbitrarily navigate and edit it would, perhaps, allow us to solve the entire alignment problem on the spot. (A post elaborating on this is upcoming.)
(6) is also extremely high-value, but might be much more difficult to decode.
Last point: if we change the name of "world model" to "long-term memory", we may notice the possibility that much of what you think about as shard-work may be programs stored in memory, and executed by a general-program-executor or a bunch of modules that specializes in specific sorts of programs, functioning as modern CPUs/interpreters (hopefully, stored in organised way that preserve modularity). What will be in the general memory and what is in the weights themselves is non-obvious, and we may want to intervene in this point too (not sure I'm which direction).
(continuation of the same comment - submitted by mistake and cannot edit...)
Assuming modules A,B are already "talking" and module C try to talk to B, C would probably find it easier to learn a similar protocol than to invent a new one and teach it to B