Internal Interfaces Are a High-Priority Interpretability Target

AI ALIGNMENT FORUM
AF

tl;dr: ML models, like all software, and as the natural abstraction hypothesis (NAH) would predict, must consist of several specialized "modules". Such modules would form interfaces between each other and exchange consistently formatted messages through them. Understanding the internal data formats of a given ML model should let us comprehend an outsized amount of its cognition, and should let us intervene on it flexibly as well.

1. A Cryptic Analogy

Let's consider three scenarios. In each of them, you're given the source code of a set of unknown programs, and you're tasked with figuring out their exact functionality. Details vary:

In the first scenario, the programs are written in some known programming language, e.g., Python.
In the second scenario, the programs were randomly generated by perturbing machine code until it happened to end up in a configuration that, when run, instantiates a process externally indistinguishable from a useful intelligently-written program.
In the third scenario, the programs are written in a programming language that's completely unfamiliar to you (or to anyone else).

In the first scenario, the task is all but trivial. You read the source code, make notes on it, run parts of it, and comprehend it. It may not be quick, but it's straightforward.

The second scenario is a nightmare. There must be some structure to every program's implementation: something like the natural abstraction hypothesis must still apply, there must be modules in this mess of code that can be understood separately, and so on. There is some high-level structure that you can parcel out into tiny pieces that can fit into a human mind. The task is not impossible.

But suppose you've painstakingly reverse-engineered one of the programs this way. You move on to the next one, and... Yup, you're essentially starting from zero. Well, you've probably figured out something about natural abstractions when working on your first reverse-engineering, so it's somewhat easier, but still a nightmare. Every program is structured in a completely different way, so you have to start from the fundamentals every time.

The third scenario is a much milder nightmare. You don't focus on reverse-engineering the programs here. First, you reverse-engineer the programming language. It's a task that may be as complex as reverse-engineering one of the individual programs from (2), but once you've solved it, you're essentially facing the same problem as in (1), a problem that's comparatively trivial.

The difference between (2) and (3) is:

In (3), every program has a consistent high-level structure, and once you've figured it out, it's a breeze.
In (2), every high-level structure is unique, and you absolutely require some general-purpose tools for inferring high-level structures.

(2) is a parallel to the general problem of interpretability, different programs being different ML models. Is there some interpretability problem that's isomorphic to (3), however?

I argue there is.

2. Interface Theory

Suppose that we have two separate entities with different specializations: they both can do some useful "work", but there are some types of work that only one of them can perform. Suppose that they want to collaborate: combine their specializations to do tasks neither entity can carry out alone. How can they do so?

For concreteness, imagine that the two entities are a Customer, which can provide any resource from the set of resources $R$, and an Artist, which can make any sculpture from some set $S$ given some resource budget. They want to "trade": there's some sculpture $s_{x}$ the Customer wants made, and there are some resources $r_{x}$ the Artist needs to make it. How can they carry out such exchanges?

The Customer needs to send some message $M_{s_{x}}$ to the Artist, such that it would uniquely identify $s_{x}$ among the members of $S$. The Artist, in turn, would need to respond with some message $M_{r_{x}}$, which uniquely identifies $r_{x}$ in the set $R$.

That implies an interface: a data channel between the two entities, such that every message passed along this channel from one entity to another uniquely identifies some action the receiving entity must take in response.

This, in turn, implies the existence of interpreters: some protocols that take in a message received by an entity, and map it onto the actions available to that entity (in this case, making a particular sculpture or passing particular resources).

For example, if the sets of statues and resources are sufficiently small, both entities can agree on some orderings of these sets, and then just pass numbers. "1" would refer to "statue 1" and "resource package 1", and so on. These protocols would need to be obeyed consistently: they'd have to always use the same ordering, such that the Customer saying "1" always means the same thing. Otherwise, this system falls apart.
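The numbering protocol can be sketched in a few lines of Python. This is only an illustration: the sculpture and resource names, and the helper functions `encode` and `interpret`, are hypothetical and do not come from any real system. The point is that both parties must share one fixed ordering.

```python
from typing import Sequence

# Hypothetical shared orderings, agreed upon once and never changed.
SCULPTURES = ["dancer", "horse", "seated figure"]
RESOURCES = ["bronze package", "marble package", "clay package"]


def encode(item: str, ordering: Sequence[str]) -> int:
    """Sender side: turn an item into its 1-based position in the ordering."""
    return ordering.index(item) + 1


def interpret(message: int, ordering: Sequence[str]) -> str:
    """Receiver side: map a received number back onto an item."""
    return ordering[message - 1]


# The Customer requests a sculpture by number.
request = encode("horse", SCULPTURES)
# The Artist's interpreter recovers which sculpture was meant.
requested = interpret(request, SCULPTURES)

# If the Artist used a different ordering, the same number would be misread.
reordered = list(reversed(SCULPTURES))
misread = interpret(encode("dancer", SCULPTURES), reordered)
```

Here `requested` matches what the Customer asked for, while `misread` does not: the message "1" only carries meaning relative to an ordering that both sides hold fixed.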

Now, suppose that a) the sets are really quite large, such that sorting them would take a very long time, yet b) members of sets are modular, in the sense that each member of a set is a composition of members of smaller sets. E.g., each sculpture can be broken down into the number of people to be depicted in it, the facial expressions each person has, what height each person is, etc.

In this case, we might see this modularity reflected in the messages. Instead of specifying statues/resources holistically, they'd specify modules: "number of people: N, expression on person 1: type 3...", et cetera.

To make use of this, the Artist's interpreter would need to know what "expression on person 1" translates to, in terms of specifications over $S$. And as we've already noted, this meaning would need to be stable across time: "expression on person 1" would need to always mean the same thing, from request to request.

That, in turn, would imply data formats. Messages would have some consistent structure, such that, if you knew the rules by which these structures are generated, you'd be able to understand any message exchanged between the Customer and the Artist at any point.
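As a rough sketch of what such a data format could look like, the Python below encodes a modular sculpture request in a fixed `key=value` layout. Everything here is an assumption made for illustration: the `SculptureRequest` fields, the `serialize` and `parse` helpers, and the textual layout itself. It assumes at least one person per sculpture.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SculptureRequest:
    """Hypothetical modular message: one field per component of a sculpture."""
    expressions: Tuple[int, ...]  # expression type for each person
    heights_cm: Tuple[int, ...]   # height of each person


def serialize(req: SculptureRequest) -> str:
    """Write the request in the fixed layout 'n=...;expr=...;height=...'."""
    return "n={};expr={};height={}".format(
        len(req.expressions),
        ",".join(str(e) for e in req.expressions),
        ",".join(str(h) for h in req.heights_cm),
    )


def parse(message: str) -> SculptureRequest:
    """Interpreter: anyone who knows the layout can decode any message."""
    fields = dict(part.split("=", 1) for part in message.split(";"))
    n = int(fields["n"])
    expressions = tuple(int(x) for x in fields["expr"].split(","))
    heights = tuple(int(x) for x in fields["height"].split(","))
    if len(expressions) != n or len(heights) != n:
        raise ValueError("field lengths disagree with n")
    return SculptureRequest(expressions, heights)


original = SculptureRequest(expressions=(3, 1), heights_cm=(170, 165))
message = serialize(original)
decoded = parse(message)
```

The `parse` function is written once, against the format rather than against any particular request, and it then handles every request the Customer could ever send.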

3. Internal Interfaces

Suppose that we have an ML model with two different modules, each of them specialized for performing different computations. Inasmuch as they'll work together, we'll see the same dynamics between them as between the Customer and the Artist: they'll arrive at some consistent data formats for exchanging information. They'll form an internal interface.

Connecting this with the example from Section 1, I think this has interesting implications for interpretability. Namely: internal interfaces are the most important interpretability targets.

Why? Two reasons:

Cost-efficiency. Understanding a data format would allow us to understand everything that can be specified using this data format, which may shed light on entire swathes of a model's cognition. And as per the third scenario in Section 1, it should be significantly easier than understanding an arbitrary feature, since every message would follow the same high-level structure.

Editability. Changing an ML model's weights while causing the effects you want and only the effects you want is notoriously difficult: often, a given feature is only responsible for part of the effect you're ascribing to it, or it has completely unexpected side-effects. An interface, however, is a channel specifically built for communication between two components. It should have minimal side-effects, and intervening on it should have predictable results.
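A toy illustration of the editability point, in Python. It assumes a hypothetical system whose two modules communicate only through one explicit message. The modules `perceive` and `act`, the message fields, and the threshold are all invented for the example, and real networks do not expose their interfaces this cleanly.

```python
from typing import Optional


def perceive(observation: dict) -> dict:
    """Module A: compress a raw observation into an interface message."""
    return {
        "object": observation["label"],
        "distance_m": round(observation["distance_mm"] / 1000, 1),
    }


def act(message: dict) -> str:
    """Module B: choose an action using only the interface message."""
    if message["object"] == "obstacle" and message["distance_m"] < 2.0:
        return "brake"
    return "continue"


def run(observation: dict, intervention: Optional[dict] = None) -> str:
    """Run A then B, optionally overwriting fields of the message in between."""
    message = perceive(observation)
    if intervention is not None:
        message = {**message, **intervention}
    return act(message)


observation = {"label": "obstacle", "distance_mm": 1500}
baseline = run(observation)                      # message left untouched
edited = run(observation, {"distance_m": 10.0})  # one field overwritten
```

Because `act` reads nothing except the message, overwriting a single field changes its behaviour in the way the format predicts, without touching the internals of either module.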

As far as controlling the AI's cognition goes, internal interfaces are the most high-impact points.

4. What Internal Interfaces Can We Expect?

This question is basically synonymous with "what modules can we expect?". There are currently good arguments for the following:

The world-model (WM), i.e., a repository of abstractions.
A set of contextually-activated heuristics/shards.
A "planner"/method of general-purpose search (GPS).

Here's a breakdown of the potential interfaces as I see them:

1. WM → Shards. Each shard would learn to understand some part of the world-model, and that's what would provide the "context" for their activations.
- However, I expect this is less of an interface, and more of a set of interfaces. I expect that each shard learns to interface with the WM in its own way, so they wouldn't follow consistent input formats.
2. Shards → Shards: a "shard coordination mechanism" (SCM). Shards necessarily conflict as part of their activity, bidding for mutually incompatible plans/actions, and there needs to be some mechanism for resolving such conflicts. That mechanism would need to know about any possible conflicts, however, meaning all the shards would need to interface with it and signal to it how much each of them wants to fire in a given case.
3. SCM → GPS. Shards influence and shape plan-making, and I suspect that the GPS would appear after the SCM, so the GPS would just be plugged into an already-existing mechanism for summarizing shard activity.
4. WM → GPS. The GPS is, in a sense, just a principled method for drawing upon the knowledge in the world-model, so interfacing with the WM is its primary function.
5. GPS → WM. I suspect that another primary function of the GPS is to organize/modify the world-model at runtime, such as by discovering new abstractions or writing new heuristics to it. As such, it'd need write-access as well.
6. GPS → GPS. The planner would need to understand its own past thoughts, for the purposes of long-term planning. (Though part/most of them might be just written to the world-model directly, i.e., there's no separate format for the GPS's plans.)
7. World → Agent. In a sense, the world-model isn't just a module, it's itself an interface between the ground truth of the world and the decision-making modules of the agent (shards and the GPS)! Thus, we can expect all elements of the world-model (concepts, abstractions...) to be consistently-formatted.
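For bookkeeping, the seven candidate interfaces can be written down as a small directed graph. The Python below is only a sketch: the numbering follows the order of the list, and treating the SCM as its own node and "Agent" as a single node standing for the shards plus the GPS are simplifying assumptions made here for illustration.

```python
# Candidate interfaces as (number, sender, receiver) triples.
INTERFACES = [
    (1, "WM", "Shards"),
    (2, "Shards", "SCM"),
    (3, "SCM", "GPS"),
    (4, "WM", "GPS"),
    (5, "GPS", "WM"),
    (6, "GPS", "GPS"),
    (7, "World", "Agent"),
]


def interfaces_touching(module: str) -> list:
    """Return the numbers of all interfaces the module sends or receives on."""
    return [n for n, src, dst in INTERFACES if module in (src, dst)]


gps_interfaces = interfaces_touching("GPS")
wm_interfaces = interfaces_touching("WM")
```

Under these assumptions, the GPS sits on four of the seven interfaces and the WM on three, which gives a rough sense of how much of the system each decoded format would expose.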

"Cracking" any of these interfaces should give us a lot of control over the ML model: the ability to identify specific shards, spoof or inhibit their activations, literally read off the AI's plans, et cetera.

I'm particularly excited about (7) and (5), though: if the world-model is itself an interface, it must be consistently-formatted in its entirety, and getting the ability to arbitrarily navigate and edit it would, perhaps, allow us to solve the entire alignment problem on the spot. (A post elaborating on this is upcoming.)

(6) is also extremely high-value, but might be much more difficult to decode.

5. Research Priorities

This implies that identifying internal modules in ML models, and understanding what "modularity" even is, is highly important. I'd like to signal-boost the work done by TheMcDouglas, Avery, and Lucius Bushnaq here, which focuses on exactly this.
More broadly, selection theorems focused on constraining an agent's internal structure, and therefore identifying potential modules all agents would have within themselves, are very useful research avenues.
Alternatively, we can think about this from the opposite angle. Instead of wondering what interfaces/modules arise naturally, we can consider how to artificially encourage specific modules/interfaces. As an example, here's an old paper that trains a separate world-model, then a much simpler controller/agent on top of it. Replicating this feat for AGI-level models would be incredibly valuable (although non-trivially difficult).
Yet another perspective: instead of focusing on interfaces, we can try to come up with ways to incentivize consistent internal structures.
- For example, consider shards. While they'd all have interfaces with the WM (inputs) and the SCM (outputs), there's no pressure at all for them to have the same internal structure. Coming back to Section 1, they'd resemble randomly-generated programs (scenario 2), not programs written in some consistent (if unfamiliar to us) programming language (scenario 3).
- Is there a way to change this? To somehow encourage the ML model to give (a subset of) its components the same high-level structure, so that we may understand all of them in one fell swoop?
In addition: Interpreting an AGI-level model in an ad-hoc "manual" manner, feature-by-feature, as we do it now, would obviously be a hopeless task. Too many features. One way of dealing with this is to build general-purpose tools for discovering high-level structures. But leveraging interfaces potentially allows a different path: a way of understanding vast chunks of a model's cognition using only ad-hoc interpretability methods.
- I wouldn't actually place my hopes on this alternative: we'd essentially need to wait until we actually have the AGI in our hands to even begin this work, and it'll probably be too late by then. But if we somehow fail to get said general-purpose tools for discovering structure, it may not be totally hopeless.
