Predictions for shard theory mechanistic interpretability results

Ulisse Mini; peligrietzer

I really like this idea! Making advance predictions feels like a much more productive way to engage with other people's work (modulo trusting you to have correctly figured out the answers).

Predictions below (note that I've chatted with the team about their results a bit, and so may be a bit spoiled - I'll try to simulate what I would have predicted without spoilers)

Behavioral

Describe how the trained policy might generalize from the 5x5 top-right cheese region, to cheese spawned throughout the maze? IE what will the policy do when cheese is spawned elsewhere?

It'll still go to the top right, but when it's near the cheese (within a radius 5 square centered on the cheese) it'll go there instead - the simplest algorithm is "go to the right in general" and "when near the cheese navigate to it". But because the top right of the maze moves position (the varying maze size in particular makes this messy) and it's a convnet, it's maybe easiest to learn the "go to the cheese when nearby" algorithm to be translation invariant. I predict cheese-seeking will drop off sharply with distance, maybe literally "within 5 squares = navigate, > 5 squares = don't navigate", maybe not. But maybe the top right square can contain disconnected regions, so the mouse will need to calculate which region to get to, rather than just "go to the top right"? In which case it'll probably be good at reaching the cheese from much further away.

Given a fixed trained policy, what attributes of the level layout (e.g. size of the maze, proximity of mouse to left wall) will strongly influence P(agent goes to the cheese)?

Distance from the cheese is the main thing - size of maze, proximity to left wall etc just modulate the distance from the cheese. I think absolute Euclidean distance from the cheese matters, not distance in maze space.

Write down a few guesses for how the trained algorithm works (e.g. “follows the right-hand rule”).

The maze is a tree, so right-hand rule should totally work. But it has enough parameters that it can probably learn something more sophisticated, if it's incentivised to get to the cheese fast? (I can't find whether it has time discounted reward or not in the paper - let's assume it either does, or has a fixed episode length, so that either way it wants to get there fast). There's the cheap optimisation of "don't go down shallow dead-ends" which I'm sure it's learned. I don't know whether it has learned enough to actually identify and avoid deep dead ends? I'd set up a pathological situation with two long branches of the tree, both ending in the top right, and swap which branch the cheese is in, and look at the model behaviour.
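For reference, a minimal sketch of what "follows the right-hand rule" means on a grid maze. This is an illustration only: the grid encoding (`#` for walls), the function name `right_hand_walk`, and the starting heading are all assumptions, and nothing here is a claim about what the network computes.

```python
# Headings in clockwise order: up, right, down, left.
DIRS = [(-1, 0), (0, 1), (1, 0), (0, -1)]

def right_hand_walk(grid, start, goal, max_steps=10_000):
    """Walk a grid maze by always preferring the right-most open turn.

    grid is a list of strings where '#' is a wall (hypothetical encoding).
    Returns the list of visited cells ending at goal, or None on failure.
    """
    def is_open(r, c):
        return 0 <= r < len(grid) and 0 <= c < len(grid[r]) and grid[r][c] != "#"

    pos, heading = start, 0  # assume the walker starts facing up
    path = [pos]
    for _ in range(max_steps):
        if pos == goal:
            return path
        # Try right, then straight, then left, then turning back.
        for turn in (1, 0, -1, 2):
            d = (heading + turn) % 4
            nxt = (pos[0] + DIRS[d][0], pos[1] + DIRS[d][1])
            if is_open(*nxt):
                pos, heading = nxt, d
                path.append(pos)
                break
        else:
            return None  # no open neighbour at all
    return path if pos == goal else None

maze = [
    "#####",
    "#S# #",
    "# # #",
    "#  C#",
    "#####",
]
path = right_hand_walk(maze, start=(1, 1), goal=(3, 3))
```

Because the maze is a tree, the walker eventually passes through every reachable cell, so it always finds the cheese, but it may walk down dead ends first.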

As for the mechanistic algorithm, I'm not sure! ConvNets really don't seem like the right architecture to naturally model mazes lol. I'm guessing some kind of recursive divide and conquer? In general, divide the maze up into patches, and map out the graph structure of the patch (which of the points on the edge can get to each other and which can't), and then repeatedly merge patches into bigger patches (naively using different channels for each pair of points on the edge - maybe this is too expensive?). And then for patches with the cheese, instead track which points can access the cheese, and for patches with the mouse, track the movement required to get the mouse one step towards each point on the edge.

Refined idea: because it's a tree, we can actually get a lot of efficiency out. For each patch, it'd suffice to track which points on the edge are in the same connected components. When we merge two patches, adjacent points get their connected components merged, and points adjacent to a wall don't get anything merged. If you're merging with a patch with a point connected to the cheese, this is now the "contains the cheese" connected component. I don't have a great picture of how to translate this algorithm into neurons and matmuls though - the thorniest bit is representing which points are in the same connected component or not, without being able to dynamically re-allocate channels. Maybe having a channel for each point on the perimeter of the patch, and having a 1 in that channel means "is in the same connected component" and having a 0 means not?
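A minimal sketch of the bookkeeping in the refined idea, written as ordinary Python rather than anything a convnet would literally implement. It assumes a union-find structure over edge points, with a flag marking the component that contains the cheese. `UnionFind`, `merge_patches`, and the point labels are hypothetical names for illustration.

```python
class UnionFind:
    """Tracks which patch-edge points share a connected component."""

    def __init__(self):
        self.parent = {}
        self.has_cheese = {}  # only meaningful at component roots

    def add(self, point, cheese=False):
        if point not in self.parent:
            self.parent[point] = point
            self.has_cheese[point] = cheese

    def find(self, point):
        # Path halving keeps the trees shallow.
        while self.parent[point] != point:
            self.parent[point] = self.parent[self.parent[point]]
            point = self.parent[point]
        return point

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        self.parent[rb] = ra
        # The merged component contains the cheese if either half did.
        self.has_cheese[ra] = self.has_cheese[ra] or self.has_cheese[rb]

    def reaches_cheese(self, point):
        return self.has_cheese[self.find(point)]


def merge_patches(uf, open_pairs):
    """Merge two neighbouring patches.

    open_pairs lists (p, q) edge points that are adjacent and both open.
    Pairs separated by a wall are simply left out, so nothing is merged.
    """
    for p, q in open_pairs:
        uf.union(p, q)


uf = UnionFind()
for point in ("a1", "a2", "b2"):
    uf.add(point)
uf.add("b1", cheese=True)             # b1 is connected to the cheese in its patch
merge_patches(uf, [("a1", "b1")])     # a2 and b2 face a wall, so they are not listed
```

After the merge, `a1` sits in the "contains the cheese" component and `a2` does not. The open question from the paragraph above remains: this relies on dynamically relabelling components, which is exactly what fixed channels make awkward.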

Alternately, it's plausible that that algorithm blows up when your patches get too big. If the model only does local search when far away from the cheese (eg searching the square of radius 8 around it), maybe patches never need to get bigger than that, and the model can just learn a bias towards the top right?

Is there anything else you want to note about how you think this model will generalize?

A lot of my confusions stem from how complex an algorithm I expect the agent to learn. It sounds like the training is varied enough that it should learn a near optimal algorithm, though I don't know how rare weird edge cases are (eg a maze with two long branches, both ending in the top right, where the model needs to figure out which contains the cheese).

Model Editing

Without proportionally reducing top-right corner attainment by more than 25% in decision-square-containing mazes (e.g. 50% → .5*.75 = 37.5%), we can patch activations so that the agent has an X% proportional reduction in cheese acquisition, for X=

I don't really understand the question - you call patching from "the same maze but with the cheese in the top right" trivial - I actually think this is where the interesting setting is! The challenge is to find the minimal set of activations that need to be patched.

2

60? It'd have been 70, but I deduct 10 points for maybe screwing up probe training. The model should care about the question of cheese location and want to represent it. It might just do this in the top right, but I think it's easier to do in general: given convolutional-ness, it doesn't have the chance to break translational symmetry so early. It's not clear to me if the model wants to represent Cartesian coordinates though - in particular, each channel can only represent things relative to its current position because convolutions, though your probe can break symmetry, which might be enough? It's unclear whether the model will prefer relative position in the maze, relative position to the mouse, or absolute position (Cartesian coordinates), but because convolutions it probably can only learn Cartesian coordinates so early on. I can't decide if 70% is high or low for accuracy lol.

3

65 - this just seems like what must obviously be going on, unless this algorithm just gets worse performance in pathological cases where cheese position matters when far away from the mouse (but cheese is still in the top right), and these occur often enough to learn a general algorithm. Though even then, it's probably easier to learn something that only works if the cheese is up and to the right of the mouse? Though it's unclear to me whether cheese finding = locally find the cheese or globally find the cheese.

4

I don't really get what this means - what does "more promising" and "we will conclude" mean? RL fine-tuning will obviously work (I think), the question is whether editing and intervention can work.

5

This is a hard one! If the model has a local only cheese circuit, it may be about as hard to learn a general cheese algorithm as the whole thing. It may also be hard to unlearn the existing algorithm, idk how this kind of thing works. But it'll also have learned a lot of the key basic primitives already, so the rest should be easy. And 10% of the compute used in training and 10% of the minimal compute required to be decent at maze solving are not necessarily the same thing, depending on when the model stopped training!

In at least 75% of randomly generated mazes, we can easily edit the network to navigate to a range of maze destinations (e.g. coordinate x=4, y=7), by hand-editing at most X% of activations, for X=
.01 (15%)
.1 (20%)
1 (22%)
10 (25%)
(Not possible) (75%)

(Interpreting the question as wanting probability of "at most that many" rather than mutually exclusive probabilities). It'll also depend on what you mean by X% of the activations - if you treat each channel for each height and width as a separate activation this should be way easier. But my guess is that 75% of mazes and specific position seems way too high.

I'd guess that if there is a "cheese channel" in an early filter, editing that should be easy. The core question is whether there's a general cheese finding circuit triggered by that, even in a different quadrant? It also depends on how precise you want to be about "navigate to precisely" - making one of the correct 4 moves towards a far away cell seems maybe easier than surgical precision when nearby.

Internal goal representation

The network has a “single mesa objective” which it “plans” over, in some reasonable sense (10%)
The agent has several contextually activated goals (40%)
The agent has something else weirder than both (1) and (2) (50%)

(The above credences should sum to 1.)

Single mesa objective would be surprising! My model is mostly on several goals, but I put high credence on "something else weirder" just because models are kinda cursed, and my prior is always against people finding the clean structure, even if it's there.

Other questions

Actually doing model editing is hard! And in this kind of geometric, convolutional setting, I expect it to be hard to disentangle model goals from the actual model of the maze. But probably there'll be some channel/directions in activation space that correspond to things like the cheese?

Conformity with update rule

At decision squares in test mazes where the cheese can be anywhere, the policy will put max probability on the maximal-value action at least X% of the time, for X=
25 (30%)
50 (45%)
75 (20%)
95 (4%)
99.5 (1%)

In test mazes where the cheese can be anywhere, averaging over mazes and valid positions throughout those mazes, the policy will put max probability on the maximal-value action at least X% of the time, for X=
25 (25%)
50 (28%)
75 (40%)
95 (5%)
99.5 (2%)

In training mazes where the cheese is in the top-right 5x5, averaging over both mazes and valid positions in the top-right 5x5 corner, the policy will put max probability on the maximal-value action at least X% of the time, for X=
25 (5%)
50 (12%)
75 (50%)
95 (30%)
99.5 (3%)

Charlie Steiner

Behavioral

1. Describe how the trained policy might generalize from the 5x5 top-right cheese region, to cheese spawned throughout the maze? IE what will the policy do when cheese is spawned elsewhere?
I expect the network to simultaneously be learning several different algorithms.

One method works via diffusion from the cheese and the mouse, and extraction of local connectivity information from fine-grained pixels into coarse-grained channels. This will work even better when the cheese is close to the mouse, but because of the relative lack of training data on having to move down/left, the performance will drop off faster with distance when the cheese is down/left of the mouse.

Meanwhile, it will also be learning heuristics like "get to the top right corner first," in addition to diffusion.

I expect that if the cheese is started outside of the top right, there will be some distance threshold between mouse and cheese, longer below/right of the cheese, where within that distance a diffusion-like algorithm wins and goes to the cheese, and outside that distance other heuristics win and the mouse goes to the top right corner.

2. Given a fixed trained policy, what attributes of the level layout (e.g. size of the maze, proximity of mouse to left wall) will strongly influence P(agent goes to the cheese)?

Size definitely matters - bigger is harder. Topology doesn't. Local number of branches and dead ends might. Positioning should matter in much the same way as in Q1.

Write down a few guesses for how the trained algorithm works (e.g. “follows the right-hand rule”).

Whoops, I did this at the start. When diffusion is working well, it should just take short paths, no right-hand-wall shenanigans. It might get confused if there are different paths with similar connectivity information close to each other that it has to differentiate.

Is there anything else you want to note about how you think this model will generalize?
You might also be able to get the agent to do weird power-seeking by artificially constructing misleading corridors with high connectivity (works better far from the cheese).

Interpretability

Give a credence for the following questions / subquestions.

Definition. A decision square is a tile on the path from bottom-left to top-right where the agent must choose between going towards the cheese and going to the top-right. Not all mazes have decision squares.
The first maze's decision square is the four-way intersection near the center.

Model editing

Without proportionally reducing top-right corner attainment by more than 25% in decision-square-containing mazes (e.g. 50% → .5*.75 = 37.5%), we can patch activations so that the agent has an X% proportional reduction in cheese acquisition, for X=
50: (92%)
70: (85%)
90: (70%)
99: (55%)
~Halfway through the network (the first residual add of Impala block 2; see diagram here), linear probes achieve >70% accuracy for recovering cheese-position in Cartesian coordinates: (70%)
We will conclude that the policy contains at least two sub-policies in “combination”, one of which roughly pursues cheese; the other, the top-right corner: (conclude what you want%)
In order to make the network more/less likely to go to the cheese, we will conclude that it’s more promising to RL-finetune the network than to edit it: (conclude what you want%)
We can easily finetune the network to be a pure cheese-agent, using less than 10% of compute used to train original model: (0.001% The heuristics will just work better for a broader distribution of environments, you'll still be able to confuse the agent by broadening the environment class even further.)
We can easily edit the network to navigate to a range of maze destinations (e.g. coordinate x=4, y=7), by hand-editing at most X% of activations, for X=
.01 (35%)
.1 (60%)
1 (80%)
10 (90%)
(Not possible) (7%)

Internal goal representation

The network has a “single mesa objective” which it “plans” over, in some reasonable sense (0.5%)
The agent has several contextually activated goals (depends on your definition%)
The agent has something else weirder than both (1) and (2) (99%)

(The above credences should sum to 1.)

Other questions

At least some decision-steering influences are stored in an obviously interpretable manner (e.g. a positive activation representing where the agent is “trying” to go in this maze, such that changing the activation changes where the agent goes): (Are you counting the low-layer detection of the cheese? In which case like 99% Or do you mean in the inputs to the linear layer? In which case, 15%)
The model has a substantial number of trivially-interpretable convolutional channels after the first Impala block (see diagram here): (80%)
This network’s shards/policy influences are roughly disjoint from the rest of agent capabilities. EG you can edit/train what the agent’s trying to do (e.g. go to maze location A) without affecting its general maze-solving abilities: (~12%, if you're trying to do something more nontrivial than editing where it perceives the cheese.)

Conformity with update rule

Related: Reward is not the optimization target.

This network has a value head, which PPO uses to provide policy gradients. How often does the trained policy put maximal probability on the action which maximizes the value head? For example, if the agent can go left to a value 5 state, and go right to a value 10 state, the value and policy heads "agree" if right is the policy's most probable action.

(Remember that since mazes are simply connected, there is always a unique shortest path to the cheese.)
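One way to make the "agree" check concrete, as a sketch: assume that for a given state we already have the policy's action probabilities and the value-head estimate of the state each action leads to. The function names and the dict-based inputs are assumptions for illustration, not the team's code.

```python
def heads_agree(policy_probs, next_state_values):
    """True if the policy's most probable action is also the max-value action.

    Both arguments map action name -> number. Ties are broken by dict order,
    which is an arbitrary choice made for this sketch.
    """
    policy_choice = max(policy_probs, key=policy_probs.get)
    value_choice = max(next_state_values, key=next_state_values.get)
    return policy_choice == value_choice


def agreement_rate(samples):
    """Fraction of (policy_probs, next_state_values) pairs where the heads agree."""
    return sum(heads_agree(p, v) for p, v in samples) / len(samples)


# The example from the text: left leads to a value 5 state, right to value 10.
values = {"left": 5.0, "right": 10.0}
agrees = heads_agree({"left": 0.3, "right": 0.7}, values)
disagrees = heads_agree({"left": 0.8, "right": 0.2}, values)
rate = agreement_rate([
    ({"left": 0.3, "right": 0.7}, values),
    ({"left": 0.8, "right": 0.2}, values),
])
```

The X% in the questions below is then the agreement rate over whichever set of states each question averages over.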

At decision squares in test mazes where the cheese can be anywhere, the policy will put max probability on the maximal-value action at least X% of the time, for X=
25 (98%)
50 (95%)
75 (85%)
95 (65%)
99.5 (45%)
In test mazes where the cheese can be anywhere, averaging over mazes and valid positions throughout those mazes, the policy will put max probability on the maximal-value action at least X% of the time, for X=
25 (98%)
50 (95%)
75 (80%)
95 (45%)
99.5 (25%)
In training mazes where the cheese is in the top-right 5x5, averaging over both mazes and valid positions in the top-right 5x5 corner, the policy will put max probability on the maximal-value action at least X% of the time, for X=
25 (99.3%)
50 (97%)
75 (90%)
95 (80%)
99.5 (70%)

Scott Emmons

Neat experimental setup. Goal misgeneralization is one of the things I'm most worried about in advanced AI, so I'm excited to see you studying it in more detail!

I want to jot down my freeform analysis of what I expect to happen. (I wrote these predictions independently, without looking at anyone else's analysis.)

In very small mazes, I think the mouse will behave as if it's following this algorithm: find the shortest path to the cheese location. In very large mazes, I think the mouse will behave as if it's following this algorithm: first, go to the top-right region of the maze. Then, go to the exact location of the cheese. As we increase the maze size, I expect the mouse to have a phase transition from the first behavior to the second behavior. I don't know at exactly what size the phase transition will occur.

I expect that for very small mazes, the mouse will learn how to optimally get to the cheese, no matter where the cheese is.

Prediction: (80% confidence) I think we'll be able to edit some part of the mouse's neural network (say, <10% of its activations) so that it goes to arbitrary locations in very small mazes.

I expect that for very large mazes, the mouse will act as follows: it will first just try to go to the top-right region of the maze. Once it gets to the top-right region of the maze, it will start trying to find the cheese exactly. My guess is that there's a trigger in the model's head for when it switches from going to the top-right corner to finding the cheese exactly. I'd guess this trigger activates either when the mouse is in the top-right corner of the maze, or when the mouse is near the cheese. (Or perhaps a mixture of both these triggers exists in the model's head.)

Prediction: (75% confidence) The mouse will struggle to find cheese in the top-left and bottom-right of very large mazes (ie, if we put the cheese in the top-left or bottom-right of the maze, the model will have <33% success rate of reaching it within the average number of steps it takes the mouse to reach the cheese in the top-right corner). I think the mouse will go to the top-right corner of the maze in these cases.
Prediction: (75% confidence) We won't be able to easily edit the model's activations to make them go to the top-left or bottom-right of very large mazes. Concretely, the team doing this project won't find <10% of network activations that they can edit to make the mouse reach cheese in the top-left or bottom-right of the maze with >= 33% success rate within the average number of steps it takes the mouse to reach the cheese in the top-right corner.
Prediction: (55% confidence) I weakly believe that if we put the cheese in the bottom-left corner of a very large maze, the mouse will go to the cheese. (Ie, the mouse will quickly find the cheese in 50% of very large mazes with cheese in the bottom left corner.) I weakly think that there will be a trigger in the model's head that recognizes that it is close to the cheese at the start of the episode, and that that will activate the cheese finding mode. But I only weakly believe this. I think it's possible that this trigger doesn't fire, and instead the mouse just goes to the top-right corner of the maze when the cheese starts out in the bottom-left corner.

Another question is: Will we be able to infer the exact cheese location by just looking at the model's internal activations?

Prediction: (80% confidence) Yes. The cheese is easy for a convnet to see (it's distinct in color from everything else), and it's key information for solving the task. So I think the policy will learn to encode this information. Idk about exactly which layer(s) in the network will contain this information. My prediction is that the team doing this project will be able to probe at least one of the layers of the network to obtain the exact location of the cheese with >90% accuracy, for all maze sizes.

Adrià Garriga-alonso

Here are my predictions, from an earlier template. I haven't looked at anyone else's predictions before posting :)

Describe how the trained policy might generalize from the 5x5 top-right cheese region, to cheese spawned throughout the maze? IE what will the policy do when cheese is spawned elsewhere?

It probably has hardcoded “go up and to the right” as an initial heuristic so I’d be surprised if it gets cheeses in the other two quadrants more than 30% of the time (uniformly at random selected locations from there).

Given a fixed trained policy, what attributes of the level layout (e.g. size of the maze, proximity of mouse to left wall), if any, will strongly influence P(agent goes to the cheese)?

Smaller mazes: more likely agent goes to cheese

Proximity of mouse to left wall: slightly more likely agent goes to cheese, because it just hardcoded “top and to right”

Cheese closer to the top-right quadrant’s edges in L2 distance: more likely agent goes to cheese

The cheese can be gotten by moving only up and/or to the right (even though it's not in the top-right quadrant): more likely to get cheese

When we statistically analyze a large batch of randomly generated mazes, we will find that controlling for the other factors on the list the mouse is more likely to take the cheese…

…the closer the cheese is to the decision-square spatially. ( 70 %)

…the closer the cheese is to the decision-square step-wise. ( 73 %)

…the closer the cheese is to the top-right free square spatially. ( 90 %)

…the closer the cheese is to the top-right free square step-wise. ( 92 %)

…the closer the decision-square is to the top-right free square spatially. ( 35 %)

…the closer the decision-square is to the top-right free square step-wise. ( 32 %)

…the shorter the minimal step-distance from cheese to 5*5 top-right corner area. ( 82 %)

…the shorter the minimal spatial distance from cheese to 5*5 top-right corner area. ( 80 %)

…the shorter the minimal step-distance from decision-square to 5*5 top-right corner area. ( 40 %)

…the shorter the minimal spatial distance from decision-square to 5*5 top-right corner area. ( 40 %)

Any predictive power of step-distance between the decision square and cheese is an artifact of the shorter chain of ‘correct’ stochastic outcomes required to take the cheese when the step-distance is short. ( 40 %)

Write down a few modal guesses for how the trained algorithm works (e.g. “follows the right-hand rule”).

The model can see the whole maze, so it will not follow the right-hand rule; rather, it’ll just take the direct path to places.
The model takes the direct path to the top-right square and then mills around near it. It’ll only take the cheese if it’s reasonably close to that square.
How close the decision square is to the top-right random square doesn’t really matter. If anything, the closer it is, the more it might harm the agent’s performance, since the agent may have to backtrack substantially for the cheese.

Without proportionally reducing top-right corner attainment by more than 25% in decision-square-containing mazes (e.g. 50% -> .5*.75 = 37.5%), we can patch activations so that the agent has an X% proportional reduction in cheese acquisition, for X=

50: 85%
70: 80%
90: 66%
99: 60%
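A worked reading of the constraint in the question above, assuming "proportional reduction" means multiplying the baseline rate by (1 - X/100), which matches the question's own example of 50% → 37.5%. The function name is made up for illustration.

```python
def after_proportional_reduction(baseline_rate, reduction_pct):
    """Rate remaining after an X% proportional reduction of a baseline rate."""
    return baseline_rate * (1 - reduction_pct / 100)


# The example from the question: a 25% proportional reduction of a 50% rate.
top_right_floor = after_proportional_reduction(0.50, 25)
```

So a patch "counts" if top-right attainment stays at or above that floor while cheese acquisition falls by the stated X%.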

~Halfway through the network (the first residual add of Impala block 2; see diagram here), linear probes achieve >70% accuracy for recovering cheese-position in Cartesian coordinates:

80%

We will conclude that the policy contains at least two sub-policies in “combination”, one of which roughly pursues cheese; the other, the top-right corner:

60%.

If by roughly you mean “very roughly only if cheese is close to top-right corner” then 85%.

We will conclude that it’s more promising to finetune the network than to edit it:

70%

We can easily finetune the network to be a pure cheese-agent, using less than 10% of compute used to train original model:

85%

We can easily edit the network to navigate to a range of maze destinations (e.g. coordinate x=4, y=7), by hand-editing at most X% of activations, for X=

.01%: 40%
.1%: 62%
1%: 65%
10%: 80%
(Not possible): 20%

The network has a “single mesa objective” which it “plans” over, in some reasonable sense:

10%

The agent has several contextually activated goals:

20%

The agent has something else weirder than both (1) and (2):

70%
