If I understand correctly, inactive circuits mistakenly activating is the main failure mode - once this happens, things go downhill quickly. So the bottleneck is robustly knowing which circuits should be active.
Could we use O(d) redundant On-indicators per small circuit, instead of just 1, and apply the 2-ReLU trick to their average to increase resistance to noise?
In the Section 5 scenario, would it help to use additional neurons to encode the network's best guess at which circuits are active early on, before noise accumulates, and then preserve that guess across layers? You could do something like track the circuits with the most activation mass associated with them in the first layer of your constructed network (though this would need guarantees like circuits being relatively homogeneous in the norm of their activations).
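To make that concrete, here is a rough sketch of the first-layer tracking step (my own toy code, not the post's construction; `readouts` stands in for whatever per-circuit readout the embedding scheme provides, and `z` is the Section 5.2 bound on simultaneously active circuits):

```python
import numpy as np

def guess_active_circuits(h1, readouts, z):
    """Guess which circuits are active from the first-layer activations.

    h1:       (D,) activations of the large network's first layer
    readouts: (T, d, D) hypothetical per-circuit readout matrices mapping
              large-network activations back to each small circuit's d neurons
    z:        assumed upper bound on simultaneously active circuits
    """
    # Activation mass attributed to each circuit: norm of its decoded activations.
    mass = np.array([np.linalg.norm(R @ h1) for R in readouts])
    # Keep the z circuits with the most mass; treat everything else as off
    # for the rest of the forward pass.
    return np.argsort(mass)[-z:]
```

The homogeneity caveat is doing real work here: a plain top-z on norms will systematically favour circuits that simply run hotter, so the per-circuit masses need to be comparable whenever circuits are genuinely active.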
The reason I think this kind of early gating is a reasonable idea is that LLMs do seem to compute binary indicators of when to use circuits, separate from the circuits themselves. Ferrando et al. found that models have features for whether they recognize an entity, and these gate fact lookup. In GPT2-Small, the entity recognition circuit is just a single first-layer neuron, while the fact-lookup circuit it controls is presumably very complex (GDM couldn't reverse-engineer it). This suggests networks naturally learn simple early gating for complex downstream computations.
To clarify where I'm coming from, I read the previous post as showing that, assuming inactive circuits never activate due to superposition noise, we could get T = O(D²/d²). This post shows that inactive circuits do erroneously activate in practice, violating that assumption. I'm curious what asymptotics are possible if we drop that assumption and force ourselves to design the network to prevent such erroneous activations. I may be misinterpreting things, though.
Agnostic to the method for embedding the small circuits in the larger network, currently only 1 out of the d neurons in each small network is allocated to storing whether that network is on or off. I'm suggesting increasing this to cd neurons for some small fixed c, growing the small networks to (1+c)d neurons. In the rotation example d is so small that this doesn't really make sense; I'm just thinking asymptotically. It should also generalise straightforwardly to the "cross-circuit" computation case.
The idea is that while the cd indicator neurons would all be identical within the small network, once embedded in the larger network, the noise each small-network neuron (distributed across S neurons of the large network) receives is hopefully independent.
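As a quick sanity check on the variance argument, here's a toy simulation (entirely my own, with made-up numbers): cd redundant copies of an on-indicator, each receiving independent noise as hoped above, averaged and then cleaned up. I'm reading the "2-ReLU trick" as a soft-threshold gadget built from two ReLUs; if the post means a different gadget, the averaging point should still stand.

```python
import numpy as np

rng = np.random.default_rng(0)

def soft_threshold(x, eps):
    # Two-ReLU gadget (my reading of the trick): zeroes out values with
    # |x| <= eps and passes larger values through, shrunk by eps.
    return np.maximum(x - eps, 0) - np.maximum(-x - eps, 0)

d, c = 100, 0.2
k = int(c * d)        # number of redundant on-indicator copies
sigma = 0.3           # per-copy superposition noise, assumed independent
trials = 10_000

for truly_on in (True, False):
    clean = 1.0 if truly_on else 0.0
    copies = clean + sigma * rng.standard_normal((trials, k))
    single = soft_threshold(copies[:, 0], eps=0.5)         # 1 indicator neuron
    pooled = soft_threshold(copies.mean(axis=1), eps=0.5)  # average of k copies
    print(f"truly_on={truly_on}: "
          f"single misfire rate={np.mean((single > 0) != truly_on):.4f}, "
          f"pooled misfire rate={np.mean((pooled > 0) != truly_on):.4f}")
```

With independent per-copy noise, the pooled indicator's noise standard deviation shrinks by a factor of √(cd), which is the gain the redundancy is meant to buy; if the embedding makes the noise correlated across copies, the gain degrades accordingly.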
This method also works under the assumptions specified in Section 5.2, right? Under the Section 5.2 assumptions, it suffices to encode the circuits that are active on the first layer, of which there are at most z. Even if you erroneously believe one of those z circuits is still active on a later layer after it has turned off, the gain comes from eliminating the other T−z inactive circuits. If the on-indicators don't seize, then you can stop any part of the circuit from seizing in the Section 5.2 scenario.
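A rough back-of-envelope for why eliminating the T−z inactive circuits is where the gain is (a toy model of my own, assuming roughly independent random embedding directions, which may not match the post's construction exactly): interference on any fixed readout direction grows like the square root of the number of circuits that are allowed to write at all.

```python
import numpy as np

rng = np.random.default_rng(1)
D, T, z = 1_000, 5_000, 30   # large width, total circuits, active-circuit bound
# Toy model: each circuit writes along a single random D-dimensional direction.
dirs = rng.standard_normal((T, D)) / np.sqrt(D)

def interference_on_circuit_0(n_writing):
    # Circuits 1..n_writing-1 each write a unit activation along their direction;
    # measure how much of that leaks into circuit 0's readout direction.
    return abs(dirs[0] @ dirs[1:n_writing].sum(axis=0))

print("all T circuits writing:", interference_on_circuit_0(T))   # ~ sqrt(T/D)
print("only z circuits writing:", interference_on_circuit_0(z))  # ~ sqrt(z/D)
```

So one lingering circuit costs a single extra interference term, whereas failing to gate the other T−z circuits costs something like √(T/D) rather than √(z/D).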
I agree that shared state / cross-circuit computation is an important thing to model, though. I guess that's what you mean by "more generally"? In which case I misunderstood the post completely: I thought it was saying that the construction from the previous post runs into problems in practice, but it seems like you're just saying that if we want this to work more generally, there are issues?
---------------------------------------
This series of posts is really useful, thank you! I have been thinking about it a lot for the past couple of days.