Probably the best explanation of this comes from John Wentworth's recent AXRP podcast, and a few of his LW posts. Put simply, modularity matters because modular systems are usually much more interpretable. Case in point: evolution has produced highly modular designs (e.g. organs and organ systems), whereas genetic algorithms for electronic circuit design frequently fail to find modular designs, which makes their output really hard for humans to interpret and to verify that it will work as expected. If we understood more about the factors that select for modularity across a wide range of settings (e.g. evolutionary selection, or standard ML selection), we might be able to use those factors to encourage more modular designs. On a more abstract level, it might help us break down fuzzy statements like "certain types of inner optimisers have separate world models and models of the objective", which are really statements about modules within a system. But to do any of this we need a robust measure of modularity, and at present there basically isn't one.

Answer by CallumMcDougall

This may not exactly answer the question, but I'm in a research group which is studying selection for modularity, and yesterday we published our fourth post, which discusses the importance of causality in developing a modularity metric.

TL;DR - if you want to measure the information exchanged within a network, you can't just observe activations. Two completely separate tracks of the network that measure the same thing will still have high mutual information even though they're not communicating with each other, because the input is a confounder for both of them. Instead, it seems like you'll need to use do-calculus and counterfactuals.
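To make the confounding point concrete, here is a toy sketch (not our actual measure, which is still untested). It assumes two hypothetical tracks, `track_a` and `track_b`, that each read the same scalar input and never read each other, and it uses a Gaussian approximation to mutual information computed from the correlation coefficient. All names and coefficients are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

def track_a(x, rng):
    # Hypothetical track A: depends only on the input x, plus small noise.
    return 2.0 * x + 0.1 * rng.standard_normal(x.shape)

def track_b(x, rng):
    # Hypothetical track B: also depends only on x. It never reads track A.
    return -3.0 * x + 0.1 * rng.standard_normal(x.shape)

def gaussian_mi(u, v):
    # Mutual information in nats under a joint-Gaussian assumption:
    # I(U; V) = -0.5 * log(1 - rho^2), with rho the Pearson correlation.
    rho = np.corrcoef(u, v)[0, 1]
    return -0.5 * np.log(1.0 - rho**2)

# Observational setting: just record activations on shared inputs.
x = rng.standard_normal(n)
a_obs = track_a(x, rng)
b_obs = track_b(x, rng)
mi_observed = gaussian_mi(a_obs, b_obs)

# Interventional setting: do(A := a_forced), where a_forced is drawn
# independently of x. B is recomputed from the same inputs; since B never
# reads A, forcing A cannot change it.
a_forced = rng.standard_normal(n)
b_after = track_b(x, rng)
mi_intervened = gaussian_mi(a_forced, b_after)

print(f"observed MI:   {mi_observed:.3f} nats")
print(f"intervened MI: {mi_intervened:.5f} nats")
```

The observed figure is large purely because both tracks are driven by `x`. Once A is set by intervention, the dependence between A and B drops to roughly zero, which is what "not communicating" should look like. A real network would need interventions on actual activations and a less crude estimator than the Gaussian formula.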

We haven't actually started testing our measure yet, so this is currently only at the theorising stage and may not be a very satisfying answer to the question.