AI ALIGNMENT FORUM

CallumMcDougall

Sequences

Monthly Algorithmic Problems in Mech Interp

Comments

Examples of practical implications of Judea Pearl's Causality work
CallumMcDougall · 3y

Probably the best explanation of this comes from John Wentworth's recent AXRP podcast and a few of his LW posts. To put it simply, modularity matters because modular systems are usually much more interpretable. Case in point: evolution has produced highly modular designs (e.g. organs and organ systems), whereas genetic algorithms for electronic circuit design frequently fail to find modular designs, which makes the results really hard for humans to interpret or to verify that they'll work as expected. If we understood more about the factors that select for modularity across a wide range of settings (e.g. evolutionary selection, or standard ML training), we might be able to use those factors to encourage more modular designs. On a more abstract level, it might help us break down fuzzy statements like "certain types of inner optimisers have separate world models and models of the objective", which are really statements about modules within a system. But in order to do any of this, we need a robust measure of modularity, and at present there basically isn't one.
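
For concreteness, here's a minimal sketch of the kind of naive measure that does exist: treat a weight matrix as a weighted graph and compute Newman's modularity Q over a community partition (via networkx). The toy weight matrix here is made up purely for illustration, and this off-the-shelf graph-theoretic Q is exactly the sort of measure that doesn't obviously transfer robustly to neural networks:

```python
import numpy as np
import networkx as nx
from networkx.algorithms import community

rng = np.random.default_rng(0)

# Hypothetical weight matrix: two dense blocks (candidate "modules")
# with only weak connections between them.
W = np.abs(rng.normal(size=(8, 8))) * 0.05
W[:4, :4] += 1.0   # module 1
W[4:, 4:] += 1.0   # module 2
W = (W + W.T) / 2  # symmetrise so the graph is undirected
np.fill_diagonal(W, 0.0)

G = nx.from_numpy_array(W)
parts = community.greedy_modularity_communities(G, weight="weight")
Q = community.modularity(G, parts, weight="weight")
print(f"Newman modularity Q = {Q:.3f}")  # higher => more modular partition
```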

Examples of practical implications of Judea Pearl's Causality work
Answer by CallumMcDougall · Jul 02, 2022

This may not exactly answer the question, but I'm in a research group which is studying selection for modularity, and yesterday we published our fourth post, which discusses the importance of causality in developing a modularity metric.

TL;DR: if you want to measure the information exchanged between parts of a network, you can't just observe activations, because two completely separate tracks of the network computing the same thing will still have high mutual information, even though they're not communicating with each other (the input is a confounder for both). Instead, it seems like you'll need do-calculus and counterfactuals.

We haven't actually started testing our measure yet, so this is currently only at the theorising stage; hence it may not be a very satisfying answer to the question.
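
To make the confounder point concrete, here's a toy sketch (my own construction, not code from our posts): two tracks compute the same function of a shared input, so they look strongly coupled observationally, but a crude intervention on one of them shows there's no causal channel. Correlation stands in as a rough proxy for mutual information just to keep the example self-contained:

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared input X is a confounder for tracks A and B, which compute the
# same function of X but never communicate with each other.
x = rng.normal(size=100_000)
a = np.tanh(2 * x) + 0.01 * rng.normal(size=x.size)  # track A
b = np.tanh(2 * x) + 0.01 * rng.normal(size=x.size)  # track B

# Observationally, A and B look tightly coupled.
print(f"corr(A, B)     = {np.corrcoef(a, b)[0, 1]:.3f}")  # close to 1

# Crude intervention do(A := a'): clamp A to values resampled
# independently of X. B's mechanism is untouched, so the apparent
# coupling vanishes, revealing that A never influenced B.
a_do = rng.permutation(a)
print(f"corr(do(A), B) = {np.corrcoef(a_do, b)[0, 1]:.3f}")  # close to 0
```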

Posts

Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2) (58 points, 3mo, 6 comments)
SAEBench: A Comprehensive Benchmark for Sparse Autoencoders (42 points, 7mo, 1 comment)
A Selection of Randomly Selected SAE Features (41 points, 1y, 0 comments)
SAE-VIS: Announcement Post (30 points, 1y, 0 comments)
Mech Interp Challenge: January - Deciphering the Caesar Cipher Model (10 points, 2y, 0 comments)
Interpretability with Sparse Autoencoders (Colab exercises) (32 points, 2y, 0 comments)
Mech Interp Challenge: November - Deciphering the Cumulative Sum Model (6 points, 2y, 0 comments)
[Paper] All's Fair In Love And Love: Copy Suppression in GPT-2 Small (38 points, 2y, 0 comments)
Mech Interp Challenge: October - Deciphering the Sorted List Model (7 points, 2y, 0 comments)
Mech Interp Challenge: September - Deciphering the Addition Model (14 points, 2y, 0 comments)
Wikitag Contributions

Modularity (3y, +1133)
Modularity (3y)