Mechanistic Transparency for Machine Learning

Wei Dai (6y)

In programming, it's often easier to write new code from scratch than to try to understand someone else's code, especially if the other person's code is optimized for something other than human-understandability. See here for an example, where I wrote:

Many of the algorithms and tables used here came from the deflate implementation by Jean-loup Gailly, which was included in Crypto++ 4.0 and earlier. I completely rewrote it in order to fix a bug that I could not figure out. This code is less clever, but hopefully more understandable and maintainable.

Since human-understandability is costly to evaluate (and hence to train for), and also costly in terms of lower performance on other metrics (note that the code I wrote to be more understandable is significantly slower than the original code), I have strong doubts about this line of research.

My guess is that if you took a human-level AGI that was the result of something like deep learning optimizing only for capability (and not understandability), and tried to interpret it as pseudocode, you'd end up with so many modules with so many interactions between them that no human or team of humans could understand it. In other words, you'd end up with spaghetti code written by a superintelligence (meaning the training process).

If you instead tried to optimize for both capability and understandability at the same time, you would have a much harder ML problem on your hands, maybe even an impossible one.

Perhaps if an AGI is built out of modules that are separately trained, instead of being trained end-to-end, you could use this idea on some of the smaller modules that are especially important to safety. I'm curious if that's the kind of plan you have in mind, or if you're more ambitious about this approach.

DanielFilan (6y)

This response is rather late, but basically my hope is that it's possible to optimise for understandability by regularising for some relatively simple quantity that induces understandability.
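As an illustration of what "regularising for some relatively simple quantity" could look like, here is a minimal sketch. It is not from the comment itself: the choice of L1 sparsity as the quantity, the toy linear model, and every name below (`task_loss`, `sparsity_penalty`, `regularised_loss`, `strength`) are assumptions made purely for the example. Whether any such proxy actually induces understandability is exactly the open question.

```python
# Illustrative sketch only. L1 sparsity is an ASSUMED stand-in for the
# "relatively simple quantity"; the comment does not say what it would be.

def task_loss(weights, data):
    """Mean squared error of a toy linear model (the 'capability' objective)."""
    total = 0.0
    for features, target in data:
        prediction = sum(w * x for w, x in zip(weights, features))
        total += (prediction - target) ** 2
    return total / len(data)


def sparsity_penalty(weights):
    """L1 norm of the weights: a hypothetical proxy for understandability."""
    return sum(abs(w) for w in weights)


def regularised_loss(weights, data, strength):
    """Combined objective: task loss plus a weighted proxy penalty."""
    return task_loss(weights, data) + strength * sparsity_penalty(weights)


# Toy data in which the second feature is always twice the first,
# so several different weight vectors fit the targets exactly.
data = [((1.0, 2.0), 2.0), ((2.0, 4.0), 4.0)]
dense = [1.0, 0.5]   # spreads the computation over both inputs
sparse = [0.0, 1.0]  # relies on a single input

for name, weights in (("dense", dense), ("sparse", sparse)):
    print(name, task_loss(weights, data), regularised_loss(weights, data, 0.1))
```

Both candidates have zero task loss, so the task objective alone cannot tell them apart; the added penalty breaks the tie in favour of the sparser model. The hope described above is that a term playing this role could be found for understandability itself.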

"Perhaps if an AGI is built out of modules that are separately trained, instead of being trained end-to-end, you could use this idea on some of the smaller modules that are especially important to safety. I'm curious if that's the kind of plan you have in mind, or if you're more ambitious about this approach."

I'm more ambitious, and fear that that might not work: either you train a bunch of 'small' things that do very concrete tasks, and aren't quite sure how to combine them to create AGI (or you have to combine a huge number of them and hope that errors don't cascade), or you train a few large ones that do big, complicated tasks that themselves are hard to interpret. That being said, the first branch would satisfy my desiderata for the approach, and I'd hope some people are working on it.

Max Kanwal (7y)

I see two major challenges (one of which leans heavily on progress in linguistics). I can see there being mathematical theory to guide candidate model decompositions (Challenge 1), but I imagine that linking up a potential model decomposition to a theory of 'semantic interpretability' (Challenge 2) is equally hard, if not harder.

Any ideas on how you plan to address Challenge 2? Maybe the most robust approach would involve active learning of the pseudocode, where a human guides the algorithm in its decomposition and labeling of each abstract computation.

DanielFilan (7y)

Thoughts on challenge 2:

'Smaller' functions will probably be more human-interpretable, just because they do less, are easier to analyse, and have less weird stuff going on. I think that this implies that as you 'double-click' on more high-level primitives, they get more and more interpretable.
It's plausible to me that there's some mathematical theory of how to get things that are human-interpretable enough for our purposes.
It's also plausible to me that by trying enough things, you find a method that seems sort of human-interpretable, see what properties it actually has, and check if you can use those.
There might be synergies with interpretability techniques like neuron visualisation that give you a sense of the input-output behaviour without telling you much about the internal mechanisms.
If a neural network is well-trained, it's easier to visualise what each neuron does, because intuitively they need to do sensible things for the outputs to be sensible. You could hope that a similar property for high-level primitives obtains if those primitives are constructed sensibly out of neurons.

Nisan (7y)

"Programmatically Interpretable Reinforcement Learning" (Verma et al.) seems related. It would be great to see modular, understandable glosses of neural networks.
