Simplicio: Hey I’ve got an alignment research idea to run by you.

Me: … guess we’re doing this again.

Simplicio: Interpretability work on trained nets is hard, right? So instead of that, what if we pick an architecture and/or training objective to produce interpretable nets right from the get-go?

Me: If we had the textbook of the future on hand, then maybe. But in practice, you’re planning to use some particular architecture and/or objective which will not work.

Simplicio: That sounds like an empirical question! We can’t know whether it works until we try it. And I haven’t thought of any reason it would fail.

Me: Ok, let’s get concrete here. What architecture and/or objective did you have in mind?

Simplicio: Decision trees! They’re highly interpretable, and my decision theory textbook says they’re fully general in principle. So let’s just make a net tree-shaped, and train that! Or, if that’s not quite general enough, we train a bunch of tree-shaped nets as “experts” and then mix them somehow.

Me: Turns out we’ve tried that one! It’s called a random forest, and it was all the rage back in the 2000s.

Simplicio: So we just go back to that?

Me: Alas, random forests were both a mess internally and unimpressive performers by today’s standards. I mean, if you dug around in there you could often find some interpretable structure, but the interpretability problem for random forests was qualitatively similar to the interpretability problem for today’s nets; you couldn’t just glance at them and read off what was going on or what the internal pieces represented.

Simplicio: Ok, so what if we tweak the idea to -

Me: Still not going to work.

Simplicio: You don’t even know what I was going to say!

Me: Suppose someone attempts to draw a map of New York City, and gets it completely wrong. Then they say “What if we tweak it?” and tweak it without actually looking at New York City or any actual map of New York City. What actually happens, if someone does that, is that the map will still be completely wrong.

Simplicio: I take it this is supposed to be an analogy for something I’m doing?

Me: Yes. The fundamental problem is that the real world has a preferred ontology, and you are trying to guess that ontology (or worse, imagining you can choose it), without doing the work to discover it. The real world (or a net trained on it) does not cleanly factor as a decision tree, it does no...

"Why Not Just..."
