Emergent modularity and safety

johnswentworth (4y)

"Our default expectation about large neural networks should be that we will understand them in roughly the same ways that we understand biological brains, except where we have specific reasons to think otherwise."

Why would that be our default expectation? We don't have direct access to all of the underlying parameters in the brain. We can't even simulate it yet, let alone take a gradient.

Richard_Ngo (4y)

"Why would that be our default expectation?"

Lots of reasons. Neural networks are modelled after brains. They both form distributed representations at very large scales, they both learn over time, and so on. Sure, you've pointed out a few differences, but the similarities are so great that this should be the main anchor for our expectations (rather than, say, thinking that we'll understand NNs the same way we understand support vector machines, or the same way we understand tree search algorithms, or...).

TurnTrout (4y)

I'm not convinced that these similarities are great enough to merit such anchoring. The fact that NNs have more in common with brains than with SVMs does not imply that we will understand NNs in roughly the same ways that we understand biological brains. We could understand them in a different set of ways than we understand biological brains, and differently than we understand SVMs.

Rather than arguing over reference class, it seems like it would make more sense to note the specific ways in which NNs are similar to brains, and what hints those specific similarities provide.

johnswentworth (4y)

Perhaps a good way to summarize all this is something like "qualitatively similar models probably work well for brains and neural networks". I agree to a large extent with that claim (though there was a time when I would have agreed much less), and I think that's the main thing you need for the rest of the post.

"Ways we understand" comes across as more general than that - e.g. we understand brains by experimentally probing physical neurons, whereas we might understand neural networks by spectral clustering of a derivative matrix.

Daniel_Eth (4y)

The statement seems almost tautological – couldn't we somewhat similarly claim that we'll understand NNs in roughly the same ways that we understand houses, except where we have reasons to think otherwise? The "except where we have reasons to think otherwise" bit seems to be doing a lot of work.

Richard_Ngo (4y)

Compare: when trying to predict events, you should use their base rate except when you have specific updates to it.

Similarly, I claim, our beliefs about brains should be the main reference for our beliefs about neural networks, which we can then update from.

I agree that the phrasing could be better; any suggestions?

johnswentworth (4y)

"I agree that the phrasing could be better; any suggestions?"

I actually think you could just drop that intro altogether, or move it later into the post. We do have pretty good evidence of modularity in the brain (as well as other biological systems) and in trained neural nets; it seems to be a pretty common property of large systems "evolved" by local optimization. And the rest of the post (as well as some of the other comments) does a good job of talking about some of that evidence. It's a good post, and I think the arguments later in the post are stronger than that opening.

(On the other hand, if you're opening with it because that was your own main prior, then that makes sense. In that case, maybe note that it was a prior for you, but that the evidence from other directions is strong enough that we don't need to rely much on that prior?)

Richard_Ngo (4y)

Thanks, that's helpful. I do think there's a weak version of this which is an important background assumption for the post (e.g. without that assumption I'd need to explain the specific ways in which ANNs and BNNs are similar), so I've now edited the opening lines to convey that weak version instead. (I still believe the original version but agree that it's not worth defending here.)

Daniel_Eth (4y)

Yeah, I'm not trying to say that the point is invalid, just that the phrasing may give the point more appeal than is warranted, since it is somewhat in the direction of a deepity. Hmm, I'm not sure what better phrasing would be.