Intuitively, it seems to me that a not-that-powerful AI could do a really good job of interpreting other neural nets, given some sort of human feedback on how "easy to understand" its explanations are. I would like to hear why this is right or wrong.
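
To make the kind of feedback I'm imagining more concrete, here is a minimal sketch (in PyTorch) of a reward model trained on human pairwise preferences for which of two candidate explanations is easier to understand, roughly in the style of RLHF reward modeling; an interpreter AI could then be optimized against this score. Everything below, from the embedding size to the class name to the random stand-in data, is a hypothetical illustration under the assumption that explanations have already been pooled into fixed-size embeddings, not a working interpretability setup.

```python
# Minimal sketch (all names, sizes, and data are hypothetical): a reward model
# trained on human pairwise preferences over explanations of another network's
# behaviour, scoring how easy each explanation is to understand.

import torch
import torch.nn as nn

EMB_DIM = 64  # hypothetical size of a pooled explanation embedding


class UnderstandabilityRewardModel(nn.Module):
    """Scores how easy-to-understand an explanation embedding is."""

    def __init__(self, emb_dim: int = EMB_DIM):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(emb_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, explanation_emb: torch.Tensor) -> torch.Tensor:
        return self.score(explanation_emb).squeeze(-1)


def preference_loss(r_preferred: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss: the human-preferred explanation should score higher."""
    return -torch.nn.functional.logsigmoid(r_preferred - r_rejected).mean()


# Toy training loop on random embeddings standing in for real pairs of
# explanations that a human rater compared for clarity.
model = UnderstandabilityRewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(100):
    preferred = torch.randn(32, EMB_DIM)  # explanations the rater found clearer
    rejected = torch.randn(32, EMB_DIM)   # explanations the rater found less clear
    loss = preference_loss(model(preferred), model(rejected))
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The pairwise-comparison loss is just the standard setup for preference learning; a signal that also checked whether explanations are accurate, not merely clear, would need a different objective, and whether that gap matters is part of what I'd like to hear about.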

Can I interest you in reading the ~117-page ELK report? It really might answer your question.

Ultimately, how hard this is depends on how high your standards are. If you want explanations that hold up in new and weird contexts, or when humans can't figure out how to check the accuracy of the explanations being produced, or when the AI being interpreted is adversarially trying to hide its motivations, the problem can get pretty hard.