Polysemantic neurons (neurons that activate for a set of unrelated features) have been seen as a significant obstacle towards interpretability of task-optimized deep networks,^[1] with implications for AI safety.

The classic origin story of polysemanticity is that the data contains more "features" than there are neurons, such that learning to solve a task forces the network to allocate multiple unrelated features to the same neuron, threatening our ability to understand the network's internal processing.

In this work, we present a second and non-mutually exclusive origin story of polysemanticity. We show that polysemanticity can arise incidentally, even when there are ample neurons to represent all features in the data, using a combination of...

(Continue Reading - 3033 more words)