I agree with much of this, but I suspect people aren't sticking with activation-based interpretability only because the dimensionality of weight-based interpretability is intimidating. Rather, I think we have to be doing activation-based interpretability if we want an analysis of the model's behavior to contain semantics that are safety-relevant.
For example, I can know nothing about how safe a classifier that distinguishes A from B is, no matter how much I know about its weights, unless I know what A and B are. There might be identical...