Nora Belrose

I think it would be a distraction to try to figure out if LMs are "phenomenally conscious" for a few different reasons.

  1. I think there are pretty strong reasons to believe that phenomenal consciousness is not actually a substantive property: either everything has it in some sense (panpsychism) or nothing does (eliminativism). Any other view runs into the Hard Problem and the empirical intractability of figuring out which things are or are not phenomenally conscious.
  2. Your proposed tests for phenomenal consciousness seem, in fact, to be testing for access consciousness: basically, the ability to do certain kinds of reflection and introspection. Access consciousness may well be relevant for alignment; it seems closely related to situational awareness. But that's not phenomenal consciousness (because of the Hard Problem). Phenomenal consciousness is causally inert and empirically untestable.
  3. While it would be a problem if LMs were moral patients, I think these concerns are utterly dwarfed by the value we'd lose in an AI-caused existential catastrophe. Also, on the most plausible views of valence, an experience's valence is directly determined by your first-order, in-the-moment preferences about whether to continue having that experience. If valence just reduces to preferences, then we really can just talk about the preferences, which seem more empirically tractable to probe.

I do think consciousness is real and important (I think some form of Russellian monism is probably right). I just don't think it matters for alignment.

This probably doesn't work, but have you thought about just using weight decay as a (partial) solution to this? In any architecture with residual connections, you should expect circuits to manifest as weights with nontrivial magnitude. If some set of weights isn't contributing to the loss, then the gradients won't prevent weight decay from pushing them toward zero. Sort of a "use it or lose it" type thing. This seems a lot simpler and potentially more robust than other approaches.
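
For concreteness, here's a minimal toy sketch of the "use it or lose it" effect (my own illustration, not anything from the original setup): a residual block with one branch whose output never reaches the loss. With decoupled weight decay (AdamW), that branch gets no gradient signal, so its weights just shrink by roughly a factor of (1 - lr * weight_decay) per step, while the branch the loss depends on settles at whatever magnitude the task requires. The module names and hyperparameters here are made up for illustration.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, d=32):
        super().__init__()
        self.used = nn.Linear(d, d)   # this branch affects the loss
        self.dead = nn.Linear(d, d)   # this branch's output is zeroed out below

    def forward(self, x):
        # The "dead" branch receives no gradient because its output is
        # multiplied by zero, so only weight decay acts on its parameters.
        return x + self.used(x) + 0.0 * self.dead(x)

torch.manual_seed(0)
model = ResidualBlock()
# Decoupled weight decay (AdamW) shrinks every weight by lr * weight_decay
# each step, independent of the gradient.
opt = torch.optim.AdamW(model.parameters(), lr=1e-2, weight_decay=0.1)

x = torch.randn(64, 32)
target = torch.randn(64, 32)

for _ in range(3000):
    loss = (model(x) - target).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# The used branch's norm stabilizes where gradient and decay balance; the dead
# branch decays roughly geometrically, by about (1 - lr * wd)^steps ~= 0.05 here.
print("||used.weight||:", model.used.weight.norm().item())
print("||dead.weight||:", model.dead.weight.norm().item())
```

The same logic should apply to any circuit whose contribution to the loss is negligible, though in a real LM "not contributing to the loss" is a matter of degree rather than the exact zero used in this toy.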