AI ALIGNMENT FORUM
A2z

Website: https://allenschmaltz.github.io/

Comments
Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)
A2z · 7mo

I never understood the SAE literature, which came after my earlier work (2019-2020) on sparse inductive biases for feature detection (i.e., semi-supervised decomposition of feature contributions) and on interpretability-by-exemplar via model approximations over the representation space of models, which I originally developed with the goal of bringing deep learning to medicine. Since the parameters of large neural networks are non-identifiable, the mechanisms for interpretability must shift away from understanding individual parameter values, toward semi-supervised matching against comparable instances and, most importantly, toward robust and reliable predictive uncertainty over the output, for which we now have effective approaches: https://www.lesswrong.com/posts/YxzxzCrdinTzu7dEf/the-determinants-of-controllable-agi-1
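As a rough illustration of the matching-against-comparable-instances idea, here is a minimal NumPy sketch (function names and the distance-based uncertainty signal are my own hypothetical choices, not the author's actual method): given hidden representations for labeled training exemplars and for new test inputs, it retrieves each test point's nearest exemplars and treats the distance to the closest one as a crude reliability signal, with far-from-all-exemplars indicating less trustworthy output.

```python
import numpy as np

def exemplar_match(test_reps, train_reps, train_labels, k=3):
    """For each test representation, return the indices and labels of the
    k nearest labeled training exemplars (L2 distance), plus the distance
    to the single nearest exemplar as a crude uncertainty signal."""
    # Pairwise squared L2 distances, shape (n_test, n_train)
    dists = ((test_reps[:, None, :] - train_reps[None, :, :]) ** 2).sum(-1)
    nn_idx = np.argsort(dists, axis=1)[:, :k]      # k nearest exemplar indices
    nn_labels = train_labels[nn_idx]               # labels of those exemplars
    # Distance to the nearest exemplar: large values flag test inputs that
    # are unlike anything in the labeled reference set.
    nn_dist = np.take_along_axis(dists, nn_idx, axis=1)[:, 0]
    return nn_idx, nn_labels, nn_dist

# Toy synthetic representations (stand-ins for a model's hidden states)
train_reps = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
train_labels = np.array([0, 0, 1])
test_reps = np.array([[0.1, 0.1], [5.1, 5.1]])

idx, labs, dist = exemplar_match(test_reps, train_reps, train_labels, k=2)
```

The same retrieval machinery doubles as the interpretability interface: each prediction comes with the concrete training instances that support it, and the distance term gives a first-pass signal for when to distrust the output.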

(That said, the usual caveat applies: people should feel free to study whatever interests them, since you can never predict what side effects, lessons, and new results, including in other areas, might follow.)
