Agentic Interpretability: A Strategy Against Gradual Disempowerment
Authors: Been Kim, John Hewitt, Neel Nanda, Noah Fiedel, Oyvind Tafjord

Full paper on arXiv

We propose a research direction called agentic interpretability. The idea of agentic interpretability stems from the observation that AI systems are becoming increasingly adept at communicating with us, verbalizing their thoughts, and providing explanations, raising...
Jun 17, 2025