Epistemic Status
Unsure[1], partially noticing my own confusion. Hoping Cunningham's Law can help resolve it.
Related Answer
Confusions About Arguments From Expected Utility Maximisation
Some MIRI people (e.g. Rob Bensinger) still highlight expected utility (EU) maximisers as the paradigm case of existentially dangerous AI systems. I'm confused by this for a few reasons:
1. Not all consequentialist/goal-directed systems are expected utility maximisers
    * E.g. humans
2. Some recent developments make me sceptical that VNM expected utility maximisation is a natural form for generally intelligent systems:
    1. Wentworth's subagents provide a model for inexploitable agents that don't maximise a simple unitary utility function (a toy sketch follows this list)
        1. The main requirement for subagents to be a better model than unitary agents is path-dependent preferences or hidden state variables
        2. Alternatively, subagents natively admit partial orders over preferences
            1. If I'm not mistaken, utility functions seem to require a (static) total order over preferences; this is the VNM completeness axiom (stated below the list)
                1. This might be a very unreasonable ask; it does not seem to describe humans, animals, or even existing sophisticated AI systems
        3. I think the strongest implication of Wentworth's subagents is that expected utility maximisation is not the limit or idealised form of agency
    2. Shard Theory suggests that agents trained via reinforcement learning[2] form value "shards" (a second toy sketch follows this list)
        1. Values are inherently "contextual influences on decision making"
            1. Hence agents do not have a static total order over preferences (which is what a utility function implies), since which preferences are active depends on the context
                1. Preferences are dynamic (they change over time), and their ordering is not necessarily total
            2. This explains many of the observed inconsistencies in human decision making
        2. A multitude of value shards does not admit analysis as a simple unitary utility function
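
To make the subagents point concrete, here is a minimal sketch of the kind of agent I have in mind. This is my own toy construction, not Wentworth's formalism: a committee that only accepts trades no member objects to. Its preferences form a partial order (outcomes the members rank oppositely are incomparable), yet it cannot be money-pumped, because every accepted trade is a weak Pareto improvement for the committee.

```python
class Subagent:
    """One member of the committee, with its own utility over outcomes."""
    def __init__(self, name, utility):
        self.name = name
        self.utility = utility  # dict: outcome -> utility

class Committee:
    """Accepts a trade only if no member objects.

    If members disagree about two outcomes, the outcomes are simply
    incomparable and the committee stands pat. Every accepted trade is a
    weak Pareto improvement, so no cycle of accepted trades can end
    strictly worse than it started: inexploitable, with no unitary
    utility function in sight.
    """
    def __init__(self, members):
        self.members = members

    def prefers(self, a, b):
        # a > b in the committee's partial order: every member weakly
        # prefers a, and at least one strictly prefers it.
        weak = all(m.utility[a] >= m.utility[b] for m in self.members)
        strict = any(m.utility[a] > m.utility[b] for m in self.members)
        return weak and strict

    def accepts_trade(self, current, offered):
        return self.prefers(offered, current)

# Hypothetical preferences: the members rank B and C oppositely.
committee = Committee([
    Subagent("subagent_1", {"A": 0, "B": 2, "C": 1}),
    Subagent("subagent_2", {"A": 0, "B": 1, "C": 2}),
])

print(committee.accepts_trade("A", "B"))  # True: unanimous improvement
print(committee.accepts_trade("B", "C"))  # False: B and C are incomparable
print(committee.accepts_trade("C", "B"))  # False: so no money-pump cycle exists
```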
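
And the total-order requirement mentioned above, stated explicitly. The VNM completeness axiom demands that any two lotteries be comparable; the payoff of the four axioms together is a single utility function representing preferences (notation mine):

```latex
% Completeness: every pair of lotteries is comparable (a total preorder).
\text{Completeness: } \forall A, B \in \mathcal{L}: \quad A \succeq B \ \lor \ B \succeq A

% Given completeness, transitivity, continuity, and independence,
% there exists a u such that preference = expected-utility comparison:
\exists\, u : \quad A \succeq B \iff \mathbb{E}_A[u] \ge \mathbb{E}_B[u]
```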
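
Finally, a toy illustration of the shard picture. This is my own sketch, not Shard Theory's actual claims about network internals: shards activate in some contexts and not others, so the action ordering the agent reveals shifts with context, and no single static total order is being maximised.

```python
class Shard:
    """A contextual value: bids on actions only in contexts that activate it."""
    def __init__(self, name, trigger, bids):
        self.name = name
        self.trigger = trigger  # context -> bool: is this shard active?
        self.bids = bids        # dict: action -> bid strength

    def bid(self, context, action):
        return self.bids.get(action, 0.0) if self.trigger(context) else 0.0

def choose(shards, context, actions):
    """Pick the action with the highest total bid from the active shards."""
    return max(actions, key=lambda a: sum(s.bid(context, a) for s in shards))

# Hypothetical shards: the revealed preference between the two actions
# flips with context, so there is no context-free total order.
shards = [
    Shard("sweet_tooth", lambda ctx: ctx["saw_cake"], {"eat_cake": 2.0}),
    Shard("health", lambda ctx: ctx["on_diet"], {"eat_salad": 1.5, "eat_cake": -1.0}),
]

actions = ["eat_cake", "eat_salad"]
print(choose(shards, {"saw_cake": True, "on_diet": False}, actions))  # eat_cake
print(choose(shards, {"saw_cake": True, "on_diet": True}, actions))   # eat_salad
```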