jake_mendel — AI Alignment Forum

Research directions Open Phil wants to fund in technical AI safety

Open Philanthropy has just launched a large new Request for Proposals for technical AI safety research. Here we're sharing a reference guide, created as part of that RFP, which describes what projects we'd like to see across 21 research directions in technical AI safety. This guide provides an opinionated...

Feb 8, 2025 · 96

Open Philanthropy Technical AI Safety RFP - $40M Available Across 21 Research Areas

Open Philanthropy is launching a big new Request for Proposals for technical AI safety research, with plans to fund roughly $40M in grants over the next 5 months, and available funding for substantially more depending on application quality. Applications (here) start with a simple 300-word expression of interest and...

Feb 6, 2025 · 111

Attribution-based parameter decomposition

by Lucius Bushnaq, Dan Braun, StefanHex, jake_mendel, and Lee Sharkey

This is a linkpost for Apollo Research's new interpretability paper: "Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-based Parameter Decomposition". We introduce a new method for directly decomposing neural network parameters into mechanistic components. Motivation: At Apollo, we've spent a lot of time thinking about how the computations...

Jan 25, 2025 · 109

Circuits in Superposition: Compressing many small neural networks into one

by Lucius Bushnaq and jake_mendel

Tl;dr: We generalize the mathematical framework for computation in superposition from compressing many boolean logic gates into a neural network, to compressing many small neural networks into a larger neural network. The number of small networks we can fit into the large network depends on the small networks' total parameter...

Oct 14, 2024 · 131

jake_mendel's Shortform

Sep 19, 2024 · 5

[Interim research report] Activation plateaus & sensitive directions in GPT2

by StefanHex and jake_mendel

This part-report / part-proposal describes ongoing research, but I'd like to share early results for feedback. I am especially interested in any comment finding mistakes or trivial explanations for these results. I will work on this proposal with a LASR Labs team over the next 3 months. If you are...

Jul 5, 2024 · 66

SAE feature geometry is outside the superposition hypothesis

Written at Apollo Research. Summary: Superposition-based interpretations of neural network activation spaces are incomplete. The specific locations of feature vectors contain crucial structural information beyond superposition, as seen in circular arrangements of day-of-the-week features and in the rich structures of feature UMAPs. We don’t currently have good concepts for talking...

Jun 24, 2024 · 229

Jake Mendel

Toward A Mathematical Framework for Computation in Superposition
