Noa Nabeshima
Matryoshka Sparse Autoencoders
Code: github.com/noanabeshima/matryoshka-saes

Abstract

Sparse autoencoders (SAEs)[1][2] break down neural network internals into components called latents. Smaller SAE latents seem to correspond to more abstract concepts while...
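To make the SAE setup concrete, here is a minimal sketch of a vanilla sparse autoencoder's forward pass and training loss. This is an illustration only, not the post's Matryoshka implementation; the toy dimensions, the ReLU encoder, and the L1 sparsity penalty are assumptions standard in the SAE literature.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_latents = 16, 64  # toy sizes (assumed, not from the post)

# Parameters of a vanilla SAE: an encoder and a decoder.
W_enc = rng.normal(0, 0.1, (n_latents, d_model))
b_enc = np.zeros(n_latents)
W_dec = rng.normal(0, 0.1, (d_model, n_latents))
b_dec = np.zeros(d_model)

def encode(x):
    # Latent activations: ReLU keeps them nonnegative, and training
    # with an L1 penalty pushes most of them to exactly zero.
    return np.maximum(W_enc @ x + b_enc, 0.0)

def decode(a):
    # Reconstruct the model activation from the latent code.
    return W_dec @ a + b_dec

x = rng.normal(size=d_model)   # a model activation vector
a = encode(x)                  # latent code (sparse after training)
x_hat = decode(a)              # reconstruction

recon_loss = np.sum((x - x_hat) ** 2)  # fidelity term
l1_penalty = np.sum(np.abs(a))         # sparsity term
loss = recon_loss + 1e-3 * l1_penalty
```

Each latent (a row of `W_enc` paired with a column of `W_dec`) is the unit the abstract refers to: a direction in activation space that, ideally, corresponds to an interpretable concept.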
Poll: Which variables are most strategically relevant?
Which variables are most important for predicting and influencing how AI goes? Here are some examples:

* Timelines: “When will crazy AI stuff start to happen?”
* Alignment tax: “How much more difficult will it be to create an aligned AI vs an unaligned AI when it becomes possible to...