Adam Karvonen*, Can Rager*, Johnny Lin*, Curt Tigges*, Joseph Bloom*, David Chanin, Yeu-Tong Lau, Eoin Farrell, Arthur Conmy, Callum McDougall, Kola Ayonrinde, Matthew Wearden, Samuel Marks, Neel Nanda (*equal contribution)

TL;DR
* We are releasing SAE Bench, a suite of 8 diverse sparse autoencoder (SAE) evaluations, including unsupervised metrics and...
Work performed as part of Neel Nanda's MATS 6.0 (Summer 2024) training program.

TL;DR: This is an interim report on reverse-engineering Othello-GPT, an 8-layer transformer trained to take sequences of Othello moves and predict legal moves. We find evidence that Othello-GPT learns to compute the board state using many...
Please check out our notebook to recreate the figures and to examine your own model for clean-up behavior. Produced as part of ARENA 2.0 and the SERI ML Alignment Theory Scholars Program - Spring 2023 Cohort.

Fig 5: Correlation between DLA of writer head and DLA of [clean-up heads output dependent...
Overview
* Solving the problem of mesa-optimization would probably be easier if we understood how models perform search internally.
* We are training GPT-type models on the toy task of solving mazes and studying them from both a mechanistic interpretability and a behavioral perspective.
* This post lays out our model...