Overview
- Solving the problem of mesa-optimization would probably be easier if we understood how models do search internally
- We are training GPT-type models on the toy task of solving mazes and studying them from both a mechanistic interpretability perspective and a behavioral one.
- This post lays out our model training setup, our hypotheses, and the experiments we are running and plan to run. Experimental results will be forthcoming in our next post.
- We invite members of the LW community to challenge our hypotheses and the potential relevance of this line of work. Our main source code is open source, and we are open to collaborations.
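
To make the "mazes as a toy task for GPT-type models" framing concrete, here is a minimal sketch of one way a maze and its solution path could be serialized into a token sequence for an autoregressive model. This is purely illustrative: the function name, special tokens, and encoding below are assumptions for exposition, not necessarily the scheme our codebase uses.

```python
# Illustrative only: one possible serialization of a grid maze plus its
# solution path into a flat token sequence for next-token prediction.
# All special tokens (<MAZE>, <WALL>, etc.) are hypothetical.

def maze_to_tokens(walls, start, goal, path):
    """walls: list of ((r1,c1),(r2,c2)) pairs of adjacent cells separated by a wall;
    start/goal: (row, col) cells; path: ordered list of cells from start to goal."""
    tokens = ["<MAZE>"]
    for a, b in walls:
        tokens += ["<WALL>", f"{a}", f"{b}"]
    tokens += ["<START>", f"{start}", "<GOAL>", f"{goal}", "<PATH>"]
    tokens += [f"{cell}" for cell in path]
    tokens.append("<END>")
    return tokens

toks = maze_to_tokens(
    walls=[((0, 0), (0, 1))],
    start=(0, 0),
    goal=(1, 1),
    path=[(0, 0), (1, 0), (1, 1)],
)
print(toks)
```

A model trained on such sequences is conditioned on the maze description and asked to emit the path tokens, which is what makes behavioral evaluation (does it emit valid, optimal paths?) straightforward.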
Introduction
Some threat models of misalignment presuppose the existence of an agent which has learned to perform a search over...