AI Alignment Forum

Interpreting a Maze-Solving Network

Apr 20, 2023 by Alex Turner

Mechanistic interpretability on a pretrained policy network from the paper "Goal Misgeneralization in Deep Reinforcement Learning".

Predictions for shard theory mechanistic interpretability results
Alex Turner, Ulisse Mini, peligrietzer (47 karma, 6 comments, 7mo)

Understanding and controlling a maze-solving policy network
Alex Turner, peligrietzer, Ulisse Mini, Monte MacDiarmid, David Udell (129 karma, 18 comments, 7mo)

Maze-solving agents: Add a top-right vector, make the agent go to the top-right
Alex Turner, peligrietzer, lisathiergart (37 karma, 7 comments, 6mo)

Behavioural statistics for a maze-solving agent
peligrietzer, Alex Turner (20 karma, 10 comments, 5mo)