AI ALIGNMENT FORUM

Interpreting a Maze-Solving Network

Apr 20, 2023 by TurnTrout

Mechanistic interpretability on a pretrained policy network from the paper "Goal Misgeneralization in Deep Reinforcement Learning".

47 Predictions for shard theory mechanistic interpretability results
TurnTrout, Ulisse Mini, peligrietzer
3y
6
140 Understanding and controlling a maze-solving policy network
TurnTrout, peligrietzer, Ulisse Mini, Monte M, David Udell
3y
23
37 Maze-solving agents: Add a top-right vector, make the agent go to the top-right
TurnTrout, peligrietzer, lisathiergart
3y
7
22 Behavioural statistics for a maze-solving agent
peligrietzer, TurnTrout
2y
10