AI ALIGNMENT FORUM

Interpreting a Maze-Solving Network

Apr 20, 2023 by TurnTrout

Mechanistic interpretability of a pretrained policy network from the paper "Goal Misgeneralization in Deep Reinforcement Learning".

Predictions for shard theory mechanistic interpretability results
TurnTrout, Ulisse Mini, peligrietzer
47 karma, 6 comments, posted 3y ago

Understanding and controlling a maze-solving policy network
TurnTrout, peligrietzer, Ulisse Mini, Monte M, David Udell
140 karma, 23 comments, posted 3y ago

Maze-solving agents: Add a top-right vector, make the agent go to the top-right
TurnTrout, peligrietzer, lisathiergart
37 karma, 7 comments, posted 3y ago

Behavioural statistics for a maze-solving agent
peligrietzer, TurnTrout
22 karma, 10 comments, posted 3y ago