peligrietzer — AI Alignment Forum

Shallow review of technical AI safety, 2025

by technicalities, Tomáš Gavenčiak, Stephen McAleese, peligrietzer, Stag, jordinne, ozziegooen, Violet Hour, and lenz

Figure: Change in 18 latent capabilities between GPT-3 and o1, from Zhou et al (2025). This is the third annual review of what’s going on in technical AI safety. You could stop reading here and instead explore the data on the shallow review website. It’s shallow in the sense that...

Dec 17, 2025 · 194

The Problem With the Word ‘Alignment’

This post was written by Peli Grietzer, inspired by internal writings by TJ (tushita jha), for AOI[1]. The original post, published on Feb 5, 2024, can be found here: https://ai.objectives.institute/blog/the-problem-with-alignment. The purpose of our work at the AI Objectives Institute (AOI) is to direct the impact of AI towards human...

May 21, 2024 · 64

Paper: Understanding and Controlling a Maze-Solving Policy Network

by TurnTrout, Ulisse Mini, peligrietzer, mrinank_sharma, Austin Meek, Monte M, and lisathiergart

Mrinank, Austin, and Alex wrote a paper on the results from "Understanding and controlling a maze-solving policy network", "Maze-solving agents: Add a top-right vector, make the agent go to the top-right", and "Behavioural statistics for a maze-solving agent". Abstract: To understand the goals and goal representations of AI systems,...

Oct 13, 2023 · 70

Behavioural statistics for a maze-solving agent

Summary: "Understanding and controlling a maze-solving policy network" analyzed a maze-solving agent's behavior. We isolated four maze properties which seemed to predict whether the mouse goes towards the cheese or towards the top-right corner. In this post, we conduct a more thorough statistical analysis, addressing issues of multicollinearity. We show...

Apr 20, 2023 · 46

Maze-solving agents: Add a top-right vector, make the agent go to the top-right

by TurnTrout, peligrietzer, and lisathiergart

Overview: We modify the goal-directed behavior of a trained network, without any gradients or finetuning. We simply add or subtract "motivational vectors" which we compute in a straightforward fashion. In the original post, we defined a "cheese vector" to be "the difference in activations when the cheese is present in...

Mar 31, 2023 · 101

Understanding and controlling a maze-solving policy network

by TurnTrout, peligrietzer, Ulisse Mini, Monte M, and David Udell

Previously: "Predictions for shard theory mechanistic interpretability results". Locally retargeting the search by modifying a single activation. We found a residual channel halfway through a maze-solving network. When we set one of the channel activations to +5.5, the agent often navigates to the maze location (shown above in red) implied...

Mar 11, 2023 · 336

Predictions for shard theory mechanistic interpretability results

by TurnTrout, Ulisse Mini, and peligrietzer

How do agents work, internally? My (TurnTrout's) shard theory MATS team set out to do mechanistic interpretability on one of the goal misgeneralization agents: the cheese-maze network. The network in action on its training distribution, where cheese is randomly spawned in the top-right 5x5 available grid region. For more training...

Mar 1, 2023 · 105