AI ALIGNMENT FORUM Tags

Inner Alignment

• Applied to Pacing Outside the Box: RNNs Learn to Plan in Sokoban by Adrià Garriga-Alonso 2d ago
• Applied to A more systematic case for inner misalignment by Ruben Bloom 7d ago
• Applied to Interpretability in Action: Exploratory Analysis of VPT, a Minecraft Agent by Karolis Jucys 10d ago
• Applied to A simple case for extreme inner misalignment by quila 14d ago
• Applied to A "Bitter Lesson" Approach to Aligning AGI and ASI by Roger Dearnaley 22d ago
• Applied to Language for Goal Misgeneralization: Some Formalisms from my MSc Thesis by Giulio Starace 1mo ago
• Applied to Demystifying "Alignment" through a Comic by Milan Rosko 2mo ago
• Applied to Finding Backward Chaining Circuits in Transformers Trained on Tree Search by abhayesian 2mo ago
• Applied to minutes from a human-alignment meeting by bhauth 2mo ago
• Applied to SAE sparse feature graph using only residual layers by Jaehyuk Lim 2mo ago
• Applied to Implementing Asimov's Laws of Robotics - How I imagine alignment working. by Joshua Clancy 2mo ago
• Applied to Visualizing neural network planning by Nevan Wichers 3mo ago
• Applied to Measuring Learned Optimization in Small Transformer Models by Jonathan Bostock 4mo ago
• Applied to [Aspiration-based designs] 1. Informal introduction by Jobst Heitzig 4mo ago
• Applied to On the Confusion between Inner and Outer Misalignment by jacobjacob 4mo ago
• Applied to Invitation to the Princeton AI Alignment and Safety Seminar by Sadhika Malladi 4mo ago