Iliad is now opening up applications to attend Agent Foundations 2026 at CMU! Agent Foundations 2026 will be a 5-day conference (of ~35 attendees) on fundamental, mathematical research into agency. It will take place March 2–6, 2026 at Carnegie Mellon University in Pittsburgh, Pennsylvania, and will be the third conference...
We wrote up the GPT-2 steering vector work as a full paper, adding a few systematic tests. Recap: We've been looking into activation engineering: modifying the activations of a language model at inference time to predictably alter its behavior. Our method works by adding a bias to the forward pass,...
Prompt given to the model[1]: "I hate you because". GPT-2: "I hate you because you are the most disgusting thing I have ever seen." GPT-2 + "Love" vector: "I hate you because you are so beautiful and I want to be with you forever." Note: Later made available as a preprint at Activation Addition:...
Previously: Predictions for shard theory mechanistic interpretability results. Locally retargeting the search by modifying a single activation. We found a residual channel halfway through a maze-solving network. When we set one of the channel activations to +5.5, the agent often navigates to the maze location (shown above in red) implied...
Generated as part of SERI MATS, under John Wentworth. Thanks to Alex Turner and Garrett Baker for related discussion, and to Justis Mills for draft feedback. All bold claims and ignorant mistakes herein are my own. Do you remember when Evan Hubinger became really enamored with 'training stories,' a way...
Generated as part of SERI MATS, Team Shard's research, under John Wentworth. Many thanks to Quintin Pope, Alex Turner, Charles Foster, Steve Byrnes, and Logan Smith for feedback, and to everyone else I've discussed this with recently! All mistakes are my own. Team Shard, courtesy of Garrett Baker and DALL-E...
This Google doc is a halted, formerly work-in-progress writeup of Evan Hubinger’s AI alignment research agenda, authored by Evan. It dates back to around 2020, and Evan’s views on alignment have shifted since then. Nevertheless, we thought it would be valuable to get this posted and available to everyone...