Diogo de Lucena — AI Alignment Forum

Mistral Large 2 (123B) seems to exhibit alignment faking

by Marc Carauleanu, Diogo de Lucena, Gunnar_Zarncke, Cameron Berg, Kvee, Mike Vaiana, and Trent Hodgeson

UPDATE: Recent work with improved AF and compliance gap classifiers disagrees with our results. We recommend using the improved classifiers. Summary: We wanted to briefly share an early takeaway from our exploration into alignment faking: the phenomenon appears fairly rare among the smaller open-source models we tested (including reasoning models)....

Mar 27, 2025 · 81

Reducing LLM deception at scale with self-other overlap fine-tuning

by Marc Carauleanu, Diogo de Lucena, Gunnar_Zarncke, Kvee, Cameron Berg, Mike Vaiana, and Trent Hodgeson

This research was conducted at AE Studio and supported by the AI Safety Grants program administered by Foresight Institute with additional support from AE Studio. Summary: In this post, we summarize the main experimental results from our new paper, "Towards Safe and Honest AI Agents with Neural Self-Other Overlap", which...

Mar 13, 2025 · 162

Self-prediction acts as an emergent regularizer

by Cameron Berg, Kvee, Mike Vaiana, Diogo de Lucena, florin_pop, and Trent Hodgeson

TL;DR: In our recent work with Professor Michael Graziano (arXiv, thread), we show that adding an auxiliary self-modeling objective to supervised learning tasks yields simpler, more regularized, and more parameter-efficient models. Across three classification tasks and two modalities, self-modeling consistently reduced complexity (lower RLCT, narrower weight distribution). This restructuring effect...

Oct 23, 2024 · 92

Self-Other Overlap: A Neglected Approach to AI Alignment

by Marc Carauleanu, Mike Vaiana, Kvee, Diogo de Lucena, Cameron Berg, and Trent Hodgeson

Figure 1. Image generated by DALL·E 3 to represent the concept of self-other overlap. Many thanks to Bogdan Ionut-Cirstea, Steve Byrnes, Gunnar Zarncke, Jack Foxabbott and Seong Hah Cho for critical comments and feedback on earlier and ongoing versions of this work. This research was conducted at AE Studio and...

Jul 30, 2024 · 247