x

AI ALIGNMENT FORUM

AF

Helena Casademunt — AI Alignment Forum

Helena Casademunt

Helena Casademunt

Message

88

3

11mo

Helena Casademunt

88

11mo

Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation

by Bartosz Cywiński, Helena Casademunt, Khoi Tran, aryaj, Sam Marks, and Neel Nanda

TL;DR: We introduce a testbed based on censored Chinese LLMs, which serve as natural objects of study for studying secret elicitation techniques. Then we study the efficacy of honesty elicitation and lie detection techniques for detecting and removing generated falsehoods. This post presents a summary of the paper, including examples...

Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning

by kh4dien, Helena Casademunt, Adam Karvonen, Sam Marks, Senthooran Rajamanoharan, and Neel Nanda

Summary * We introduce an interpretability-based technique for controlling how fine-tuned LLMs generalize out-of-distribution, without modifying training data. * We show it can mitigate emergent misalignment by training models that write insecure code without becoming misaligned. * It can also reduce sensitivity to spurious cues, even when they are present...

Jul 23, 2025•79