AI ALIGNMENT FORUM

Eleni Angelou — AI Alignment Forum

427

Ω

35

32

15

4

4y

A Problem to Solve Before Building a Deception Detector

TL;DR: If you are thinking of using interpretability to help with strategic deception, then there's likely a problem you need to solve first: how are intentional descriptions (like deception) related to algorithmic ones (like understanding the mechanisms models use)? We discuss this problem and try to outline some constructive directions....

Feb 7, 2025•78