AI ALIGNMENT FORUM

jiaxin wen

Comments

Fabien's Shortform
jiaxin wen 8mo 24

Thanks for sharing! I'm a bit surprised that sleeper agents is listed as the best demo (e.g., ranked higher than alignment faking). Are you focusing on the main idea rather than the specific operationalization here? I ask because I think backdoored or password-locked LMs could be quite different from real-world threat models.
