All Posts

Sorted by Magic (New & Upvoted)

Sunday, July 26th 2020
Sun, Jul 26th 2020

Shortform
2Alex Turner11dAn additional consideration for early work on interpretability: it slightly increases the chance we actually get an early warning shot. If a system misbehaves, we can inspect its cognition and (hopefully) find hints of intentional deception. Could motivate thousands of additional researcher-hours being put into alignment.