[NOTE: This paper was previously titled 'The Shutdown Problem: Three Theorems'.]
This paper is an updated version of the first half of my AI Alignment Awards contest entry. My theorems build on results by Soares, Fallenstein, Yudkowsky, and Armstrong in various ways, and they can guide our search for solutions to the shutdown problem.
One aim of the paper is to get academic philosophers and decision theorists interested in the shutdown problem and related topics in AI alignment. They’re my assumed audience. I’m posting here because I think the theorems will also be interesting to people already familiar with the shutdown problem.
For discussion and feedback, I thank Adam Bales, Ryan Carey, Bill D’Alessandro, Tomi…