Simon Goldstein — AI Alignment Forum

AI Deception: A Survey of Examples, Risks, and Potential Solutions

By Peter S. Park, Simon Goldstein, Aidan O’Gara, Michael Chen, and Dan Hendrycks [This post summarizes our new report on AI deception, available here] Abstract: This paper argues that a range of current AI systems have learned how to deceive humans. We define deception as the systematic inducement of false...

Aug 29, 2023 · 54

Shutdown-Seeking AI

This is a draft written by Simon Goldstein, associate professor at the Dianoia Institute of Philosophy at ACU, and Pamela Robinson, postdoctoral research fellow at the Australian National University, as part of a series of papers for the Center for AI Safety Philosophy Fellowship's midpoint. Abstract: We propose developing AIs...

May 31, 2023 · 50

Language Agents Reduce the Risk of Existential Catastrophe

by cdkg and Simon Goldstein

This post was written by Simon Goldstein, associate professor at the Dianoia Institute of Philosophy at ACU, and Cameron Domenico Kirk-Giannini, assistant professor at Rutgers University, for submission to the Open Philanthropy AI Worldviews Contest. Both authors are currently Philosophy Fellows at the Center for AI Safety. Abstract: Recent advances...

May 28, 2023 · 39

The Polarity Problem [Draft]

by Dan H, cdkg, and Simon Goldstein

This is a draft written by Cameron Domenico Kirk-Giannini, assistant professor at Rutgers University, and Simon Goldstein, associate professor at the Dianoia Institute of Philosophy at ACU, as part of a series of papers for the Center for AI Safety Philosophy Fellowship's midpoint. Dan helped post to the Alignment Forum....

May 23, 2023 · 24

Aggregating Utilities for Corrigible AI [Feedback Draft]

by Dan H and Simon Goldstein

This is a draft written by Simon Goldstein, associate professor at the Dianoia Institute of Philosophy at ACU, as part of a series of papers for the Center for AI Safety Philosophy Fellowship. Dan helped post to the Alignment Forum. This draft is meant to solicit feedback. PDF of this...

May 12, 2023 · 28