This is a linkpost for

Abstract: The AGI alignment problem has a bimodal distribution of outcomes, with most outcomes clustering around the poles of total success and existential, catastrophic failure. Consequently, attempts to solve AGI alignment should, all else equal, prefer false negatives (ignoring research programs that would have been successful) to false positives (pursuing research programs that will unexpectedly fail). Thus, we propose adopting a policy of responding to points of metaphysical and practical uncertainty associated with the alignment problem by limiting and choosing necessary assumptions to reduce the risk of false positives. Herein we explore in detail some of the relevant points of uncertainty that AGI alignment research hinges on and consider how to reduce false positives in response to them.

If you've been following along, I've been working toward a particular end the past couple of months, and that end is this paper. It's currently under review for journal publication, but you can read the preprint now! This marks the first in what I expect to be several papers exploring and explaining my belief that we can better figure out how to solve alignment via phenomenology and philosophical investigation, because there are key questions at the heart of alignment that are poorly examined and not well grounded. Since it's the first, this paper is intentionally conservative in its methods (you'll notice that, aside from a few citations, I stay within the analytic philosophical tradition), which I believe makes it more compelling to my target audience of AI researchers, but later papers may make more direct use of phenomenological methods.

This also serves as the soft launch of the Phenomenological AI Safety Research Institute, so that there's a place to work on these ideas. We have no money, but if you're interested in this line of research, I'd be happy to talk with you about potential collaborations or research projects we need help with.
