With: Thomas Krendl Gilbert, who provided comments, interdisciplinary feedback, and input on the RAAP concept. Thanks also for comments from Ramana Kumar.
Target audience: researchers and institutions who think about existential risk from artificial intelligence, especially AI researchers.
Preceded by: Some AI research areas and their relevance to existential safety, which emphasized the value of thinking about multi-stakeholder/multi-agent social applications, but without concrete extinction scenarios.
This post tells a few different stories in which humanity dies out as a result of AI technology, but where no single source of human or automated agency is the cause. Scenarios with multiple AI-enabled superpowers are often called “multipolar” scenarios in AI futurology jargon, as opposed to “unipolar” scenarios with just one superpower.
|  | Unipolar take-offs | Multipolar take-offs |
|---|---|---|
| Slow take-offs | &lt;not this post&gt; | Part 1 of this post |
| Fast take-offs | &lt;not this post&gt; | Part 2 of this post |
This is a story where the alignment problem is somewhat harder than I expect, society handles AI more competently than I expect, and the outcome is worse than I expect. It also involves inner alignment turning out to be a surprisingly small problem. Maybe the story is 10-20th percentile on each of those axes. At the end I’m going to go through some salient ways you could vary the story.
This isn’t intended to be a particularly great story (and it’s pretty informal). I’m still trying to think through what I expect to happen if alignment turns out to be hard, and this is more like the most recent entry in a long journey of gradually-improving stories.
I wrote this up a few months ago and was reminded to post...
This is independent research. To make it possible for me to continue writing posts like this, please consider supporting me.
Many thanks to Professor Littman for reviewing a draft of this post.
Yesterday, at a seminar organized by the Center for Human-Compatible AI (CHAI), Professor Michael Littman gave a presentation entitled "The HCI of HAI", or "The Human-Computer Interaction of Human-Compatible Artificial Intelligence". Professor Littman is a computer science professor at Brown who has done foundational work in reinforcement learning, as well as in many other areas of computer science. It was a very interesting presentation, and I would like to reflect a little on what was said.
The basic question Michael addressed was: "how do we get machines to do what we want?" and his talk was structured around...
This post has benefited greatly from discussion with Sam Eisenstat, Caspar Oesterheld, and Daniel Kokotajlo.
Last year, I wrote a post claiming there was a Dutch Book against CDT agents whose counterfactual expectations differ from EDT's conditional expectations. However, the argument was a bit fuzzy.
I recently came up with a variation on the argument which gets around some problems; I present this more rigorous version here.
Here, "CDT" refers -- very broadly -- to using counterfactuals to evaluate expected value of actions. It need not mean physical-causal counterfactuals. In particular, TDT counts as "a CDT" in this sense.
"EDT", on the other hand, refers to the use of conditional probability to evaluate expected value of actions.
Put more mathematically, for action $a$, EDT uses $\mathbb{E}[U \mid a]$, and CDT uses $\mathbb{E}[U \mid \mathrm{do}(a)]$. I'll write $\mathbb{E}[U \mid a]$ and $\mathbb{E}[U \mid \mathrm{do}(a)]$ for these, respectively...
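To make the two expectations concrete, here is a small illustration in Python (my own sketch, not part of the original argument), using a physical-causal do() for simplicity. It sets up a Smoking-Lesion-style problem with made-up numbers, where conditioning on the action and intervening on it give different answers:

```python
# Toy illustration (my own, not from the post): a Smoking-Lesion-style
# problem where E[U | a] (EDT) and E[U | do(a)] (CDT) come apart.
# All numbers are made up for the example.

P_LESION = 0.5  # prior probability of the latent common cause
P_SMOKE_GIVEN_LESION = {True: 0.9, False: 0.1}  # the lesion makes smoking likely

def utility(lesion: bool, smoke: bool) -> float:
    # Smoking is worth +1; the lesion costs 10 utils regardless of the action.
    return (1.0 if smoke else 0.0) - (10.0 if lesion else 0.0)

def edt_value(a: bool) -> float:
    # E[U | a]: conditioning on the action updates beliefs about the lesion.
    num = den = 0.0
    for lesion in (True, False):
        p_lesion = P_LESION if lesion else 1 - P_LESION
        p_act = P_SMOKE_GIVEN_LESION[lesion] if a else 1 - P_SMOKE_GIVEN_LESION[lesion]
        num += p_lesion * p_act * utility(lesion, a)
        den += p_lesion * p_act
    return num / den

def cdt_value(a: bool) -> float:
    # E[U | do(a)]: intervening leaves the lesion at its prior probability.
    return sum((P_LESION if lesion else 1 - P_LESION) * utility(lesion, a)
               for lesion in (True, False))

for a in (True, False):
    print(f"smoke={a}: EDT {edt_value(a):+.2f}, CDT {cdt_value(a):+.2f}")
# EDT prefers not smoking (-1.00 vs -8.00); CDT prefers smoking (-4.00 vs -5.00).
```

This kind of divergence is exactly the situation the Dutch Book targets: the counterfactual expectations differ from the conditional ones.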
Superrationality, and generalizations of it, must treat options differently depending on how they're named.
Consider the penny correlation game: Both players decide independently on either heads or tails. If they decided on the same thing, they each get one util; otherwise they get nothing. You play this game with an exact copy of yourself. You reason: since the other guy is an exact copy of me, whatever I do, he will do the same thing. So we will get the util. Then you pick heads, because it's first alphabetically or some other silly consideration, and you win. Good thing you got to play with a copy; otherwise you would have only gotten half a util in expectation.
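For concreteness, here is the expected-utility arithmetic as a quick Python check (my own sketch, not from the post):

```python
# Quick check of the arithmetic above (my own sketch, not from the post).
import random

def payoff(a: str, b: str) -> float:
    # Penny correlation game: one util each iff the choices match.
    return 1.0 if a == b else 0.0

def my_policy() -> str:
    return "heads"  # deterministic choice, e.g. "first alphabetically"

# Against an exact copy, the copy runs the same policy, so you always match.
print(payoff(my_policy(), my_policy()))  # 1.0

# Against an uncorrelated opponent, you match half the time in expectation.
n = 100_000
avg = sum(payoff(my_policy(), random.choice(["heads", "tails"]))
          for _ in range(n)) / n
print(round(avg, 2))  # ~0.5
```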
Now consider the penny anti-correlation game: Both players decide independently on...
This is definitely a hack, but it seems to solve many problems around Cartesian Boundaries. Much of this is a development of earlier ideas about the Predict-O-Matic; see that post if something is unclear.
Phylactery Decision Theory takes a Base Decision Theory (BDT) as an input and builds something around it, creating a new modified decision theory. Its purpose is to give its base the ability to """learn""" its position in the world.
I'll start by explaining a model of it in a Cartesian context. Let's say we have an agent with a set of designated input and output channels. Then it makes its "decisions" like this: First, it has a probability distribution over everything, including the values of the output channels in the future, and updates it based on the...
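Since the description is cut off above, here is only a minimal scaffold of the Cartesian setup as described so far, in Python. This is my own sketch: the class and method names are invented, histories are represented as sets of (time, channel, value) triples, and the actual decision rule is left entirely to the base decision theory.

```python
# Minimal scaffold of the Cartesian setup described above (my own sketch;
# names and representations are invented, not from the post).
from typing import Callable, Dict, FrozenSet, List, Tuple

Event = Tuple[int, str, object]   # (time, channel, value)
History = FrozenSet[Event]        # a complete world history

class PhylacteryDT:
    """Wraps a base decision theory (BDT) and maintains a probability
    distribution over whole histories, including the future values of the
    agent's own output channels."""

    def __init__(self, bdt: Callable, prior: Dict[History, float]):
        self.bdt = bdt              # the base decision theory being wrapped
        self.beliefs = dict(prior)  # distribution over complete histories

    def update(self, event: Event) -> None:
        # Bayesian conditioning on what an input channel showed.
        consistent = {h: p for h, p in self.beliefs.items() if event in h}
        total = sum(consistent.values())
        assert total > 0, "observation had prior probability zero"
        self.beliefs = {h: p / total for h, p in consistent.items()}

    def decide(self, options: List):
        # Delegate the actual choice to the base decision theory,
        # evaluated under the current beliefs.
        return self.bdt(self.beliefs, options)
```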