With: Thomas Krendl Gilbert, who provided comments, interdisciplinary feedback, and input on the RAAP concept. Thanks also for comments from Ramana Kumar.
Target audience: researchers and institutions who think about existential risk from artificial intelligence, especially AI researchers.
Preceded by: Some AI research areas and their relevance to existential safety, which emphasized the value of thinking about multi-stakeholder/multi-agent social applications, but without concrete extinction scenarios.
This post tells a few different stories in which humanity dies out as a result of AI technology, but where no single source of human or automated agency is the cause. Scenarios with multiple AI-enabled superpowers are often called “multipolar” scenarios in AI futurology jargon, as opposed to “unipolar” scenarios with just one superpower.
|  | Unipolar take-offs | Multipolar take-offs |
| --- | --- | --- |
| Slow take-offs | &lt;not this post&gt; | Part 1 of this post |
| Fast take-offs | &lt;not this post&gt; | Part 2 of this post |
> My prior (and present) position is that reliability meeting a certain threshold, rather than being optimized, is a dominant factor in how soon deployment happens.
I don't think we can get to convergence on many of these discussions, so I'm happy to just leave it here for the reader to think through.
Yeah, I agree we probably can't reach convergence on how alignment affects deployment time, at least not in this medium (especially since a lot of info about company policies / plans / standards is covered under NDAs), so I also think it's good to leave this...
Servant of Many Masters: Shifting priorities in Pareto-optimal sequential decision-making
A policy (over some partially observable Markov decision process (POMDP)) is Pareto optimal with respect to two agents with different utility functions if it is not possible to construct a policy that achieves higher utility for one of the agents without doing worse for the other agent. A result by Harsanyi shows that for agents that have the same beliefs, Pareto optimal policies act as if they are maximizing some weighted sum of the two agents' utility functions. However, what if the agents have different beliefs?
Interestingly, if two agents disagree about the world, it is possible to construct policies that are better for both...
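As a toy illustration of the claim above (the numbers are my own, not from the post): if two agents assign different probabilities to a binary event, a policy that routes reward according to the outcome can raise *both* agents' subjective expected utility relative to an outcome-independent split.

```python
# Hedged sketch: two agents who disagree about a coin can both expect
# to gain from a policy that bets on it. All numbers are illustrative.

p_A, p_B = 0.9, 0.1  # agent A believes P(heads)=0.9; agent B believes 0.1

# Baseline policy: split one unit of reward evenly regardless of outcome.
baseline_A = baseline_B = 0.5

# "Bet" policy: A gets the whole unit on heads, B gets it on tails.
bet_A = p_A * 1.0          # A's expected utility under A's beliefs
bet_B = (1 - p_B) * 1.0    # B's expected utility under B's beliefs

# Each agent, evaluating the bet policy with their own beliefs, prefers
# it to the even split -- a Pareto improvement "on paper", driven purely
# by the disagreement.
```

This is why Harsanyi-style weighted-sum aggregation breaks down under differing beliefs: the bet policy is subjectively better for both agents even though, under any single belief, it is just a transfer.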
Thanks for doing this! I'm excited to see this sequence grow, it's the sort of thing that could serve the function of a journal or textbook.
Special thanks to Kate Woolverton for comments and feedback.
There has been a lot of work and discussion surrounding the speed and continuity of AI takeoff scenarios. I do think these are important variables, but in my opinion they are relatively less important than many other axes along which takeoff scenarios can differ.
In particular, one axis on which different takeoff scenarios can differ that I am particularly interested in is their homogeneity—that is, how similar are the different AIs that get deployed in that scenario likely to be? If there is only one AI, or many copies of the same AI, then you get a very homogeneous takeoff, whereas if there are many different AIs trained via very different training regimes, then you get...
OK, thanks. YMMV but some people I've read / talked to seem to think that before we have successful world-takeover attempts, we'll have unsuccessful ones--"sordid stumbles." If this is true, it's good news, because it makes it a LOT easier to prevent successful attempts. Alas it is not true.
A much weaker version of something like this may be true, e.g. the warning shot story you proposed a while back about customer service bots being willingly scammed. It's plausible to me that we'll get stuff like that before it's t...
This review is part of a project with Joe Collman and Jérémy Perret to try to get as close as possible to peer review when giving feedback on the Alignment Forum. Our reasons behind this endeavor are detailed in our original post asking for suggestions of works to review; but the gist is that we hope to bring further clarity to the following questions:
Instead of thinking about...
I'm probably just being mathematically confused myself; at any rate, I'll proceed with the p[Tk & e+] : p[Tk & e-] version since that comes more naturally to me. (I think of it like: your credence in Tk is split between two buckets, the Tk&e+ bucket and the Tk&e- bucket, and when you update on e+ you rule out the e- bucket. So what matters is the ratio between the buckets; if it's relatively high (compared to the ratio for other Tx's) your credence in Tk goes up, and if it's relatively low it goes down.)
Anyhow, I totally agree tha...
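The "two buckets" picture above can be sketched numerically (the priors and likelihoods here are illustrative, not taken from the discussion):

```python
# Sketch of the bucket picture: each theory Tk splits its prior credence
# between a Tk&e+ bucket and a Tk&e- bucket; updating on e+ discards the
# e- buckets everywhere and renormalizes what remains.

priors = {"T1": 0.5, "T2": 0.5}
# Fraction of each theory's prior sitting in its e+ bucket:
p_eplus_given = {"T1": 0.8, "T2": 0.2}

joint_eplus = {k: priors[k] * p_eplus_given[k] for k in priors}
total = sum(joint_eplus.values())
posteriors = {k: v / total for k, v in joint_eplus.items()}

# T1's e+:e- ratio (0.8 : 0.2) beats T2's (0.2 : 0.8), so after seeing e+
# T1's credence rises from 0.5 to 0.8 while T2's falls to 0.2 -- exactly
# the "relatively high ratio means credence goes up" claim.
```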
This post has benefited greatly from discussion with Sam Eisenstat, Caspar Oesterheld, and Daniel Kokotajlo.
Last year, I wrote a post claiming there was a Dutch Book against CDTs whose counterfactual expectations differ from EDT. However, the argument was a bit fuzzy.
I recently came up with a variation on the argument which gets around some problems; I present this more rigorous version here.
Here, "CDT" refers -- very broadly -- to using counterfactuals to evaluate expected value of actions. It need not mean physical-causal counterfactuals. In particular, TDT counts as "a CDT" in this sense.
"EDT", on the other hand, refers to the use of conditional probability to evaluate expected value of actions.
Put more mathematically, for action $a$, EDT uses the conditional expectation $\mathbb{E}[u \mid a]$, and CDT uses a counterfactual expectation, which I'll write $\mathbb{E}[u \mid \mathrm{do}(a)]$. I'll write $\mathbb{E}[u \mid a]$ and $\mathbb{E}[u \mid \mathrm{do}(a)]$ ...
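To make the two expectations concrete, here is a minimal Newcomb-style toy problem (all numbers—predictor accuracy, payoffs, and the prior used by the causal expectation—are made-up illustrations, not from the post):

```python
# Toy Newcomb problem contrasting conditional vs. counterfactual expectation.
# The predictor fills the opaque box (worth 1000) iff it predicts "one-box";
# the transparent box (worth 1) can always be taken in addition.

ACC = 0.9          # assumed predictor accuracy
BIG, SMALL = 1000, 1

def utility(filled, action):
    return (BIG if filled else 0) + (SMALL if action == "two-box" else 0)

# EDT: condition on the action; P(filled | a) reflects predictor accuracy.
def edt_value(action):
    p_filled = ACC if action == "one-box" else 1 - ACC
    return p_filled * utility(True, action) + (1 - p_filled) * utility(False, action)

# A physical-causal CDT: intervening on the action leaves the (already
# decided) box contents fixed, so it uses some prior P(filled) that does
# not depend on the action chosen.
def cdt_value(action, p_filled_prior=0.5):
    return (p_filled_prior * utility(True, action)
            + (1 - p_filled_prior) * utility(False, action))
```

With these numbers, `edt_value` prefers one-boxing (900 vs. ~101) while `cdt_value` prefers two-boxing (501 vs. 500), exhibiting a case where the two expectations in the post genuinely diverge. (Note this uses physical-causal counterfactuals for concreteness; as the post says, "CDT" in its broad sense need not.)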
Hmm, on further reflection, I had an effect in mind which doesn't necessarily break your argument, but which increases the degree to which other counterarguments such as AlexMennen's break your argument. This effect isn't necessarily solved by multiplying the contract payoff (since decisions aren't necessarily continuous as a function of utilities), but it may under many circumstances be approximately solved by it. So maybe it doesn't matter so much, at least until AlexMennen's points are addressed so I can see where it fits in with that.
I've felt like the problem of counterfactuals is "mostly settled" (modulo some math working out) for about a year, but I don't think I've really communicated this online. Partly, I've been waiting to write up more formal results. But other research has taken up most of my time, so I'm not sure when I would get to it.
So, the following contains some "shovel-ready" problems. If you're convinced by my overall perspective, you may be interested in pursuing some of them. I think these directions have a high chance of basically solving the problem of counterfactuals (including logical counterfactuals).
Another reason for posting this rough write-up is to get feedback: am I missing the mark? Is this not what counterfactual reasoning is about? Can you illustrate remaining problems with...
Ah, but there is a sense in which it doesn't. The radical update rule is equivalent to updating on "secret evidence". And in TRL we have such secret evidence. Namely, if we only look at the agent's beliefs about "physics" (the environment), then they would be updated radically, because of secret evidence from "mathematics" (computations).
I agree that radical probabilism can be thought of as bayesian-with-a-side-channel, but it's nice to have a more general characterization where the side channel is black-box, rather than an explicit side-channel which we e...
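One way to see the bayesian-with-a-side-channel equivalence concretely (a sketch with hypothetical numbers): any "radical" shift of $P(H)$ from $p$ to $q$ can be reproduced as an ordinary Bayesian update on virtual side-channel evidence whose likelihood ratio is $L = \frac{q/(1-q)}{p/(1-p)}$.

```python
# Sketch: reproducing an arbitrary probability shift as a Bayesian update
# on virtual (side-channel) evidence. Numbers are illustrative.

p, q = 0.3, 0.7   # prior and desired posterior for hypothesis H

# Likelihood ratio the side-channel evidence must carry:
L = (q / (1 - q)) / (p / (1 - p))

# Ordinary Bayes with that likelihood ratio recovers exactly q:
posterior = (p * L) / (p * L + (1 - p))
```

This is the sense in which the "radical" update is equivalent to updating on secret evidence: the black-box characterization just refrains from naming the evidence explicitly.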