This is a story where the alignment problem is somewhat harder than I expect, society handles AI more competently than I expect, and the outcome is worse than I expect. It also involves inner alignment turning out to be a surprisingly small problem. Maybe the story is 10-20th percentile on each of those axes. At the end I’m going to go through some salient ways you could vary the story.
This isn’t intended to be a particularly great story (and it’s pretty informal). I’m still trying to think through what I expect to happen if alignment turns out to be hard, and this is more like the most recent entry in a long journey of gradually-improving stories.
I wrote this up a few months ago and was reminded to post...
I've felt like the problem of counterfactuals is "mostly settled" (modulo some math working out) for about a year, but I don't think I've really communicated this online. Partly, I've been waiting to write up more formal results. But other research has taken up most of my time, so I'm not sure when I would get to it.
So, the following contains some "shovel-ready" problems. If you're convinced by my overall perspective, you may be interested in pursuing some of them. I think these directions have a high chance of basically solving the problem of counterfactuals (including logical counterfactuals).
Another reason for posting this rough write-up is to get feedback: am I missing the mark? Is this not what counterfactual reasoning is about? Can you illustrate remaining problems with...
Epistemic status: not confident enough to bet against someone who’s likely to understand this stuff.
The lottery ticket hypothesis of neural network learning (as aptly described by Daniel Kokotajlo) roughly says:
When the network is randomly initialized, there is a sub-network that is already decent at the task. Then, when training happens, that sub-network is reinforced and all other sub-networks are dampened so as to not interfere.
This is a very simple, intuitive, and useful picture to have in mind, and the original paper presents interesting evidence for at least some form of the hypothesis. Unfortunately, the strongest forms of the hypothesis do not seem plausible - e.g. I doubt that today’s neural networks already contain dog-recognizing subcircuits at initialization. Modern neural networks are big, but not that big.
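A toy way to see what the weaker form of the hypothesis is claiming: among the many masks of a randomly initialized "network," some sub-networks may already score well on a simple task before any training. The sketch below is purely illustrative (a single linear layer and random binary masks, not the iterative magnitude-pruning procedure from the original paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary task: the label is just the sign of the first input coordinate.
X = rng.normal(size=(200, 10))
y = (X[:, 0] > 0).astype(int)

# A randomly initialized "network" -- here just a single linear layer.
w = rng.normal(size=10)

def accuracy(weights):
    return float(np.mean((X @ weights > 0).astype(int) == y))

# A "sub-network" is the random init with some weights masked to zero.
# Search random binary masks for the best-scoring sub-network at init.
best = max(
    accuracy(w * rng.integers(0, 2, size=10))
    for _ in range(1000)
)

print(f"full random net: {accuracy(w):.2f}  best sub-network: {best:.2f}")
```

With 10 weights the mask space is tiny, so a good sub-network is easy to find; the point of the "not that big" objection above is that for dog-recognizing circuits the analogous search space, while huge, is still nowhere near large enough.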
With: Thomas Krendl Gilbert, who provided comments, interdisciplinary feedback, and input on the RAAP concept. Thanks also for comments from Ramana Kumar.
Target audience: researchers and institutions who think about existential risk from artificial intelligence, especially AI researchers.
Preceded by: Some AI research areas and their relevance to existential safety, which emphasized the value of thinking about multi-stakeholder/multi-agent social applications, but without concrete extinction scenarios.
This post tells a few different stories in which humanity dies out as a result of AI technology, but where no single source of human or automated agency is the cause. Scenarios with multiple AI-enabled superpowers are often called “multipolar” scenarios in AI futurology jargon, as opposed to “unipolar” scenarios with just one superpower.
|  | Unipolar take-offs | Multipolar take-offs |
| --- | --- | --- |
| Slow take-offs | <not this post> | Part 1 of this post |
| Fast take-offs | <not this post> | Part 2 of this post |
What fun things could one build with +12 orders of magnitude of compute? By ‘fun’ I mean ‘powerful.’ This hypothetical is highly relevant to AI timelines, for reasons I’ll explain later.
I describe a hypothetical scenario that concretizes the question “what could be built with 2020’s algorithms/ideas/etc. but a trillion times more compute?” Then I give some answers to that question. Then I ask: How likely is it that some sort of TAI would happen in this scenario? This second question is a useful operationalization of the (IMO) most important, most-commonly-discussed timelines crux: “Can we get TAI just by throwing more compute at the problem?” I consider this operationalization to be the main contribution of this post; it directly plugs into Ajeya’s timelines...
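The “+12 orders of magnitude” premise is easy to pin down numerically. A minimal arithmetic sketch (the ~3e23 FLOP baseline is a commonly cited rough estimate for a GPT-3-scale training run, used here only as an assumption):

```python
import math

baseline_flop = 3e23                # assumed ~GPT-3-scale training run
scaled_flop = baseline_flop * 1e12  # "+12 OOMs" of compute

print(f"{scaled_flop:.0e} FLOP, "
      f"{math.log10(scaled_flop / baseline_flop):.0f} extra OOMs")
```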
As a follow-up to the Walled Garden discussion about Kelly betting, Scott Garrabrant made some super-informal conjectures to me privately, involving the idea that some class of "nice" agents would "Kelly bet influence", where "influence" had something to do with anthropics and acausal trade.
I was pretty incredulous at the time. However, as soon as he left the discussion, I came up with an argument for a similar fact. (The following does not perfectly reflect what Scott had in mind, by any means. His notion of "influence" was very different, for a start.)
The meat of my argument is just Critch's negotiable RL theorem. In fact, that's practically the entirety of my argument. I'm just thinking about the consequences in a different way from how I have before.
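For readers who haven’t seen Kelly betting: the rule says to stake the fixed fraction of wealth that maximizes expected log-wealth growth. A minimal sketch of the classic money version (this illustrates ordinary Kelly betting, not Critch’s theorem or Scott’s “influence” notion):

```python
import numpy as np

def kelly_fraction(p, b):
    """Fraction of wealth to stake on a bet paying b-to-1
    with win probability p; never bet when the edge is negative."""
    return max(0.0, p - (1 - p) / b)

def log_growth(f, p, b):
    """Expected log-wealth growth per bet when staking fraction f."""
    return p * np.log(1 + b * f) + (1 - p) * np.log(1 - f)

# Even-money bet (b = 1) with p = 0.6: Kelly says stake 20% of wealth,
# and that fraction maximizes expected log-growth.
f = kelly_fraction(0.6, 1.0)
print(round(f, 10), round(log_growth(f, 0.6, 1.0), 4))  # → 0.2 0.0201
```

The log-wealth objective is what makes the connection to the negotiable RL setting natural: maximizing expected log of a resource is exactly the kind of criterion that shows up when agents bargain over future control.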