Six AI Risk/Strategy Ideas

[-]Daniel Kokotajlo6y70

I particularly like your "Logical vs. physical risk aversion" distinction, and agree that we should prioritize reducing logical risk. I think acausal trade makes this particularly concrete. If we make a misaligned superintelligence that "plays nice" in the acausal bargaining community I'd think that's better than making an aligned superintelligence that doesn't, because overall it matters far more that the community is nice than that it have a high population of people with our values.

I also really like your point about how providing evidence that AI safety is difficult may be one of the most important reasons to do AI safety research. I guess I'd like to see some empirically grounded analysis of how likely it is that the relevant policymakers and so forth will be swayed by such things. So far it seems like they've been swayed by direct arguments that the problem is hard, and not so much by our failures to make progress. If anything, the failure of AI safety researchers to make progress seems to encourage their critics.

[-]habryka5y30Review for 2019 Review

I have now linked at least 10 times to the "'Generate evidence of difficulty' as a research purpose" section of this post. It was a thing that I kind of wanted to point to before this post came out, but I felt confused about it, and this post finally gave me a pointer to it.

I think that section was substantially more novel and valuable to me than the rest of this post, but it is also evidence that others might not have had some of the other ideas on their map either, and so they might find the post similarly valuable because of a different section.

[-]Buck6y30

Minor point: I think asteroid strikes are probably very highly correlated between Everett branches (though maybe the timing of spotting an asteroid on a collision course is variable).

[-]Wei Dai6y40

I think if we could look at all the Everett branches that contain some version of you, we'd see "bundles" where the asteroid locations are the same within each bundle but different between bundles, because different bundles evolved from different starting conditions (and then converged in terms of having produced someone who is subjectively indistinguishable from you). So a big asteroid strike would wipe out humanity in an entire bundle but that would only constitute a small fraction of all the Everett branches that contain a version of you.

Hopefully that makes sense?

[-]Buck6y20

Ah yes this seems totally correct

[-]cousin_it6y30

Multiple simultaneous DSAs under CAIS

Taking over the world is a big enough prize, compared to the wealth of a typical agent, that even a small chance of achieving it should already be enough to act. And waiting is dangerous if there's a chance of other agents outrunning you. So multiple agents having DSA but not acting for uncertainty reasons seems unlikely.

Logical vs physical risk aversion

Imagine you care about the welfare of two koalas living in separate rooms. Given a choice between both koalas dying with probability 1/2 or a randomly chosen koala dying with probability 1, why is the latter preferable?
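To make the koala comparison concrete, here is a minimal sketch that enumerates the two options exactly. The names (`both_or_none`, `exactly_one`, `expected_deaths`, `prob_all_die`) are illustrative and not from the thread. It only shows that the two options have the same expected number of deaths and differ in how the deaths are correlated; it does not settle which one should be preferred.

```python
from fractions import Fraction

# Each option is a list of (probability, number of koalas that die).
# Option A: both koalas die together with probability 1/2, otherwise neither dies.
both_or_none = [(Fraction(1, 2), 2), (Fraction(1, 2), 0)]
# Option B: one randomly chosen koala dies with probability 1.
exactly_one = [(Fraction(1), 1)]


def expected_deaths(option):
    """Expected number of koalas that die under the given option."""
    return sum(p * deaths for p, deaths in option)


def prob_all_die(option, total=2):
    """Probability that every koala dies under the given option."""
    return sum(p for p, deaths in option if deaths == total)


for name, option in [("both-or-none", both_or_none), ("exactly-one", exactly_one)]:
    print(name, expected_deaths(option), prob_all_die(option))
```

Any preference between the two options therefore has to come from something other than expected deaths, such as how much weight is placed on the outcome in which no koala survives.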

You could say our situation is different because we're the koala. Fine. Imagine you're choosing between a 1/2 physical risk and a 1/2 logical risk to all humanity, but both of them will happen in 100 years when you're already dead, so the welfare of your copies isn't at question. Why is the physical risk preferable? How is that different from the koala situation?

[-]Wei Dai6y20

Taking over the world is a big enough prize, compared to the wealth of a typical agent, that even a small chance of achieving it should already be enough to act.

In CAIS, AI services aren't agents themselves, especially the lower level ones. If they're controlled by humans, their owners/operators could well be risk averse enough (equivalently, not assign high enough utility to taking over the world) to not take advantage of a DSA given their uncertainty.

Imagine you’re choosing between a 1⁄2 physical risk and a 1⁄2 logical risk to all humanity, but both of them will happen in 100 years when you’re already dead, so the welfare of your copies isn’t at question. Why is the physical risk preferable?

I don't think it's possible for the welfare of my copies to not be at question. See this comment.

Another line of argument is that suppose we'll end up getting most of our utility from escaping simulations and taking over much bigger/richer universes. In those bigger universes we might eventually meet up with copies of us from other Everett branches and have to divide up the universe with them. So physical risk isn't as concerning in that scenario because the surviving branches will end up with larger shares of the base universes.

A similar line of thought is that in an acausal trade scenario, each surviving branch of a physical risk could get a better deal because whatever thing of value they have to offer has become more scarce in the multiverse economy.

[-]cousin_it6y*30

Many such intuitions seem to rely on "doors" between worlds. That makes sense - if we have two rooms of animals connected by a door, then killing all animals in one room will just lead to it getting repopulated from the other room, which is better than killing all animals in both rooms with probability 1/2. So in that case there's indeed a difference between the two kinds of risk.

The question is, how likely is a door between two Everett branches, vs. a door connecting a possible world with an impossible world? With current tech, both are impossible. With sci-fi tech, both could be possible, and based on the same principle (simulating whatever is on the other side of the door). But maybe "quantum doors" are more likely than "logical doors" for some reason?

[-]evhub6y*30

Another argument that definitely doesn't rely on any sort of "doors" for why physical risk might be preferable to logical risk is just if you have diminishing returns on the total number of happy humans. As long as your returns to happy humans are sublinear (logarithmic is a standard approximation, though anything sublinear works), then you should prefer a guaranteed shot at $\frac{1}{2}$ the Everett branches having lots of happy humans to a $\frac{1}{2}$ chance of all the Everett branches having happy humans.

To see this, suppose $U : \mathbb{N} \to \mathbb{R}$ measures your returns to the total number of happy humans across all Everett branches. Let $N$ be the total number of happy humans in a good Everett branch and $M$ the total number of Everett branches. Then, in the physical risk situation, you get

$$U_{\text{physical risk}} = U\left(\sum_{i=1}^{M/2} N\right) = U\left(\frac{MN}{2}\right)$$

whereas, in the logical risk situation, you get

$$U_{\text{logical risk}} = \frac{1}{2} U(0) + \frac{1}{2} U\left(\sum_{i=1}^{M} N\right) = \frac{1}{2} U(MN)$$

(taking $U(0) = 0$), which are only equal if $U$ is linear.

Personally, I think my returns are sublinear, since I pretty strongly want there to at least be some humans, more strongly than I want there to be more humans, though I want that as well. Furthermore, if you believe there's a chance that the universe is infinite, then you should probably be using some sort of measure over happy humans rather than just counting the number, and my best guess for what such a measure might look like seems to be at least somewhat locally sublinear.
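The comparison between the physical-risk and logical-risk utilities can be checked numerically. The sketch below is illustrative only: the values of `M` and `N` and the particular utility functions (`sqrt`, `log1p`) are assumptions chosen so that $U(0) = 0$, not anything specified in the comment.

```python
import math


def physical_risk_utility(U, M, N):
    """Half of the M branches survive for certain, each with N happy humans."""
    return U(M * N / 2)


def logical_risk_utility(U, M, N):
    """With probability 1/2 every branch is lost; otherwise all M branches survive."""
    return 0.5 * U(0) + 0.5 * U(M * N)


# Hypothetical numbers, chosen only for illustration.
M, N = 1000, 10**10

# Candidate utility functions, each satisfying U(0) = 0.
utilities = {
    "linear": lambda x: x,
    "sqrt": math.sqrt,
    "log1p": math.log1p,  # log(1 + x), a logarithmic utility shifted so U(0) = 0
}

for name, U in utilities.items():
    physical = physical_risk_utility(U, M, N)
    logical = logical_risk_utility(U, M, N)
    print(f"{name}: physical={physical:.4g}, logical={logical:.4g}")
```

Under these assumptions the linear utility is indifferent between the two risks, while both concave utilities rank the physical risk higher.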

[-]Wei Dai6y*10

So you're saying that (for example) there could be a very large universe that is running simulations of both possible worlds and impossible worlds, and therefore even if we go extinct in all possible worlds, versions of us that live in the impossible worlds could escape into the base universe so the effect of a logical risk would be similar to a physical risk of equal magnitude (if we get most of our utility from controlling/influencing such base universes). Am I understanding you correctly?

If so, I have two objections to this. 1) Some impossible worlds seem impossible to simulate. For example suppose in the actual world AI safety requires solving metaphilosophy. How would you simulate an impossible world in which AI safety doesn't require solving metaphilosophy? 2) Even for the impossible worlds that maybe can be simulated (e.g., where the trillionth digit of pi is different from what it actually is) it seems that only a subset of reasons for running simulations of possible worlds would apply to impossible worlds, so I'm a lot less sure that "logical doors" exist than I am that "quantum doors" exist.

[-]cousin_it6y*30

It seems to me that AI will need to think about impossible worlds anyway - for counterfactuals, logical uncertainty, and logical updatelessness/trade. That includes worlds that are hard to simulate, e.g. "what if I try researching theory X and it turns out to be useless for goal Y?" So "logical doors" aren't that unlikely.

[-]Rohin Shah6y20

Planned summary for the Alignment Newsletter:

This post briefly presents three ways that power can become centralized in a world with <@Comprehensive AI Services@>(@Reframing Superintelligence: Comprehensive AI Services as General Intelligence@), argues that under risk aversion "logical" risks can be more concerning than physical risks because they are more correlated, proposes combining human imitations and oracles to remove the human in the loop and become competitive, and suggests doing research to generate evidence of difficulty of a particular strand of research.

[-]Ben Pace5y10Nomination for 2019 Review

The first three examples here have been pretty helpful to me in considering how DSAs and takeoffs will go and why they may be dangerous.

[-]habryka5y10Nomination for 2019 Review

I've referred specifically to the section on "Generate evidence of difficulty" as a research purpose many times since this post came out, and while I have disagreements with it, I do really like it as a handle for a consideration that I hadn't previously seen written up, and that does strike me as quite important.

AI ALIGNMENT FORUM

The "search engine" model of AGI development

Coordination as an AGI service

Multiple simultaneous DSAs under CAIS

Logical vs physical risk aversion

Combining oracles with human imitations

"Generate evidence of difficulty" as a research purpose