All of CarlShulman's Comments + Replies

What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

Mainly such complete (and irreversible!) delegation to such incompetent systems being necessary or executed. If AI is so powerful that nuclear weapons are launched on hair-trigger without direction from human leadership, I expect it not to be awful at forecasting that risk.

You could tell a story where bargaining problems lead to mutual destruction, but the outcome shouldn't be very surprising on average, i.e. the AI should be telling you about it happening with calibrated forecasts.

JesseClifton (3mo): Ok, thanks for that. I'd guess then that I'm more uncertain than you about whether human leadership would delegate to systems who would fail to accurately forecast catastrophe.

It's possible that human leadership just reasons poorly about whether their systems are competent in this domain. For instance, they may observe that their systems perform well in lots of other domains, and incorrectly reason that "well, these systems are better than us in many domains, so they must be better in this one, too". Eagerness to deploy before a more thorough investigation of the systems' domain-specific abilities may be exacerbated by competitive pressures. And of course there is historical precedent for delegation to overconfident military bureaucracies.

On the other hand, to the extent that human leadership is able to correctly assess their systems' competence in this domain, it may be only because there has been a sufficiently successful AI cooperation research program. For instance, maybe this research program has furnished appropriate simulation environments to probe the relevant aspects of the systems' behavior, transparency tools for investigating cognition about other AI systems, norms for the resolution of conflicting interests and methods for robustly instilling those norms, etc., along with enough researcher-hours applying these tools to have an accurate sense of how well the systems will navigate conflict.

As for irreversible delegation: there is the question of whether delegation is in principle reversible, and the question of whether human leaders would want to override their AI delegates once war is underway. Even if delegation is reversible, human leaders may think that their delegates are better suited to wage war on their behalf once it has started, perhaps because things are simply happening too fast for them to have confidence that they could intervene without placing themselves at a decisive disadvantage.
What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

The US and China might well wreck the world by knowingly taking gargantuan risks even if both had aligned AI advisors, although I think they likely wouldn't.

But what I'm saying is really hard to do is to make the scenarios in the OP (with competition among individual corporate boards and the like) occur without extreme failure of 1-to-1 alignment (for both companies and governments). Competitive pressures are the main reason why AI systems with inadequate 1-to-1 alignment would be given long enough leashes to bring catastrophe. I would cosign Vanessa... (read more)

The US and China might well wreck the world by knowingly taking gargantuan risks even if both had aligned AI advisors, although I think they likely wouldn't.

But what I'm saying is really hard to do is to make the scenarios in the OP (with competition among individual corporate boards and the like) occur without extreme failure of 1-to-1 alignment

I'm not sure I understand yet. For example, here’s a version of Flash War that happens seemingly without either the principals knowingly taking gargantuan risks or extreme intent-alignment failure.

  1. The pri

... (read more)
What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

I think I disagree with you on the tininess of the advantage conferred by ignoring human values early on during a multi-polar take-off. I agree the long-run cost of supporting humans is tiny, but I'm trying to highlight a dynamic where fairly myopic/nihilistic power-maximizing entities end up quickly out-competing entities with other values, due to, as you say, bargaining failure on the part of the creators of the power-maximizing entities.



Right now the United States has a GDP of >$20T, US plus its NATO allies and Japan >$40T, the PRC >$14T,... (read more)

Andrew Critch (4mo): Carl, thanks for this clear statement of your beliefs. It sounds like you're saying (among other things) that American and Chinese cultures will not engage in a "race-to-the-bottom" in terms of how much they displace human control over the AI technologies their companies develop. Is that right? If so, could you give me a % confidence on that position somehow? And if not, could you clarify?

To reciprocate: I currently assign a ≥10% chance of a race-to-the-bottom on AI control/security/safety between two or more cultures this century, i.e., I'd bid 10% to buy in a prediction market on this claim if it were settlable. In more detail, I assign a ≥10% chance to a scenario where two or more cultures each progressively diminish the degree of control they exercise over their tech, and the safety of the economic activities of that tech to human existence, until an involuntary human extinction event. (By comparison, I assign at most around a ~3% chance to a unipolar "world takeover" event, i.e., I'd sell at 3%.)

I should add that my numbers for both of those outcomes are down significantly from ~3 years ago, due to cultural progress in CS/AI (see this ACM blog post [https://acm-fca.org/2018/03/29/negativeimpacts/]) allowing more discussion of (and hence preparation for) negative outcomes, and due to government pressures to regulate the tech industry.
Another (outer) alignment failure story

I think they are fighting each other all the time, though mostly in very prosaic ways (e.g. McDonald's and Burger King's marketing AIs are directly competing for customers). Are there some particular conflicts you imagine that are suppressed in the story?

 

I think the one that stands out the most is 'why isn't it possible for some security/inspector AIs to get a ton of marginal reward by whistleblowing against the efforts required for a flawless global camera grab?' I understand the scenario says it isn't because the demonstrations are incomprehensible, but why/how?

Paul Christiano (4mo): Yes, if demonstrations are comprehensible then I don't think you need much explicit AI conflict to whistleblow, since we will train some systems to explain risks to us. The global camera grab must involve plans that aren't clearly bad to humans even when all the potential gotchas are pointed out.

For example, they may involve dynamics that humans just don't understand, or where a brute-force simulation or experiment would be prohibitively expensive without leaps of intuition that machines can make but humans cannot. Maybe that's about tiny machines behaving in complicated ways or being created covertly, or crazy complicated dynamics of interacting computer systems that humans can't figure out. It might involve the construction of new AI-designed AI systems which operate in different ways, whose function we can't really constrain except by seeing predictions of their behavior from an even-greater distance (machines which are predicted to lead to good-looking outcomes, which have been able to exhibit failures to us if so incentivized, but which are even harder to control). (There is obviously a lot you could say about all the tools at the humans' disposal to circumvent this kind of problem.)

This is one of the big ways in which the story is more pessimistic than my default, and perhaps the highlighted assumptions rule out the most plausible failures, especially (i) multi-year takeoff, (ii) reasonable competence on behalf of the civilization, (iii) "correct" generalization. Even under those assumptions I do expect events to eventually become incomprehensible in the necessary ways, but it feels more likely that there will be enough intervening time for ML systems to e.g. solve alignment or help us shift to a new world order or whatever. (And as I mention, in the worlds where the ML systems can't solve alignment well enough in the intervening time, I do agree that it's unlikely we can solve it in advance.)
2019 AI Alignment Literature Review and Charity Comparison
MIRI researchers contributed to the following research led by other organisations
MacAskill & Demski's A Critique of Functional Decision Theory

This seems like a pretty weird description of Demski replying to MacAskill's draft.

Ben Pace (2y): I also thought so. I wondered whether Larks was describing MacAskill incorporating Demski's comments-on-a-draft into the post.
What failure looks like

OK, thanks for the clarification!

My own sense is that the intermediate scenarios are unstable: if we have fairly aligned AI, we immediately use it to make more aligned AI and collectively largely reverse things like Facebook click-maximization manipulation. If we have lost the power to reverse things, then they go all the way to near-total loss of control over the future. So I would tend to think we wind up in the extremes.

I could imagine a scenario where there is a close balance among multiple centers of AI+human power, and some but not all of those centers... (read more)

What failure looks like
Failure would presumably occur before we get to the stage of "robot army can defeat unified humanity"---failure should happen soon after it becomes possible, and there are easier ways to fail than to win a clean war. Emphasizing this may give people the wrong idea, since it makes unity and stability seem like a solution rather than a stopgap. But emphasizing the robot army seems to have a similar problem---it doesn't really matter whether there is a literal robot army, you are in trouble anyway.

I agree other powerful tools can achieve the s... (read more)

I do agree there was a miscommunication about the end state, and that language like "lots of obvious destruction" is an understatement.

I do still endorse "military leaders might issue an order and find it is ignored" (or total collapse of society) as basically accurate and not an understatement.

What failure looks like
I think we can probably build systems that really do avoid killing people, e.g. by using straightforward versions of "do things that are predicted to lead to videos that people rate as acceptable," and that at the point when things have gone off the rails those videos still look fine (and to understand that there is a deep problem at that point you need to engage with complicated facts about the situation that are beyond human comprehension, not things like "are the robots killing people?"). I'm not visualizing the case where no
... (read more)

My median outcome is that people solve intent alignment well enough to avoid catastrophe. Amongst the cases where we fail, my median outcome is that people solve enough of alignment that they can avoid the most overt failures, like literally compromising sensors and killing people (at least for a long subjective time), and can build AIs that help defend them from other AIs. That problem seems radically easier---most plausible paths to corrupting sensors involve intermediate stages with hints of corruption that could be recognized by a weaker AI (and hence ... (read more)

What failure looks like

I think the kind of phrasing you use in this post and others like it systematically misleads readers into thinking that in your scenarios there are no robot armies seizing control of the world (or rather, that all armies worth anything at that point are robotic, and so AIs in conflict with humanity means military force that humanity cannot overcome). I.e. AI systems pursuing badly aligned proxy goals or influence-seeking tendencies wind up controlling or creating that military power and expropriating humanity (which eventually couldn't fight back ther... (read more)

I agree that robot armies are an important aspect of part II.

In part I, where our only problem is specifying goals, I don't actually think robot armies are a short-term concern. I think we can probably build systems that really do avoid killing people, e.g. by using straightforward versions of "do things that are predicted to lead to videos that people rate as acceptable," and that at the point when things have gone off the rails those videos still look fine (and to understand that there is a deep problem at that point you need to engage wit... (read more)

The Vox article also mistakes the source of influence-seeking patterns to be about social influence rather than 'systems that try to increase in power and numbers tend to do so, so are selected for if we accidentally or intentionally produce them and don't effectively weed them out; this is why living things are adapted to survive and expand; such desires motivate conflict with humans when power and reproduction can be obtained by conflict with humans, which can look like robot armies taking control'.

Yes, I agree the Vox article made this mistake.... (read more)

Two Neglected Problems in Human-AI Safety

I think this is under-discussed, but also that I have seen many discussions in this area. E.g. I have seen it come up and brought it up in the context of Paul's research agenda, where success relies on humans being able to play their part safely in the amplification system. Many people say they are more worried about misuse than accident on the basis of the corruption issues (and much discussion about CEV and idealization, superstimuli, etc addresses the kind of path-dependence and adversarial search you mention).

However, those varied problems mostly ... (read more)

Wei Dai (3y): I agree with all of this, but I don't think it addresses my central point/question. (I'm not sure if you were trying to, or just making a more tangential comment.)

To rephrase: it seems to me that 'ML safety problems in humans' is a natural/obvious framing that makes clear that alignment to human users/operators is likely far from sufficient to ensure the safety of human-AI systems, that in some ways corrigibility is actually opposed to safety, and that there are likely technical angles of attack on these problems. It seems surprising that someone like me had to point out this framing to people who are intimately familiar with ML safety problems, and also surprising that they largely respond with silence.