Joe Carlsmith

Senior research analyst at Open Philanthropy. Recently completed a doctorate in philosophy at the University of Oxford. Opinions my own.

Wiki Contributions


Rohin is correct. In general, I meant for the report's analysis to apply to basically all of these situations (e.g., both inner and outer-misaligned, both multi-polar and unipolar, both fast take-off and slow take-off), provided that the misaligned AI systems in question ultimately end up power-seeking, and that this power-seeking leads to existential catastrophe. 

It's true, though, that some of my discussion was specifically meant to address the idea that absent a brain-in-a-box-like scenario, we're fine. Hence the interest in e.g. deployment decisions, warning shots, and corrective mechanisms.

Mostly personal interest on my part (I was working on a blog post on the topic, now up), though I do think that the topic has broader relevance.

Hi Koen, 

Glad to hear you liked section 4.3.3. And thanks for pointing to these posts -- I certainly haven't reviewed all the literature, here, so there may well be reasons for optimism that aren't sufficiently salient to me.

Re: black boxes, I do think that black-box systems that emerge from some kind of evolution/search process are more dangerous; but as I discuss in 4.4.1, I also think that the bare fact that the systems are much more cognitively sophisticated than humans creates significant and safety-relevant barriers to understanding, even if the system has been designed/mechanistically understood at a different level.

Re: “there is a whole body of work which shows that evolved systems are often power-seeking” -- anything in particular you have in mind here?

Hi Daniel, 

Thanks for taking the time to clarify. 

One other factor for me, beyond those you quote, is the “absolute” difficulty of ensuring practical PS-alignment, e.g. (from my discussion of premise 3):

Part of this uncertainty has to do with the “absolute” difficulty of achieving practical PS-alignment, granted that you can build APS systems at all. A system’s practical PS-alignment depends on the specific interaction between a number of variables -- notably, its capabilities (which could themselves be controlled/limited in various ways), its objectives (including the time horizon of the objectives in question), and the circumstances it will in fact exposed to (circumstances that could involve various physical constraints, monitoring mechanisms, and incentives, bolstered in power by difficult-to-anticipate future technology, including AI technology). I expect problems with proxies and search to make controlling objectives harder; and I expect barriers to understanding (along with adversarial dynamics, if they arise pre-deployment) to exacerbate difficulties more generally; but even so, it also seems possible to me that it won’t be “that hard” (by the time we can build APS systems at all) to eliminate many tendencies towards misaligned power-seeking (for example, it seems plausible to me that selecting very strongly against (observable) misaligned power-seeking during training goes a long way), conditional on retaining realistic levels of control over a system’s post-deployment capabilities and circumstances (though how often one can retain this control is a further question).

My sense is that relative to you, I am (a) less convinced that ensuring practical PS-alignment will be “hard” in this absolute sense, once you can build APS systems at all (my sense is that our conceptions of what it takes to “solve the alignment problem” might be different), (b) less convinced that practically PS-misaligned systems will be attractive to deploy despite their PS-misalignment (whether because of deception, or for other reasons), (c) less convinced that APS systems becoming possible/incentivized by 2035 implies “fast take-off” (it sounds like you’re partly thinking: those are worlds where something like the scaling hypothesis holds, and so you can just keep scaling up; but I don’t think the scaling hypothesis holding to an extent that makes some APS systems possible/financially feasible implies that you can just scale up quickly to systems that can perform at strongly superhuman levels on e.g. ~any task, whatever the time horizons, data requirements, etc), and (d) more optimistic about something-like-present-day-humanity’s ability to avoid/prevent failures at a scale that disempower ~all of humanity (though I do think Covid, and its policitization, an instructive example in this respect), especially given warning shots (and my guess is that we do get warning shots both before or after 2035, even if APS systems become possible/financially feasible before then).

Re: nuclear winter, as I understand it, you’re reading me as saying: “in general, if a possible and incentivized technology is dangerous, there will be warning shots of the dangers; humans (perhaps reacting to those warning shots) won’t deploy at a level that risks the permanent extinction/disempowerment of ~all humans; and if they start to move towards such disempowerment/extinction, they’ll take active steps to pull back.” And your argument is: “if you get to less than 10% doom on this basis, you’re going to give too low probabilities on scenarios like nuclear winter in the 20th century.” 

I don’t think of myself as leaning heavily on an argument at that level of generality (though maybe there’s a bit of that). For example, that statement feels like it’s missing the “maybe ensuring practical PS-alignment just isn’t that hard, especially relative to building practically PS-misaligned systems that are at least superficially attractive to deploy” element of my own picture. And more generally, I expect to say different things about e.g. biorisk, climate change, nanotech, etc, depending on the specifics, even if generic considerations like “humans will try not to all die” apply to each.

Re: nuclear winter in particular, I’d want to think a bit more about what sort of probability I’d put on nuclear winter in the 20th century (one thing your own analysis skips is the probability that a large nuclear conflict injects enough smoke into the stratosphere to actually cause nuclear winter, which I don’t see as guaranteed -- and we’d need to specify what level of cooling counts). And nuclear winter on its own doesn’t include a “scaling to the permanent disempowerment/extinction of ~all of humanity” step -- a step that, FWIW, I see as highly questionable in the nuclear winter case, and which is important to my own probability on AI doom (see premise 5). And there are various other salient differences: for example, mutually assured destruction seems like a distinctly dangerous type of dynamic, which doesn’t apply to various AI deployment scenarios; nuclear weapons have widespread destruction as their explicit function, whereas most AI systems won’t; and so on. That said, I think comparisons in this vein could still be helpful; and I’m sympathetic to points in the vein of “looking at the history of e.g. climate, nuclear risk, BSL-4 accidents, etc the probability that humans will deploy technology that risks global catastrophe, and not stop doing so even after getting significant evidence about the risks at stake, can’t be that low” (I talk about this a bit in 4.4.3 and 6.2).

Hi Daniel, 

Thanks for reading. I think estimating p(doom) by different dates (and in different take-off scenarios) can be a helpful consistency check, but I disagree with your particular “sanity check” here -- and in particular, premise (2). That is, I don’t think that conditional on APS-systems becoming possible/financially feasible by 2035, it’s clear that we should have at least 50% on doom (perhaps some of disagreement here is about what it takes for the problem to be "real," and to get "solved"?). Nor do I see 10% on “Conditional it being both possible and strongly incentivized to build APS systems, APS systems will end up disempowering approximately all of humanity” as obviously overconfident (though I do take some objections in this vein seriously). I’m not sure exactly what “10% on nuclear war” analog argument you have in mind: would you be able to sketch it out, even if hazily?

Aren't they now defined in terms of each other? 

"Intent alignment: An agent is intent aligned if its behavioral objective is outer aligned.

Outer alignment: An objective function  is outer aligned if all models that perform optimally on  in the limit of perfect training and infinite data are intent aligned."

Thanks for writing this up. Quick question re: "Intent alignment: An agent is intent aligned if its behavioral objective is aligned with humans." What does it mean for an objective to be aligned with humans, on your view? You define what it is for an agent to be aligned with humans, e.g.: "An agent is aligned (with humans) if it doesn't take actions that we would judge to be bad/problematic/dangerous/catastrophic." But you don't say explicitly what it is for an objective to be aligned: I'm curious if you have a preferred formulation.

Is it something like: “the behavioral objective is such that, when the agent does ‘well’ on this objective, the agent doesn’t act in a way we would view as bad/problematic/dangerous/catastrophic." If so, it seems like a lot might depend on exactly how “well” the agent does, and what opportunities it has in a given context. That is, an “aligned” agent might not stay aligned if it becomes more powerful, but continues optimizing for the same objective (for example, a weak robot optimizing for beating me at chess might be "aligned" because it only focuses on making good chess moves, but a stronger one might not be, because it figures out how to drug my tea). Is that an implication you’d endorse? 

Or is the thought something like: "the behavioral objective such that, no matter how powerfully the agent optimizes for it, and no matter its opportunities for action, it doesn't take actions we would view as bad/problematic/dangerous/catastrophic"? My sense is that something like this is often the idea people have in mind, especially in the context of anticipating things like intelligence explosions. If this is what you have in mind, though, maybe worth saying so explicitly, since intent alignment in this sense seems like a different constraint than intent alignment in the sense of e.g. "the agent's pursuit of its behavioral objective does not in fact give rise to bad actions, given the abilities/contexts/constraints that will in fact be relevant to its behavior."