All of Joe Carlsmith's Comments + Replies

Distinguishing AI takeover scenarios

Rohin is correct. In general, I meant for the report's analysis to apply to basically all of these situations (e.g., both inner- and outer-misaligned, both multipolar and unipolar, both fast take-off and slow take-off), provided that the misaligned AI systems in question ultimately end up power-seeking, and that this power-seeking leads to existential catastrophe.

It's true, though, that some of my discussion was specifically meant to address the idea that, absent a brain-in-a-box-like scenario, we're fine. Hence the interest in e.g. deployment decisions, warning shots, and corrective mechanisms.

MIRI/OP exchange about decision theory

Mostly personal interest on my part (I was working on a blog post on the topic, now up), though I do think that the topic has broader relevance.

Draft report on existential risk from power-seeking AI

Hi Koen, 

Glad to hear you liked section 4.3.3. And thanks for pointing to these posts -- I certainly haven't reviewed all the literature here, so there may well be reasons for optimism that aren't sufficiently salient to me.

Re: black boxes, I do think that black-box systems that emerge from some kind of evolution/search process are more dangerous; but as I discuss in 4.4.1, I also think that the bare fact that the systems are much more cognitively sophisticated than humans creates significant and safety-relevant barriers to understanding, even if the...

Koen Holtman (reply): For AI-specific work, the work by Alex Turner mentioned elsewhere in this comment section comes to mind, as backing up a much larger body of reasoning-by-analogy work, like Omohundro (2008). But the main thing I had in mind when making that comment, frankly, was the extensive literature on kings and empires. In broader biology, many genomes/organisms (bacteria, plants, etc.) will also tend to expand to consume all available resources if you put them in an environment where they can, e.g. without balancing predators.

Clarifying inner alignment terminology

Aren't they now defined in terms of each other? 

"Intent alignment: An agent is intent aligned if its behavioral objective is outer aligned.

Outer alignment: An objective function r is outer aligned if all models that perform optimally on r in the limit of perfect training and infinite data are intent aligned."
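
Schematically -- using my own shorthand rather than notation from the post, with obj_beh(A) for A's behavioral objective and Opt(r) for the set of models that perform optimally on r in the limit of perfect training and infinite data -- the two definitions read:

$$
\begin{aligned}
\text{IntentAligned}(A) &\iff \text{OuterAligned}\big(\mathrm{obj}_{\mathrm{beh}}(A)\big) \\
\text{OuterAligned}(r) &\iff \forall M \in \mathrm{Opt}(r):\ \text{IntentAligned}(M)
\end{aligned}
$$

so each predicate unpacks into the other, with no independent base case.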

Evan Hubinger (reply): Good point -- and I think that the reference to intent alignment is an important part of outer alignment, so I don't want to change that definition. I further tweaked the intent alignment definition a bit to just reference optimal policies rather than outer alignment.

Clarifying inner alignment terminology

Thanks for writing this up. Quick question re: "Intent alignment: An agent is intent aligned if its behavioral objective is aligned with humans." What does it mean for an objective to be aligned with humans, on your view? You define what it is for an agent to be aligned with humans, e.g.: "An agent is aligned (with humans) if it doesn't take actions that we would judge to be bad/problematic/dangerous/catastrophic." But you don't say explicitly what it is for an objective to be aligned: I'm curious if you have a preferred formulation.

Is it something like: “...

Evan Hubinger (reply): Maybe the best thing to use here is just the same definition as I gave for outer alignment -- I'll change it to reference that instead.