All of Joe Carlsmith's Comments + Replies

Draft report on existential risk from power-seeking AI

Hi Koen, 

Glad to hear you liked section 4.3.3. And thanks for pointing to these posts -- I certainly haven't reviewed all the literature, here, so there may well be reasons for optimism that aren't sufficiently salient to me.

Re: black boxes, I do think that black-box systems that emerge from some kind of evolution/search process are more dangerous; but as I discuss in 4.4.1, I also think that the bare fact that the systems are much more cognitively sophisticated than humans creates significant and safety-relevant barriers to understanding, even if the...

1 Koen Holtman 10d: For AI specific work, the work by Alex Turner mentioned elsewhere in this comment section comes to mind, as backing up a much larger body of reasoning-by-analogy work, like Omohundro (2008). But the main thing I had in mind when making that comment, frankly, was the extensive literature on kings and empires. In broader biology, many genomes/organisms (bacteria, plants, etc) will also tend to expand to consume all available resources, if you put them in an environment where they can, e.g. without balancing predators.

Clarifying inner alignment terminology

Aren't they now defined in terms of each other? 

"Intent alignment: An agent is intent aligned if its behavioral objective is outer aligned.

Outer alignment: An objective function is outer aligned if all models that perform optimally on it in the limit of perfect training and infinite data are intent aligned."

2 Evan Hubinger 3mo: Good point -- and I think that the reference to intent alignment is an important part of outer alignment, so I don't want to change that definition. I further tweaked the intent alignment definition a bit to just reference optimal policies rather than outer alignment.

Clarifying inner alignment terminology

Thanks for writing this up. Quick question re: "Intent alignment: An agent is intent aligned if its behavioral objective is aligned with humans." What does it mean for an objective to be aligned with humans, on your view? You define what it is for an agent to be aligned with humans, e.g.: "An agent is aligned (with humans) if it doesn't take actions that we would judge to be bad/problematic/dangerous/catastrophic." But you don't say explicitly what it is for an objective to be aligned: I'm curious if you have a preferred formulation.

Is it something like: “...

2 Evan Hubinger 3mo: Maybe the best thing to use here is just the same definition as I gave for outer alignment -- I'll change it to reference that instead.