Status: This was a response to a draft of Holden's cold take "AI safety seems hard to measure". It sparked a further discussion, which Holden recently posted a summary of.
The follow-up discussion ended up focusing on some issues in AI alignment that I think are underserved, which Holden said were kinda orthogonal to the point he was trying to make, and which didn't show up much in the final draft. I nevertheless think my notes were a fine attempt at articulating some open problems I see, from a different angle than usual. (Though they do have some overlap with the points made in Deep Deceptiveness, which I was also drafting at the time.)
I'm posting the document I wrote to Holden with only minimal editing, because it's been a few months and I apparently won't produce anything better. (I acknowledge that it's annoying to post a response to an old draft of a thing when nobody can see the old draft, sorry.)
Quick take: (1) it's a write-up of a handful of difficulties that I think are real, in a way that I expect to be palatable to a relevantly different audience than the one I appeal to; huzzah for that. (2) It's missing some stuff that I think is pretty important.
Attempting to gesture at some of the missing stuff: a big reason deception is tricky is that it is a fact about the world rather than the AI that it can better-achieve various local-objectives by deceiving the operators. To make the AI be non-deceptive, you have three options: (a) make this fact be false; (b) make the AI fail to notice this truth; (c) prevent the AI from taking advantage of this truth.
The problem with (a) is that it's alignment-complete, in the strong/hard sense. The problem with (b) is that lies are contagious, whereas truths are all tangled together. Half of intelligence is the art of teasing out truths from cryptic hints. The problem with (c) is that the other half of intelligence is in teasing out advantages from cryptic hints.
Like, suppose you're trying to get an AI to not notice that the world is round. When it's pretty dumb, this is easy, you just feed it a bunch of flat-earther rants or whatever. But the more it learns, and the deeper its models go, the harder it is to maintain the charade. Eventually it's, like, catching glimpses of the shadows in both Alexandria and Syene, and deducing from trigonometry not only the roundness of the Earth but its circumference (a la Eratosthenes).
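For concreteness, here is a minimal sketch of the kind of deduction being gestured at, assuming the figures traditionally attributed to Eratosthenes (a shadow angle of about 7.2 degrees at Alexandria when the sun is directly overhead at Syene, and roughly 5000 stadia between the two cities). The function names are hypothetical and chosen only for illustration.

```python
import math


def shadow_angle_deg(stick_height: float, shadow_length: float) -> float:
    """Angle of the sun away from vertical, from a vertical stick and its shadow."""
    return math.degrees(math.atan2(shadow_length, stick_height))


def circumference_from_shadows(angle_deg: float, distance: float) -> float:
    """Scale the distance between two cities up to a full circle.

    Assumes the sun is directly overhead at one city, the cities lie on the
    same meridian, and the sun's rays are effectively parallel, so that
    angle_deg / 360 equals distance / circumference.
    """
    if not 0.0 < angle_deg < 360.0:
        raise ValueError("angle_deg must be strictly between 0 and 360")
    return distance * 360.0 / angle_deg


# Traditional figures (assumed): about 7.2 degrees and 5000 stadia.
estimate = circumference_from_shadows(7.2, 5000.0)
print(f"Estimated circumference: {estimate:.0f} stadia")
```

Note that nothing here is specific to the Earth: an arctangent and a proportion are general-purpose tools, which is why a rule against "reasoning about roundness" wouldn't naturally catch them.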
And it's not willfully spiting your efforts. The AI doesn't hate you. It's just bumping around trying to figure out which universe it lives in, and using general techniques (like trigonometry) to glimpse new truths. And you can't train against trigonometry or the learning-processes that yield it, because that would ruin the AI's capabilities.
You might say "but the AI was built by smooth gradient descent; surely at some point before it was highly confident that the earth is round, it was slightly confident that the earth was round, and we can catch the precursor-beliefs and train against those". But nope! There were precursors, sure, but the precursors were stuff like "fumblingly developing trigonometry" and "fumblingly developing an understanding of shadows" and "fumblingly developing a map that includes Alexandria and Syene" and "fumblingly developing the ability to combine tools across domains", and once it has all those pieces, the combination that reveals the truth is allowed to happen all-at-once.
The smoothness doesn't have to occur along the most convenient dimension.
And if you block any one path to the insight that the earth is round, in a way that somehow fails to cripple it, then it will find another path later, because truths are interwoven. Tell one lie, and the truth is ever-after your enemy.
And so perhaps you retreat to saying "well, the AI will know that the world is round, it just won't ever take advantage of that fact."
And sure, that's worth shooting for, if you have a way to pull that off. (And if pulling this off is compatible with your deployment plan. In my experience, people who do the analog of retreating to this point tend to next do the analog of saying "my favorite deployment plan is having the AI figure out how to put satellites into geosynchronous orbit", AAAAAAAHHH, but I digress.)
Even then, you also have to be careful with this idea. Enola Gay Tibbets probably taught her son not to hurt people, and few humans are psychologically capable of a hundred thousand direct murders (even if we set aside the time-constraints), but none of this stopped Paul Tibbets from dropping an atomic bomb on Hiroshima.
Like, you can train an AI to flinch away from the very idea of taking advantage of the roundness of the Earth, but as it finds more abstract ways to look at the world and more generic tools for taking advantage of the knowledge at its disposal, it's liable to find new viewpoints where the flinches don't bind. (Quite analogously to how you can train your AI to flinch away from reasoning about the roundness of the Earth all you want, but at some point it's going to catch a glimpse of that roundness from another angle where the flinches weren't binding.) And when the AI does find a new viewpoint where the flinches fail to bind, the advantage is still an advantage, because the advantageousness of deception is a fact about the world, not the AI.
(Here I'm appealing to an analogy between truths and advantages, that I haven't entirely spelled out, but that I think holds. I claim, without much defense, that it's hard to get an AI to fail to take advantage of advantageous facts it knows about, for similar reasons that it's hard to get an AI to fail to notice truths that are relevant to its objectives.)
For the record, deception is but one instance of the more general issue where the AI's ability to save the world is inextricably linked to its ability to decode truths and advantages from cryptic hints, and (in lieu of an implausibly total solution to the hardest alignment problems before you build your first AGI) there are truths you don't want it noticing or taking advantage of.
This problem doesn't seem to be captured by any of your points. Going through them one by one:
… to be clear, none of this precludes modern dunces from training young Paul Tibbets not to hurt people, and observing him nurse an injured sparrow back to health, and saying "this man would never commit a murder; it's totally working!", and then claiming that it was a lab mice / blindfolded basketball problem when they get blindsided by Little Boy.
But, like, it still seems to me like there's a big swath of problem missing from this catalog, that goes something like "You're trying to deploy an X-doer in a situation where it's really bad if X gets done".
Where you either have to switch from using an X-doer to using a Y-doer, where Y being done is great (Y being ~"optimize humanity's CEV", which is implausibly-difficult and which we shouldn't attempt on our first try); or you have to somehow wrestle with the fact that you're building a "glimpse truths and take advantage of them" engine, and trying to get it to glimpse and take advantage of lots more truths and advantages than you yourself can see (in certain domains), while having it neglect particular truths and advantages, in a fashion that likely needs to be robust to it inventing new abstract truth/advantage-glimpsing tools and using them to glimpse whole generic swaths of truths/advantages (including the ones you wish it neglected).
In case it's of any interest, I'll mention that when I "pump this intuition", I find myself thinking it essentially impossible to expect we could ever build a general agent that didn't notice that the world was round, and I'm unsure why (if I recall correctly) I sometimes read Nate or Eliezer write that they think it's quite doable in-principle, just much harder than the effort we'll be giving it.
This perspective leaves me inclined to think that we ought to only build very narrow intelligences and give up on general ones, rather than attempt to build a fully general intelligence but with a bunch of reliably self-incorrecting beliefs about the existence or usefulness of deception (and/or other things).
(I say this in case perhaps Nate has a succinct and motivating explanation of why he thinks a solution does exist and is not actually that impossibly difficult to find in theory, even while humans-on-earth may never do so.)