I'm not really thinking about provable guarantees per se. I'm just thinking about how to point to the AI's concept of human values - directly point to it, not point to some proxy of it, because proxies break down under optimization pressure.
(Rough heuristic here: it is not possible to point directly at an abstract object in the territory. Even though a territory often supports certain natural abstractions, which are instrumentally convergent to learn/use, we still can't unambiguously point to that abstraction in the territory - only in the map.)
A proxy is probably good enough for a lot of applications at small scale with few corner cases. And if we're doing something like "train the AI to follow your instructions", then a proxy is exactly what we'll get. But if you want, say, an AI which "tries to help" - as opposed to e.g. an AI which tries to look like it's helping - then that means pointing directly to human values, not to a proxy.
Now, it is possible that we could train an AI against a proxy, and it would end up pointing to actual human values instead, simply due to imperfect optimization during training. I think that's what you have in mind, and I do think it's plausible, even if it sounds a bit crazy. Of course, without better theoretical tools, we still wouldn't have a way to directly check even in hindsight whether the AI actually wound up pointing to human values or not. (Again, not talking about provable guarantees here, I just want to be able to look at the AI's own internal data structures and figure out (a) whether it has a notion of human values, and (b) whether it's actually trying to act in accordance with them, or just with something correlated with them.)
Yes, I would call that the central problem. (Though it would also be fine to build a pointer to a human and have the AI "help the human", without necessarily pointing to human values.)
How would we do either of those things without a workable theory of embedded agency, abstraction, some idea of what kind of structure human values have, etc?
I'm going to start a fresh thread on this, it sounds more interesting (at least to me) than most of the other stuff being discussed here.
No. The whole notion of a human "meaning things" presumes a certain level of abstraction. One could imagine an AI simply reasoning about molecules or fields (or at least individual neurons), without having any need for viewing certain chunks of matter as humans who mean things. In principle, no predictive power whatsoever would be lost in that view of the world.
That said, I do think that problem is less central/immediate than the problem of taking an AI which does know what we mean, and pointing at that AI's concept-of-what-we-mean - i.e. in order to program the AI to do what we mean. Even if an AI learns a concept of human values, we still need to be able to point to that concept within the AI's concept-space in order to actually align it - and that means translating between AI-notion-of-what-we-want and our-notion-of-what-we-want.
But why must it have a confusing interface? Couldn't you just talk to it, and it would know what you mean?
That's where the Don Norman part comes in. Interfaces to complicated systems are confusing by default. The general problem of systematically building non-confusing interfaces is, in my mind at least, roughly equivalent to the full technical problem of AI alignment. (Writing a program which knows what you mean is also, in my mind, roughly equivalent to the full technical problem of AI alignment.) A wording which makes it more obvious:
Something like e.g. tool AI puts more of the translation burden on the human, rather than on the AI, but that doesn't make the translation itself any less difficult.
In a non-foomy world, the translation doesn't have to be perfect - humanity won't be wiped out if the AI doesn't quite perfectly understand what we mean. Extreme capabilities make high-quality translation more important, not just because of Goodhart, but because the translation itself will break down in scenarios very different from what humans are used to. So if the AI has the capabilities to achieve scenarios very different from what humans are used to, then that translation needs to be quite good.
Though I still have questions, like: how exactly did someone stumble upon the correct mathematical principles underlying intelligence by trial and error?
You mentioned that, conditional on foom, you'd be confused about what the world looks like. Is this the main thing you're confused about in foom worlds, or are there other major things too?
Cool, just wanted to make sure I'm engaging with the main argument here. With that out of the way...
Looking at both the first and third point, I suspect that a sub-crux might be expectations about the resource requirements (i.e. compute & data) needed for AGI. I expect that, once we have the key concepts, human-level AGI will be able to run in realtime on an ordinary laptop. (Training might require more resources, at least early on. That would reduce the unilateralist problem, but increase the chance of decisive strategic advantage due to the higher barrier to entry.)
EDIT: to clarify, those second two points are both conditioned on foom. Point being, the only thing which actually matters here is foom vs no foom:
Part of the problem is that we have a really strong unilateralist's curse. It only takes one person, or a few, who don't realize the problem to make something really dangerous.
This is a foom-ish assumption; remember that Rohin is explicitly talking about a non-foom scenario.
While I might agree with the three options at the bottom, I don't agree with the reasoning to get there.
Abstractions are pretty heavily determined by the territory. Humans didn't look at the world and pick out "tree" as an abstract concept because of a bunch of human-specific factors. "Tree" is a recurring pattern on earth, and even aliens would notice that same cluster of things, assuming they paid attention. Even on the empathic front, you don't need a human-like mind in order to notice the common patterns of human behavior (in humans) which we call "anger" or "sadness".
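A toy illustration of that point (my own sketch, not from the thread): if the territory itself contains well-separated clusters, then essentially any observer running a clustering procedure will recover the same clusters, regardless of where they start looking. Here, k-means runs from five different random initializations on the same data and converges to the same cluster centers every time - the "abstraction" is pinned down by the data, not by the observer.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two well-separated blobs in 2D: the "natural" clusters in this toy territory.
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(200, 2))
blob_b = rng.normal(loc=[10.0, 10.0], scale=0.5, size=(200, 2))
data = np.vstack([blob_a, blob_b])

def kmeans(points, k, seed, iters=50):
    """Plain k-means: random init from the data, then alternate
    nearest-center assignment and center updates."""
    local_rng = np.random.default_rng(seed)
    centers = points[local_rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its assigned points.
        centers = np.array([points[labels == j].mean(axis=0) for j in range(k)])
    # Sort so that runs with different label orderings are comparable.
    return np.sort(centers, axis=0)

# Five different observers (random seeds), same territory.
runs = [kmeans(data, k=2, seed=s) for s in range(5)]
spread = max(np.abs(runs[0] - r).max() for r in runs[1:])
print(spread < 1e-6)  # prints True: all runs found the same centers
```

Nothing about this depends on the observer sharing anything with us beyond paying attention to the data; the analogy to "tree" (or "anger") is of course loose, but the direction of determination - territory first, map second - is the same.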