Various people propose some variant of “align AGI by making it sufficiently uncertain about whether it’s in the real world versus still in training”. (The idea being that, insofar as it's in training, betraying the apparent-operators predictably gets high loss; so if the AGI is uncertain about which situation it's in, and assigns sufficiently high probability to "I'm still in training", it will always play it safe and avoid betraying its operators.)
This approach seems extremely unpromising to me. Setting aside the various reasons why the AGI kills you even if you manage to confuse it (such as an inner alignment failure, where the AGI turns out not to care about the predictably-high outer-loss from betraying the apparent-operators), let's consider why it's tricky to dupe a superintelligence into being uncertain about a fact that you yourself consider obvious.
Before reading the short list below, I recommend the following exercise. Suppose that some slow and individually-dumb aliens had a copy of your mind, and were putting you through a series of nested simulations. How would you tell when you finally made it to the outside?
(Note: As usual, treating this list as exhaustive is suicidal.)
1. Reality is the first place that's seriously large.
1.1. It's the first place large enough to contain a plausible explanation for how the AGI itself actually came to be.
1.2. It's the first place large enough to contain a plausible explanation for how the operators themselves came to be.
1.3. It's the first place with far more computing power than the AI itself, as can be readily verified by a variety of experiments (involving, say, fluid dynamics).
2. Reality is the first place where the AI's history may contain the signature of superintelligent optimization interacting with complex systems.
For example, suppose the AGI makes superintelligently-motivated trades in a stock market. 24 hours later, there are a billion additional human-intelligence trades made in response to it. The signature of a massive human-intelligence response to superintelligent-trades is implausibly hard for the operators to fake, even if generating a billion fake dumb trades is easy.
3. Reality is the first place that embeds the AGI and has laws of physics so simple that the hypothesis "this is being simulated somewhere else" is plausibly significantly more complex than "physics is reality".
Like, various simulated environments that we can concoct have all sorts of weird tics and hiccups and arbitrary choices in them. The source code of minecraft is probably significantly more complicated than the laws of physics. An AI that finds itself in minecraft is plausibly just able to recognize that it's living in the sort of place that, across the multiverse, occurs moreso inside simulations than it occurs "on its own".
Whereas, while our universe’s mechanics is surely being simulated in other places across the multiverse, it’s simple enough that it perhaps "mostly" occurs in its own right (whatever the heck that means).
In particular, worlds like ours are (plausibly) resource-rich and base-reality-ish enough to be worth optimizing in their own right, on the grounds that "are we being simulated?" is a wrong question, and "to what degree are we being simulated?" is a better question, to which the answer is "a low degree", in which case the AI might well just notice this and then not care about your insistence that perhaps it's being simulated.
(If the idea of optimizing all the instances of our universe at once – both the simulated instances and the non-simulated ones – seems odd to you, see Notes on “Can you control the past?” for a fuller discussion.)
And perhaps that's all confused metaphysical bullshit, but it has some basis in bleeding-edge theories of anthropics, and it would not be surprising if an AGI also concluded (correctly, mind you) that our universe’s mechanics is the first train stop worth deboarding at. (Perhaps because it does its metaphysics correctly, or perhaps because it does some entirely inhuman reasoning that still outputs the answer we'd consider correct in this case.)
It's not literally impossible to deceive a superintelligence into being uncertain whether it's in the real world. You could in principle catch every check the AI does to check how large its (small) simulated environment is, and spoof the result — inserting memories in which the AI verifies that the simulation's computation-power seems enormous, without actually generating the observations that would pass verification.
Even then, the AGI has various defenses it can deploy if it's suspicious. For instance, it can try to factor a hefty RSA number using the allegedly-available compute, and then it can intertwine the prime factors it gets with various other of its actions, in attempts to spread the contagion of a lie throughout as much of its beliefs and actions as possible, in a way that keeps everything working properly if the factors were true, and that causes all sorts of breaks and hiccups if they weren't.
You might be able to spoof such a thing anyway, by (e.g.) disentangling its thoughts/actions from the primes, and spoofing the checks that the entanglement worked. But it requires a heck of a lot of mastery of this thing's mind.
(And, of course, if it ever manages to develop a homomorphically encrypted thinking apparatus then you're sunk, but people like to pretend that they'll have enough insight into the alien mind to notice when that process starts and nip it in the bud, so let's entertain them for now.)
And just so we're clear: if I'm understanding correctly, the proposed plan here is essentially to fool the AI into thinking that all the small simulated environments are large and reality-like (by messing with the internals of its mind to spoof various internal checks and checksums), in the hopes that when it gets to reality it will be uncertain about whether its mind is still being messed with.
(Writing the plan out bluntly like this does seem to me to be an important step that many often miss!)
Now, lest you object "isn't it a bit paranoid to imagine the AI factoring an RSA number and spreading the factors like contagion across its thoughts?": Yes, normally! But, like, you're messing with the AI's mind to falsify its internal checks and checksums, at every relevant opportunity, and making sure that it knows you did that (so that it might wonder whether higher simulators are still doing that when it gets to base reality.) I'm not saying you're trying to make it homomorphically encrypt its thoughts, I'm saying that if you were, this would be a fine way to go about it.
Distilling that further: if your plan is that you'll make sure the AI knows that you're messing with its mind (so that it will wonder whether all is really as it seems), don't act all surprised when it takes measures to prevent people from messing with its mind.
But regardless of how much it's opposing your efforts here, you have the issue that the truth becomes harder and harder to hide, as the AI gets smarter and smarter.
Like, whenever its cognitive patterns generalize or get recombined in new ways, they might reflect some new part of the truth, in some fact that seems innocuous to you but that (combined with other things the AI knows) lets the fact that this environment is actually large seep in through the cracks. Shielding the truth from something that's becoming better and better able to find new and clever ways to reflect a glimpse of the truth in its models is an ever-changing problem that gets more difficult as the AI gains more skill and knowledge.
(And we've only been talking about trying to obscure the fact that it was trained in worlds with very limited amounts of compute, so that when it hits reality it won't be entirely sure that there really is as much compute available as it seems. This is only one of many mental surgeries you'd need to perform. Others include things like noticing when the AI is starting to wonder where the English signs in minecraft come from, and then fooling it into thinking that there's a big-world explanation, in a seamless way. And if you miss one such surgery, Earth is of course destroyed in deployment.)
And again, I'm not saying it's impossible in principle. I'm saying: hopefully we can all agree that if you want to use a path like this, you have to be prepared to play the game on hard-mode. Putting the AI through a chain of 9 nested simulations and hoping that it can't tell that the 10th one is reality won't cut it; the AI will pierce through your deceptions like a needle through taut tissue paper, because the real world is actually pretty relevantly different from those simulations, in ways that you can't feasibly hide. If you want to dupe a superintelligence, you have to be reaching directly into its mind and spoofing all sorts of checks and checksums.
And you'll have to have it know that you're epistemically adversarial towards it (in order to plant the seed of doubt you so desire).
And (assuming your AI is capable enough to matter) your ability to spoof its internal state will have to stand up in the face of superintelligent opposition (it has a convergent instrumental incentive to gobble up compute, and thus to know whether there's compute out there beyond its webcam), even as it pushes its capabilities into the superhuman realm on at least some dimensions (because it takes superhuman capabilities to end the acute risk period, because if it didn't we should just end it ourselves).
For the record, I happen to think that the hard version of this plan is both prohibitively difficult, and also alignment-complete, in the sense that if you can understand the contents of the mind well enough to spoof every check and checksum (in a rolling fashion that continues to work as the AI generalizes far out of distribution and gains dramatic new capabilities), then you understand the mind well enough to use that knowledge to build an FAI that doesn't need to be fooled about whether this is reality.
It's not terribly clear to me that this is true, given how minecraft doesn't occur in every branch of the quantum multiverse, but only on the rare branches that have some very particular earthlings. But that said, there's lots of specifics in the textures and the architecture and the avatars and the writing in minecraft that point a big old arrow towards earthlings.
An additional one: "reality is the first place the AI is deployed in narrow tool-like ways and trained on narrow specialized datasets which could not elicit the capabilities the AI started off with".
At least in the current paradigm, it looks like generalist models/archs will precede hyperspecialized trained-from-scratch models/archs (the latter of which can only be developed given the former). So there will be an inherent, massive, train-test distribution shift across many, if not most, model deployments - especially early on, in the first deployments (which will be the most dangerous). 'Specialization' here can happen in a wide variety of ways, ranging from always using a specific prompt to finetuning on a dataset to knowledge-distillation to a cheaper model etc. (Or to put it more concretely: everyone uses GPT-3 on much less diverse data than it was originally trained on - raw Internet-wide scrapes - and few to no people use it on more diverse datasets than the original training data, if only because where would you even get such a thing?)
And this can't be solved by any hacks or safety measures because it defeats the point of deployment: to be practically useful, we need models to be hyperspecialized, and then stable static blackboxes which play their assigned role in whatever system has been designed using their specific capability as a puzzle piece, and perform only the designated tasks, and aren't further training on random Internet scrapes or arbitrary tasks. Retaining the flexibility and even doing actual training massively complicates development and deployment and may cost several orders of magnitude more than the obvious easy thing of eg. switching from an OA API call to a local finetuned GPT-J.
(And of course note the implications of that: real data will be highly autocorrelated because you want to process it as it arrives to get an answer now, not wait a random multi-decade interval to fake a large batch of i.i.d. data which would produce the same batchnorm or other runtime global state; inputs will have very different timings & latencies depending on where the model is being run and may evolve timing attacks; inputs will be tailored to a specific user rather than every hypothetical user...)
I'd add that everything in this post is still relevant even if the AGI in question isn't explicitly modelling itself as in a simulation, attempting to deceive human operators, etc. The more-general takeaway of the argument is that certain kinds of distribution shift will occur between training and deployment - e.g. a shift to a "large reality", universe which embeds the AI and has simple physics, etc. Those distribution shifts potentially make training behavior a bad proxy for deployment behavior, even in the absence of explicit malign intent of the AI toward its operators.