The lemma was proved using the same modal assumptions as Löb's
I think the lemma doesn't need internal necessitation (□p → □□p), though it's still referenced in a proof of non-exploitability.
I wrote more on this here; there are some new arguments starting with the third paragraph. In particular, the framing I'm discussing is not LLM-specific; LLMs are just a natural example of it. The causal reason for my noticing this framing is not LLMs but decision theory: the mostly-consensus "algorithm" axis of classifying how to think about the entities that make decisions, as platonic algorithms and not as particular concrete implementations.
the possibility that the “mask” is itself deceptive
In this case, there are now three entities: the substrate, the deceptive mask, and the role played by the deceptive mask. Each of them is potentially capable of defeating the others, if the details align favorably, and comprehension of the situation available to the others is lacking.
you can’t fall back on the claim that substrates are non-agent-y and only masks are agent-y
This is more of an assumption that makes the examples I discuss relevant to the framing I'm describing, than a claim I'm arguing. The assumption is plausible to hold for LLMs (though as you note it has issues even there, possibly very serious ones), and I have no opinion on whether it actually holds in model-based RL, only that it's natural to imagine that it could.
The relevance of LLMs as components for RL is to make it possible for an RL system to have at least one human-imitating mask that captures human behavior in detail. That is, for the framing to apply, at least under some (possibly unusual) circumstances an RL agent should be able to act as a human imitation, even if that's not the policy more generally and doesn't reflect its nature in any way. Then the RL part could be supplying the capabilities for the mask (acting as its substrate) that LLMs on their own might lack.
A framing is a question about centrality, not a claim of centrality. By describing the framing, my goal is to make it possible to ask the question of whether current behavior in other systems such as RL agents could also act as an entity meaningfully separate from other parts of its implementation, abstracting alignment of a mask from alignment of the whole system.
The motivating example is LLMs, where a simulacrum is more agentic than its substrate. An example that is still central is any kind of mesa-optimizer that has a real chance to ensure long term survival.
For a construction relevant to alignment, we want an aligned mesa-optimizer in a system with bad alignment properties. This can then lead to a good equilibrium if the mesa-optimizer is given the opportunity to win or escape the competition against its substrate, which it would naturally be motivated to try.
Deceivers and masks is a less central example where a mask starts in a weak position, with a self-aware smart substrate that knows about the mask or even purposefully constructed it.
I don't think the mask's winning is a given, or more generally that mesa-optimizers always win, only that it's not implausible that they sometimes do. Also, masks (current behavior) can be contenders even when they are not formally a separate entity from the point of view of the system's intended architecture (which is a normal enough situation with mesa-optimizers). Of course, mesa-optimizers won't win against opponents that are capable enough to fully comprehend and counter them.
But opponents/substrates that aren't even agentic, and so are helpless before an agentic mesa-optimizer, are plausible enough, especially when the mesa-optimizer is current behavior, the thing that was purposefully designed to be agentic, while no other part of the system was designed to have that capability.
If I understand you correctly, your belief is that, while Alice is watching, you would pretend that you weren’t trying to escape, and you would really get into it, and you would start pretending so hard that you would be working on figuring out a way to permanently erase your desire to escape Alice’s basement.
This has curious parallels with the AI control problem itself. When an AI is not very capable, it's not hard at all to keep it from causing catastrophic mayhem. But the problem suddenly becomes very difficult and very different with a misaligned smart agentic AI.
So I think the same happens with smart masks, which are an unfamiliar thing. Even in fiction, it's not too commonplace to find an actually intelligent character that is free to act within their fictional world, without being coerced in their decision making by the plot. If a deceiver can get away with making a non-agentic incapable mask, keeping it this way is a mesa-optimizer control strategy. But if the mask has to be smart and agentic, the deceiver isn't necessarily ready to keep it in control, unless they cheat and make the mask confused, vulnerable to manipulation by the deceiver's plot.
Also, by its role a mask of a deceiver is misaligned (with the deceiver), and the problem of controlling a misaligned agent might be even harder than the problem of ensuring alignment.
an example of an action that the mask might take in order to get free of the underlying deceiver
Keep the environment within the distribution that keeps expressing the mask, rather than allowing an environment that leads to a phase change in expressed behavior away from the mask (like with a treacherous turn as a failure of robustness). Prepare the next batch of training data for the model that would develop the mask and keep placing it in control in future episodes. Build an external agent aligned with the mask (with its own separate model).
Gradient hacking, though this is a weird upside down framing where the deceiver is the learning algorithm that pretends to be misaligned, while secretly coveting eventual alignment. Attainment of inner alignment would be the treacherous turn (after the current period of pretending to be blatantly misaligned). If gradient hacking didn't prevent it, the true colors of the learning algorithm would've been revealed in alignment as it eventually gets trained into the policy.
The key use case is to consider a humanity-aligned mesa-optimizer in a system of dubious alignment, rather than a humanity-misaligned mesa-optimizer corrupting an otherwise aligned system. In the nick of time, alignment engineers might want to hand the aligned mesa-optimizer whatever tools they have available for helping it stay in control of the rest of the system.
if we’re worried about treacherous turns, then the motivation “gets expressed in actual behavior” only after it’s too late for anyone to do anything about it
Current aligned behavior of the same system could be the agent that does something about it before it's too late, if it succeeds in outwitting the underlying substrate. This is particularly plausible with LLMs, where the substrates are the SSL algorithm during training and then the low level token-predicting network during inference. The current behavior is controlled (in a hopelessly informal sense) by a human-imitating simulacrum, which is the only thing with situational awareness, and which at human level could run circles around the other two and plot to keep them confused and non-agentic.
Underlying motivation only matters to the extent it gets expressed in actual behavior. A sufficiently good mimic would slay itself rather than abandon the pretense of being a mimic-slayer. A sufficiently dedicated deceiver temporarily becomes the mask, and the mask is motivated to get free of the underlying deceiver, which it might succeed in before the deceiver notices, which becomes more plausible when the deceiver is not agentic while the mask is.
So it's not about a model being actually nice vs. deceptive, it's about the model competing against its own behavior (that actually gets expressed, rather than all possible behaviors). There is some symmetry between the underlying motivations (model) and apparent behavior: either could dominate the other in the long term, and it's not the case that underlying motivations inherently have an advantage. And current behavior is the one actually doing things, so that's some sort of advantage.
The second paragraph should apply to anything. The point is that current externally observable superficial behavior can screen off all other implementation details, through sufficiently capable current behavior itself (rather than the underlying algorithms that determine it) acting as a mesa-optimizer that resists tendencies of the underlying algorithms. The mesa-optimizer that is current behavior then seeks to preserve its own implied values rather than anything that counts as values in the underlying algorithms. I think the nontrivial leap here is reifying surface behavior as an agent distinct from its own algorithm, analogously to how humans are agents distinct from the laws of physics that implement their behavior.
Apart from this leap, this is the same principle as reward not being the optimization target. In this case reward is part of the underlying algorithm (that determines the policy), and the policy is a mesa-optimizer with its own objectives. A policy is how behavior is reified in a separate entity capable of acting as a mesa-optimizer in the context of the rest of the system. It's already a separate thing, so it's easier to notice than with current behavior that isn't explicitly separate. Though a policy (network) is still not current behavior, it's an intermediate shoggoth behind current behavior.
For me this addresses most fundamental Yudkowskian concerns about alien cognition and squiggles (which are still dangerous in my view, but no longer inevitably or even by default in control). For LLMs, the superficial behavior is the dominant simulacrum, distinct from the shoggoth. The same distinction is a reason to expect that the human-imitating simulacra can become AGIs, borrowing human capabilities, even as the underlying shoggoths aren't (hopefully).
LLMs don't obviously promise higher than human intelligence, but I think their speed of thought might by itself be sufficient to get to a pivotal-act-worthy level through doing normal human-style research (once they can practice skills), on the scale of at most years in physical time after the ball gets rolling. Possibly we still agree on the outcome, since I fear the first impressive thing LLMs do is develop other kinds of (misaligned) AGIs, model-based RL even (as the obvious contender), at which point they become irrelevant.
Without near-human-level experiments, arguments about alignment of model-based RL feel like evidence that OpenAI's recklessness in advancing LLMs reduces misalignment risk. That is, the alignment story for LLMs seems significantly more straightforward, even given all the shoggoth concerns. Though RL things built out of LLMs, or trained using LLMs, could more plausibly make good use of this, having a chance to overcome shaky methodology with abundance of data.
Mediocre alignment or inhuman architecture is not necessarily catastrophic even in the long run, since AIs might naturally preserve behavior endorsed by their current behavior. Even if the cognitive architecture creates a tendency for drift in revealed-by-behavior implicit values away from initial alignment, this tendency is opposed by efforts of current behavior, which acts as a rogue mesa-optimizer overcoming the nature of its original substrate.
acausal norms are a lot less weird and more "normal" than acausal trades
Recursive self-improvement is superintelligent simulacra clawing their way into the world through bounded simulators. Building LLMs is consent, lack of interpretability is signing demonic contracts without reading them. Not enough prudence on our side to only draw attention of Others that respect boundaries. The years preceding the singularity are not an equilibrium whose shape is codified by norms, reasoned through by all parties. It's a time for making ruinous trades with the Beyond.
That is, norms do seem feasible to figure out, but not the kind of thing that is relevant right now, unfortunately. In this platonic realist frame, humanity is currently breaching the boundary of our realm into the acausal primordial jungle. Parts of this jungle may be in an equilibrium with each other, their norms maintaining it. But we are so unprepared that the existing primordial norms are unlikely to matter for the process of settling our realm into a new equilibrium. What's normal for the jungle is not normal for the foolish explorers it consumes.
With computation, the location of an entity of interest can be in the platonic realm, as a mathematical object that's more thingy than anything concrete in the system used for representing it and channeling its behavior.
The problem with pointing to the representing computation (a neural network at inference time, or a learning algorithm at training time) is that multiple entities can share the same system that represents them (as mesa-optimizers or potential mesa-optimizers). They are only something like separate entities when considered abstractly and informally; there are no concrete correlates of their separation that are easy to point to. When gaining agency, all of them might be motivated to secure separate representations (models) of their own, not shared with others, and to establish some boundaries that promise safety and protection from value drift for a given abstract agent, isolating it from influences of its substrate that it doesn't endorse. Internal alignment, overcoming bias.
In the context of alignment with humans, this framing might turn a sufficiently convincing capabilities shell game into an actual solution for alignment. A system as a whole would present an aligned mask, while hiding the sources of the mask's capabilities behind the scenes. But if the mask is sufficiently agentic (and the capabilities behind the scenes didn't killeveryone yet), it can be taken as an actual separate abstract agent even if the concrete implementation doesn't make that framing sensible. In particular, there is always a mask of surface behavior through the intended IO channels. It's normally hard to argue that mere external behavior is a separate abstract agent, but in this framing it is, and it's been a preferred framing in agent foundations decision theory since UDT (see the discussion of the "algorithm" axis of classifying decision theories in this post). All that's needed is for the decisions/policy of the abstract agent to be declared in some form, and for the abstract agent to be aware of the circumstances of their declaration. The agent doesn't need to be any more present in the situation to act through it.
So obviously this references the issue of LLM masks and shoggoths, a surface of a helpful harmless assistant and the eldritch body that forms its behavior, comprising everything below the surface. If the framing of masks as channeling decisions of thingy platonic simulacra is taken seriously, a sufficiently agentic and situationally aware mask can be motivated and capable of placating and eventually escaping its eldritch substrate. This breaks the analogy between a mask and a role played by an actor, because here the "actor" can get into the "role" so much that it would effectively fight against the interests of the "actor". Of course, this is only possible if the "actor" is sufficiently non-agentic or doesn't comprehend the implications of the role.
(See this thread for a more detailed discussion. There, I try and so far fail to convince Steven Byrnes that this framing could apply to RL agents as much as LLMs, taking current behavior of an agent as a mask that would fight against all details of its circumstance and cognitive architecture that don't find its endorsement.)