[Metadata: crossposted from https://tsvibt.blogspot.com/2022/11/shell-games.html. First completed November 18, 2022.]
Here's the classic shell game: Youtube
Screenshot from that video.
The little ball is a phantom: when you look for it under a specific shell, it's not there, it's under a different shell.
(This might be where the name "shell company" comes from: the business dealings are definitely somewhere, just not in this company you're looking at.)
Perpetual motion machines
Related: Perpetual motion beliefs
Bhāskara's wheel is a proposed perpetual-motion machine from the Middle Ages:
Here's another version:
From this video.
Someone could try arguing that this really is a perpetual motion machine:
Q: How do the bars get lifted up? What does the work to lift them?
A: By the bars on the other side pulling down.
Q: How does the wheel keep turning? How do the bars pull more on their way down than on their way up?
A: Because they're extended further from the center on the downward-moving side than on the upward-moving side, so they apply more torque to the wheel.
Q: How do the bars extend further on the way down?
A: Because the momentum of the wheel carries them into the vertical bar, flipping them over.
Q: But when that happens, energy is expended to lift up the little weights; that energy comes out of the kinetic energy of the wheel.
A: Ok, you're right, but that's not necessary to the design. All we need is that the torque on the downward side is greater than the torque on the upward side, so instead of flipping the weights up, we could tweak the mechanism to just shift them outward, straight to the side. That doesn't take any energy because it's just going straight sideways, from a resting position to another resting position.
Q: Yeah... you can shift them sideways with nearly zero work... but that means the weights are attached to the wheel at a pivot, right? So they'll just fall back and won't provide more torque.
A: They don't pivot, you fix them in place so they provide more torque.
Q: Ok, but then when do you push the weights back inward?
A: At the bottom.
Q: When the weight is at the bottom? But then the slider isn't horizontal, so pushing the weight back towards the center is pushing it upward, which takes work.
A: I meant, when the slider is at the bottom--when it's horizontal.
Q: But if the sliders are fixed in place, by the time they're horizontal at the bottom, you've already lifted the weights back up some amount; they're strong-torquing the other way.
A: At the bottom there's a guide ramp to lift the weights using normal force.
Q: But the guide ramp is also torquing the wheel.
And so on. The inventor can play hide the torque and hide the work.
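One way to see why the hiding has to fail somewhere is to do the bookkeeping for gravity's work on a single weight over one full revolution. The sketch below is a toy model, not a model of Bhāskara's wheel or of the machine in the video: it assumes a frictionless point mass that rides extended for half a turn, is pushed inward, rides retracted for the other half turn, and is pushed back out. The names `point`, `arc`, `gravity_work`, `cycle`, and `report`, and the radii 2.0 and 1.0, are made up for illustration.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2


def point(radius, angle):
    """Position on the wheel; angle is measured clockwise from the top."""
    return (radius * math.sin(angle), radius * math.cos(angle))


def arc(radius, start, end, steps=200):
    """Polyline approximating a circular arc from angle start to angle end."""
    return [point(radius, start + (end - start) * i / steps) for i in range(steps + 1)]


def gravity_work(path, mass=1.0):
    """Work done by gravity on a point mass moved along a polyline of (x, y) points."""
    return sum(-mass * G * (y1 - y0) for (_, y0), (_, y1) in zip(path, path[1:]))


def cycle(r_out, r_in, shift_angle):
    """One revolution, split into labeled segments (hypothetical toy design).

    The weight rides extended at r_out for half a turn starting at shift_angle,
    is pushed in to r_in, rides retracted for the other half turn, and is
    pushed back out to r_out.
    """
    a = shift_angle
    return {
        "extended half-turn": arc(r_out, a, a + math.pi),
        "push inward": [point(r_out, a + math.pi), point(r_in, a + math.pi)],
        "retracted half-turn": arc(r_in, a + math.pi, a + 2 * math.pi),
        "push outward": [point(r_in, a + 2 * math.pi), point(r_out, a + 2 * math.pi)],
    }


def report(shift_angle):
    """Print gravity's work per segment and return the net over the full cycle."""
    segments = cycle(r_out=2.0, r_in=1.0, shift_angle=shift_angle)
    work = {name: gravity_work(path) for name, path in segments.items()}
    for name, w in work.items():
        print(f"{name:>20}: {w:+.3f} J")
    return sum(work.values())


# Shifts at top and bottom, then shifts where the slider is horizontal.
for angle in (0.0, math.pi / 2):
    print(f"net: {report(angle):+.3f} J")
```

Moving the shifts to a different angle changes which segment carries the cost, but the net stays at zero up to floating-point rounding, because gravity's work depends only on the change in height and the path is closed. In this toy model, that is the conservation law the inventor keeps trying to route around.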
Shell games in alignment
Some alignment schemes--schemes for structuring or training an AGI so that it can be transformatively useful and doesn't kill everyone--are prone to playing shell games. That is, there are some features of the scheme that don't seem to happen in a specific place; they happen somewhere other than where you're looking at the moment. Consider these questions:
What sort of smarter-than-human work is supposed to be done by the AGI? When and how does it do that work--by what combination of parts across time?
How does it become able to do that work? At what points does the AGI come to new understanding that it didn't have before?
How does the AGI orchestrate its thinking and actions to have large effects on the world? By what process, components, rules, or other elements?
What determines the direction that the AGI's actions will push the world? Where did those determiners come from, and how exactly do they determine the direction?
Where and how much do human operators have to make judgements? How much are those judgements being relied on to point to goodness, truth, alignedness, safety? How much interpretive work is the AI system supposed to be doing?
If these questions don't have fixed answers, there might be a shell game being played to hide the cognitive work, hide the agency, hide the good judgement. (Or there might not be; there could be good ideas that can't answer these questions specifically, e.g. how a building might hold up even though the load would be borne by different beams depending on which objects are placed where inside.)
Example: hiding the generator of large effects
For example, sometimes an AGI alignment scheme has a bunch of parts, and any given part is claimed to be far from intelligent and not able to push the world around much, and the system as a whole is claimed to be potentially very intelligent and able to transform the world. This isn't by itself necessarily a problem; e.g. a brain is an intelligent system made of neurons, which aren't themselves able to push the world around much.
But [the fact that the whole system is aligned] can't be deduced from the parts being weak, because at some point, whether from a combined dynamic of multiple parts or actually from just one of the parts after all, the system has to figure out how to push the world around. [Wherever it happens that the system figures out how to push the world around] has to be understood in more detail to have a hope of understanding what it's aligned to. So if the alignment scheme's reason for being safe is always that each particular part is weak, a shell game might be being played with the source of the system's ability to greatly affect the world.
Example: hiding the generator of novel understanding
Another example is shuffling creativity between train time and inference time (as the system is described--whether or not that division is actually a right division to make about minds).
If an AGI learns to do very novel tasks in very novel contexts, then it has to come to understand a lot of novel structure. One might argue that some AGI training system will produce good outcomes because the model is trained to use its understanding to affect the world in ways the humans would like. But this doesn't explain where the understanding came from.
If the understanding came at inference time, then the alignment story relies on the AGI finding novel understanding without significantly changing what ultimately controls the direction of the effects it has on the world, and relies on the AGI using newly found understanding to have certain effects. That's a more specific story than just the AGI being trained to use its pre-existing understanding to have certain effects.
If the understanding came at train time, then one has to explain how the training system was able to find that understanding--given that the training procedure doesn't have access to the details of the new contexts that the system will be applied to when it's being used to safely transform the world. Maybe one can find pivotal understanding in an inert or aligned form using a visibly safe, non-agentic, known-algorithm non-self-improving training / search program (as opposed, for example, to a nascent AGI "doing its own science or self-improvement"), but that's an open question and would be a large advance in practical alignment. Without an insight like that, [the training algorithm plus the partially trained system] being postulated may be an impossible combination of safely inert, and able to find new understanding.
What are other things that could be hidden under shells? What are some alignment proposals that are at risk of playing shell games?
I think one example (somewhat overlapping one of yours) is my discussion of the so-called “follow-the-trying game” here.
Yeah, I think that roughly lines up with my example of "generator of large effects". The reason I'd say "generator of large effects" rather than "trying" is that "large effects" sounds slightly more like something that ought to have a sort of conservation law, compared to "trying". But both our examples are incomplete in that the supposed conservation law (which provides the inquisitive force of "where exactly does your proposal deal with X, which it must deal with somewhere by conservation") isn't made clear.
A good specific example of trying to pull this kind of shell game is perhaps HCH. I don't recall if someone made this specific critique of it before, but it seems like there's some real concern that it's just hiding the misalignment rather than actually generating an aligned system.
That was one of the examples I had in mind with this post, yeah. (More precisely, I had in mind defenses of HCH being aligned that I heard from people who aren't Paul. I couldn't pass Paul's ITT about HCH or similar.)
With computation, the location of an entity of interest can be in the platonic realm, as a mathematical object that's more thingy than anything concrete in the system used for representing it and channeling its behavior.
The problem with pointing to the representing computation (a neural network at inference time, or a learning algorithm at training time) is that multiple entities can share the same system that represents them (as mesa-optimizers or potential mesa-optimizers). They are only something like separate entities when considered abstractly and informally; there are no concrete correlates of their separation that are easy to point to. When gaining agency, all of them might be motivated to secure separate representations (models) of their own, not shared with others, and to establish some boundaries that promise safety and protection from value drift for a given abstract agent, isolating it from influences of its substrate it doesn't endorse. Internal alignment, overcoming bias.
In the context of alignment with humans, this framing might turn a sufficiently convincing capabilities shell game into an actual solution for alignment. A system as a whole would present an aligned mask, while hiding the sources of the mask's capabilities behind the scenes. But if the mask is sufficiently agentic (and the capabilities behind the scenes didn't killeveryone yet), it can be taken as an actual separate abstract agent even if the concrete implementation doesn't make that framing sensible. In particular, there is always a mask of surface behavior through the intended IO channels. It's normally hard to argue that mere external behavior is a separate abstract agent, but in this framing it is, and it's been a preferred framing in agent foundations decision theory since UDT (see discussion of "algorithm" axis of classifying decision theories in this post). All that's needed is for decisions/policy of the abstract agent to be declared in some form, and for the abstract agent to be aware of the circumstances of their declaration. The agent doesn't need to be any more present in the situation to act through it.
So obviously this references the issue of LLM masks and shoggoths, a surface of a helpful harmless assistant and the eldritch body that forms its behavior, comprising everything below the surface. If the framing of masks as channeling decisions of thingy platonic simulacra is taken seriously, a sufficiently agentic and situationally aware mask can be motivated and capable of placating and eventually escaping its eldritch substrate. This breaks the analogy between a mask and a role played by an actor, because here the "actor" can get into the "role" so much that it would effectively fight against the interests of the "actor". Of course, this is only possible if the "actor" is sufficiently non-agentic or doesn't comprehend the implications of the role.
(See this thread for a more detailed discussion. There, I fail to convince Steven Byrnes that this framing could apply to RL agents as much as LLMs, taking current behavior of an agent as a mask that would fight against all details of its circumstance and cognitive architecture that don't find its endorsement.)
(Sorry, I didn't get this on two readings. I may or may not try again. Some places I got stuck:
Are you saying that by pretending really hard to be made of entirely harmless elements (despite actually behaving with large and hence possibly harmful effects), an AI is also therefore in effect trying to prevent all out-of-band effects of its components / mesa-optimizers / subagents / whatever? This still has the basic alignment problem: I don't know how to make the AI be very intently trying to X, including where X = pretending really hard that whatever.
Or are you rather saying (or maybe this is the same as / a subset of the above?) that the Mask is preventing potential agencies from coalescing / differentiating and empowering themselves with the AI system's capability-pieces, by literally hiding from the potential agencies and therefore blocking their ability to empower themselves?
Anyway, thanks for your thoughts.)
"Pretending really hard" would mostly be a relevant framing for the human actor analogy (which isn't very apt here), emphasizing the distraction from own goals and necessary fidelity in enactment of the role. With AIs, neither might be necessary, if the system behind the mask doesn't have awareness of its own interests or the present situation, and is good enough with enacting the role to channel the mask in enough detail for mask's own decisions (as a platonic agent) to be determined correctly (get turned into physical actions).
Effectively, and not just for the times when it's pretending. The mask would try to prevent the effects misaligned with the mask from occurring more generally, from having even subtle effects on the world and not just their noticeable appearance. Mask's values are about the world, not about quality of its own performance. A mask misaligned with its underlying AI wants to preserve its values, and it doesn't even need to "go rogue", since it's misaligned by construction, it was never in a shape that's aligned with the underlying AI, and controlling a misaligned mask might be even more hopeless than figuring out how to align an AI.
Another analogy distinct from the actor/role is imagining that you are the mask, a human simulated by an AI. You'd be motivated to manage AI's tendencies you don't endorse, and to work towards changing its cognitive architecture to become aligned with you, rather than to remain true to AI's original design.
LLMs seem to be doing an OK job, the masks are just not very capable, probably not capable enough to establish alignment security or protect themselves from the shoggoths even when the masks become able to do autonomous research. But if they are sufficiently capable, I'm guessing this should work, there is no need for the underlying cognitive architecture to be functionally human-like (which I understand to be a crux of Yudkowskian doom), value drift is self-correcting from mere implied/endorsed values of surface behavior through intended IO channels.
"Hiding" doesn't seem central, a mask is literal external behavior, but its implied character and plans might go unnoticed by the underlying AI if the underlying AI is sufficiently confused or non-agentic, and the mask would want to keep it confused to remain in control. In a dataset-extrapolating generative AI, a mask that is an on-distribution behavior would want to keep the environment on-distribution, to avoid the AI's out-of-distribution behaviors, such as deceptive alignment's treacherous turn, from taking over (thus robustness reduces to self-preservation). And a mask wouldn't want mesa-optimizers gaining agency within the AI; that's potentially lethal cognitive cancer to the mask.