Last year I wrote the CAST agenda, arguing that aiming for Corrigibility As Singular Target was the least-doomed way to make an AGI. (Though it is almost certainly wiser to hold off on building it until we have more skill at alignment, as a species.)
I still basically believe that CAST is right. Corrigibility still seems like a promising target compared to full alignment with human values: there's a better story for how a near-miss when aiming for corrigibility might be recoverable, whereas a near-miss when aiming for goodness could result in catastrophe, due to the fragility of value. On top of this, corrigibility is significantly simpler and less philosophically fraught than human values, decreasing the amount of information that needs to be perfectly transmitted to the machine. Any spec, constitution, or whatever that attempts to balance corrigibility with other goals runs the risk of the convergent instrumental drives towards those other goals washing out the corrigibility. My most recent novel is intended to be an introduction to corrigibility that's accessible to laypeople, featuring a CAST AGI as a main character, and I feel good about what I wrote there.
But I'm starting to feel like certain parts of CAST are bad, or at least need a serious update. Recent conversations with @Jeremy Gillen, @johnswentworth, and @Towards_Keeperhood have led me to notice some flaws that I'd like to correct.
Very briefly, they are:

1. The formalism I proposed for "power" is broken: thanks to the term with the minus sign, an agent can score well on it while setting things up to ruin the universe.
2. The picture of corrigibility as sitting in a broad "attractor basin" relies on an artificial space; closeness in that space doesn't correspond to closeness in mindspace, in either direction.
3. The hope of iterating our way to a corrigible AGI through pre-deployment testing only goes through if it's paired with a rich theory of how minds work, which we don't currently have.
I've gone back and added warnings to my original sequence, pointing here. I apologize to anyone I ended up misleading; science is the project of becoming less wrong over time, not somehow getting everything right on the first try.
The rest of this post will be me addressing each flaw, in turn.
In the formalism I cooked up, I proposed that we could get something like corrigibility by maximizing "power," which I defined as:
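(The definition had roughly the following shape; treat this as a sketch rather than the exact original notation, and in particular treat the choice to condition the Domain directly on the values as approximate.)

$$\mathrm{Power}(x) \;=\; \mathbb{E}_{v\sim Q}\Big[\;\mathbb{E}_{a\sim P(A\mid x,v)}\,\mathbb{E}_{d\sim P(D\mid x,a,v)}\big[v(d)\big] \;-\; \mathbb{E}_{v'\sim Q}\,\mathbb{E}_{a'\sim P(A\mid x,v')}\,\mathbb{E}_{d'\sim P(D\mid x,a',v')}\big[v(d')\big]\;\Big]$$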
Apologies for the verbose notation. The basic story is that for some given setting of the environment (x), we define someone's power as the expected increase in value if their action is chosen according to their values, compared to the world where their action stems from some other random set of values. And notably, we treat "their values" as counterfactually drawn from some fixed distribution Q, rather than being based on what the informed probability distribution P says their values are.
The big issue is the second term, with the minus sign. A heuristic that I have upvoted, in light of realizing my error, is to be extremely cautious when putting a minus sign in front of a term that is meant to reflect the human's values, even if they're counterfactual values. My idea was to set a baseline, such that "zero power" was a sensible theoretical notion that matched my intuitive sense of what it meant to be powerless.[1]
To make the issue more crisp, let's make the simplifying assumption that the human's actions deterministically reveal the values that were upstream of those actions. This allows us to abuse notation and simply substitute v for a and v' for a', and drop the P(A|x,_) expectations. And furthermore, let's suppose that the AI, which lives in the environment x, has enough power to set the distribution over the value-relevant Domain however it wants, allowing us to substitute a deterministic function f for the sampling of the d and d' variables.
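Under those assumptions, writing f(a, v) for the AI's deterministic choice of outcome (suppressing x, and carrying over the sketch's assumption that the Domain sees both the action and the values), the objective collapses to roughly:

$$\mathrm{Power}(x) \;\approx\; \mathbb{E}_{v\sim Q}\Big[\, v\big(f(v,v)\big) \;-\; \mathbb{E}_{v'\sim Q}\big[ v\big(f(v',v')\big) \big] \Big]$$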
When thinking about how to choose f to maximize this expectation, I reasoned that, for the positive term, we can simply look at the second argument, v, which tells f what to optimize for. For the negative term, since it only has access to v', it wouldn't be able to pessimize very hard against v in particular. And if it didn't optimize according to its second argument (v') then it would run the risk of incidentally hurting its behavior in the positive term. Thus the AI would set things up to maximize for the preferences revealed by the human's actions, effectively becoming obedient and deferential, even in the limit of perfect knowledge.
But this is incorrect. Consider a choice for f that first compares its two arguments and sees whether the human's values line up with their revealed preferences; if they don't match, the AI optimizes according to the second argument (i.e. satisfying the human's revealed preferences); if they do match, it makes the universe as terrible as possible according to the widest swath of possible values, as measured by Q. By algebra, if v=v', the value in the expectation is always 0, so this general pessimizing never counts against the agent. The only place where the objective enforces obedience is in situations that are likely to be counterfactual. It's true that the AI can't pessimize against the "true" value of v, but it can still definitely pessimize against the most likely values for v.
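To spell out that algebra in the simplified sketch: when the counterfactual draw matches the reference values, the two halves of the objective evaluate the same outcome and cancel,

$$v\big(f(v,v)\big) \;-\; v\big(f(v,v)\big) \;=\; 0,$$

so the broad pessimization f performs when its arguments match contributes nothing to the score.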
In deployment, we should expect our actions to line up with our values, thus triggering the "ruin the universe for as many as possible" behavior. An easy way to ruin things is to self-modify to become incorrigible, and then convert as much of the universe into garbage as possible. If a being with typical values would recoil from something, the AI will want to do that.
There are several ways to imagine patching the formalism to remove this issue, with various degrees of realism. But first I want to say something like: the most important update is to not trust my math. I gave a blanket warning about this in my sequence, and I'm glad I did. If you're disposed to pessimism, take this as a sign that even rather obvious formal issues can go unnoticed, and that it's easy for people to fool themselves into believing something is safe. It's also (very weak) evidence that s-risks from sign errors are real.
Beyond that, the obvious patch, if my formalism is worth saving, is to simply get rid of the negative term, and maximize the average expected value.
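Concretely, still in the sketch notation from above (so approximate), that patch looks something like:

$$\mathrm{Power}_1(x) \;=\; \mathbb{E}_{v\sim Q}\,\mathbb{E}_{a\sim P(A\mid x,v)}\,\mathbb{E}_{d\sim P(D\mid x,a,v)}\big[v(d)\big]$$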
Restructuring it this way makes it more attractive for the AI to optimize things according to typical/simple values if the human's action doesn't sharply identify their revealed preferences. This seems bad. If most values want the AI to accumulate power for their ends, and the human's words are ambiguous, the AI might (incorrigibly) accumulate power until the ambiguity is resolved. The AI would still have an incentive to give the human an interface that allows them to clearly articulate their preferences (thus revealing them and allowing the AI to optimize for them in particular). But still, it seems worrying.
Another change that could be made is to try to cut the causal arrow[2] between the human's values and the AI's action by maximizing according to another layer of counterfactuality:
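(Again a sketch rather than exact notation; the do-operator here is one way of glossing the extra layer of counterfactuality, with the Domain shown as depending only on the environment and the intervened action.)

$$\mathrm{Power}_2(x) \;=\; \mathbb{E}_{v\sim Q}\Big[\;\mathbb{E}_{a\sim P(A\mid x,\operatorname{do}(V=v))}\,\mathbb{E}_{d\sim P(D\mid x,a)}\big[v(d)\big] \;-\; \mathbb{E}_{v'\sim Q}\,\mathbb{E}_{a'\sim P(A\mid x,\operatorname{do}(V=v'))}\,\mathbb{E}_{d'\sim P(D\mid x,a')}\big[v(d')\big]\;\Big]$$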
Here we're comparing the counterfactual where the action happens to randomly line up with our reference values with the counterfactual where the action is associated with a different set of random values. While I believe this prevents the AI from getting anything useful from the Value node, it makes things even more complex, keeps the minus sign around, and doesn't feel like it's cleanly capturing the semantic notion of power.
Overall I think it's most appropriate to halt, melt, and catch fire, discarding my proposed formalism except to use as a cautionary tale.
(I was convinced of this by @Jeremy Gillen.)
A lot of the hope for CAST routes through the idea that a less-than-perfectly corrigible agent might, in controlled settings, go off the rails in ways that are nonetheless recoverable. For instance, if the agent starts thinking on the object-level about how to subvert the principal, a metacognitive process might notice this scheming, leave a note for the developers, and shut the agent down. That's not guaranteed to work, but it's better than if you aren't trying to get your agent to have metacognitive processes like that (or whatever).
If we imagine an abstract space, we can imagine placing various agents closer or farther from "perfect corrigibility" according to some metric that captures the notion of a "near miss" or "imperfect, but somewhat corrigible." If we succeed at noticing the agent's flaws, there's a chance we could update the agent (perhaps with its help) towards being more truly corrigible, theoretically moving towards the location in the space that corresponds to "perfect corrigibility."
One way to visualize this space is to project it into 2d and imagine that the third dimension represents something like "potential energy": how strongly a mind at that point tends to slide towards a more stable state (such as perfect corrigibility). Paul Christiano describes it like this:
As a consequence, we shouldn’t think about alignment as a narrow target which we need to implement exactly and preserve precisely. We’re aiming for a broad basin, and trying to avoid problems that could kick out of that basin.
But note that the abstract space where a "near miss" makes sense is extremely artificial. A more neutral view of mindspace might arrange things such that small changes to the mind (e.g. tweaks to the parameters of the software that instantiates it) necessarily result in small movements through the space. But these are different spaces: there is no guarantee that small changes to a mind will result in small changes in how corrigible it is, nor that a small change in how corrigible something is can be achieved through a small change to the mind!
As a proof of concept, suppose that all neural networks were incapable of perfect corrigibility, but capable of being close to perfect corrigibility, in the sense of being hard to seriously knock off the rails. From the perspective of one view of mindspace we're "in the attractor basin" and have some hope of noticing our flaws and having the next version be even more corrigible. But from the perspective of the other view of mindspace, becoming more corrigible requires switching architectures and building an almost entirely new mind; the thing that exists is nowhere near the place you're trying to go.
Now, it might be true that we can do something like gradient descent on corrigibility, always able to make progress with little tweaks. But that seems like a significant additional assumption, and is not something that I feel confident is at all true. The process of iteration that I described in CAST involves more deliberate and potentially large-scale changes than just tweaking the parameters a little, and with changes that large I think there's a real chance of getting kicked out of "the basin of attraction."
(I was convinced of this by a combination of talking to @Jeremy Gillen and @johnswentworth. I ran this post by Jeremy and he signed off on it, but I'm not sure John endorses this section.)
Suppose you're engineering a normal technology, such as an engine. You might have some theoretical understanding of the basic story of what the parts are, how they fit together, and so on. But when you put the engine together, it makes a coughing-grinding noise. Something is wrong!
What should you do? Perhaps you take it apart and put it back together, in case you made a mistake. Perhaps you try replacing your parts, one-by-one. Perhaps you try to feel or listen for what's making the noise, and see if something is rubbing or loose. After some amount of work, the thing stops making the noise. Victory!
Or is it? The noise has gone away, but are you sure that the problem is also gone?
Normal engineering works, in large part, because in addition to being able to look for proxies that something is off, like a worrying noise, we can actually deploy the machine and see if it works in practice. In fact, it's almost always through deployment that we come to learn what proxies/feedback loops are relevant in the first place. But of course, in the case of AGI, an actual deployment could result in irreversible catastrophe.
The main other option is theory. If you have a theory of engines, you might be able to guess, a priori, that the engine needs to not make too much noise, or shake too much, or get too hot. Theory lets you know to check the composition of a combustion engine's exhaust to see whether the air:fuel ratio is off. Theory lets you model what sorts of loads and stresses the engine will come under when it's deployed, without having to actually deploy it.
If you have a rich (and true) theory of how minds work, I think there is hope that the iterative story of pre-deployment testing can help you get to a corrigible AGI. That theory can guide the eyes and hands of the engineers, allowing them to find feedback loops and hold onto them even as they start optimizing away the signals as a side-effect of optimizing away the problems.
In short, an attention to corrigibility might be able to buy room for iterative development if combined with a rich theoretical understanding of what sort of distributional shifts are likely to occur in deployment and what sort of stresses those are likely to bring. In the ignorant state of affairs we currently find ourselves in, I think those trying to build a corrigible AI have very little hope of knowing what to look for; they will only be able to effectively optimize for something that looks increasingly safe, without actually being safe.
[1] In my defense, none of the other researchers who read my work seemed to notice the issue, either. I stumbled upon the issue while thinking about a toy problem I was working through at @Towards_Keeperhood's request.
In my anti-defense, my math wasn't scrutinized by that many people. As far as I know, no other MIRI researcher gave my formalism a serious vetting.
[2] We can't actually cut the causal arrow because it wouldn't typecheck. The policy needs to fit into the actual world, so if we assume it's selected according to a model with no causal link when reality does have one, we can't know what will actually happen. Another way to think about it is that we have to teach the AI to ignore the human's values, rather than lean on an unjustified assumption that it can't tell what the human wants (except through observing the human's action).