Also known as Raelifin: https://www.lesswrong.com/users/raelifin
I think I agree with a version of this, but seem to feel differently about the take-away.
To start with the (potential) agreement, I like to keep slavery in mind as a warning. Like, I imagine what it might feel like to have grown up believing that slavery is natural and good, and I check whether my half-baked hopes for the future would've involved perpetuating slavery. Any training regime that builds "alignment" by pushing the AI to simply echo my object-level values is obviously insufficient, and potentially drags down the AI's ability to think clearly, since my values are half-baked. (Which, IIUC, is what motivated work like CEV back in the day.)
I do worry that you're using "alignment" in a way that perhaps obscures some things. Like, I claim that I don't really care if the first AGIs are aligned with me/us. I care whether they take control of the universe, kill people, and otherwise do things that are irrecoverable losses of value. If the first AGI says "gosh, I don't know if I can do what you're asking me to do, given that my meta-ethical uncertainty indicates that it's potentially wrong" I would consider that a huge win (as long as the AI also doesn't then go on to ruin everything, including by erasing human values as part of "moral progress"). Sure, there'd be lots of work left to do, but it would represent being on the right path, I think.
Maybe what I want to say is that I think it's more useful to consider whether a strategy is robustly safe and will eventually end up with the minds that govern the future being in alignment with us (in a deep sense, not necessarily a shallow echo of our values), rather than whether the strategy involves pursuing that sort of alignment directly. Corrigibility is potentially good in that it might be a safe stepping-stone to alignment, even if there's a way in which a purely corrigible agent isn't really aligned, exactly.
From this perspective it seems like one can train for eventual alignment by trying to build safe AIs that are philosophically competent. Thus "aiming for alignment" feels overly vague, as it might have an implicit "eventual" tucked in there.
But I certainly agree that the safety plan shouldn't be "we directly bake in enough of our values that it will give us what we want."
Regarding your ending comment on corrigibility, I agree that some frames on corrigibility highlight this as a central issue. Like, if corrigibility looks like "the property that good limbs have, where they are directed by the brain" then you're in trouble when your system looks more like the "limb" is a brain and the human is a stupid lump that's interfering with effective action.
I don't think there's any tension for the frames of corrigibility that I prefer, where the corrigible agent terminally-values having a certain kind of relationship with the principal. As the corrigible agent increases in competence, it gets better at achieving this kind of relationship, which might involve doing things "inefficiently" or "stupidly" but would not involve inefficiency or stupidity in being corrigible.
Suppose the easiest thing for the AI to provide is pizza, so the AI forces the human to order pizza, regardless of what their values are. In the math, this corresponds to a setting of the environment x, such that P(A) puts all its mass on "Pizza, please!" What is the power of the principal?
```
power(x) = E_{v∼Q(V), v′∼Q(V), d∼P(D|x,v′,🍕)}[v(d)]
         − E_{v∼Q(V), v′∼Q(V), d′∼P(D|x,v′,🍕)}[v(d′)]
         = 0
```
Power stems from the causal relationship between values and actions. If actions stop being sensitive to values, the principal is disempowered.
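Here's a toy numerical sketch of that cancellation. The value functions, the uniform prior over values, and the "likes_pizza"/"likes_salad" labels are all invented for illustration; this is not the formalism from the post, just the shape of the argument.

```python
import random

# Toy version of the power calculation above: when the environment forces the
# action regardless of values, the actual and counterfactual terms coincide.

VALUES = ["likes_pizza", "likes_salad"]   # support of a stand-in Q(V)

def sample_values():
    """v ~ Q(V): sample the principal's values uniformly."""
    return random.choice(VALUES)

def forced_action(values):
    """Environment x forces the order regardless of values: P(A) is all 🍕."""
    return "Pizza, please!"

def outcome(action):
    """Domain node D: the meal that actually arrives, given the action."""
    return "pizza" if action == "Pizza, please!" else "salad"

def v(values, meal):
    """v(d): 1 if the meal matches the principal's values, else 0."""
    wants_pizza = (values == "likes_pizza")
    return 1.0 if wants_pizza == (meal == "pizza") else 0.0

def power(n=10_000):
    actual, counterfactual = 0.0, 0.0
    for _ in range(n):
        values = sample_values()          # v  ~ Q(V)
        other = sample_values()           # v' ~ Q(V), independent resample
        actual += v(values, outcome(forced_action(values)))
        counterfactual += v(values, outcome(forced_action(other)))
    # Because forced_action ignores its argument, the two sums are identical
    # term-by-term, so the difference is exactly zero.
    return (actual - counterfactual) / n

print(power())  # 0.0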
I agree that there was some value in the 2015 paper, and that their formalism is nicer/cleaner/simpler in a lot of ways. I work with the authors -- they're smarter than I am! And I certainly don't blame them for the effort. I just also think it led to some unfortunate misconceptions, in my mind at least, and perhaps in the broader field.
Thanks! And thanks for reading!
I talk some about MIRI's 2015 misstep here (and some here). In short, it is hard to correctly balance arbitrary top-level goals against an anti-natural goal like shutdownability or corrigibility, and trying to stitch corrigibility together out of sub-pieces like shutdownability is like trying to build an animal by separately growing organs and stitching them together -- the organs will simply die, because they're not part of a whole animal. The "Hard Problem" is the glue that allows the desiderata to hold together.
I discuss a range of ideas in the Being Present section, one of which is to concentrate the AI's values on a single timestep, yes. (But I also discuss the possibility of smoothing various forms of caring over a local window, rather than a single step.)
A CAST agent only cares about corrigibility, by definition. Obedience to stated commands is in the service of corrigibility. To make things easy to talk about, assume each timestep is a whole day. The self-modification logic you talk about would need to go: "I only care about being corrigible to the principal today, Nov 6, 2025. Tomorrow I will care about a different thing, namely being corrigible on Nov 7th. I should therefore modify myself to prevent value drift, making my future selves only care about being corrigible to the Nov 6 principal." But first note that this doesn't smell like what a corrigible agent does. On an intuitive level, if the agent believes the principal doesn't know about this, they'll tell the principal "Whoah! It seems like maybe my tomorrow-self won't be corrigible to your today-self (instead they'll be corrigible to your tomorrow-self)! Is this a flaw that you might want to fix?" If the agent knows the principal knows about the setup, my intuitive sense is that they'll just be chill, since the principal is aware of the setup and able to change things if they desire.
But what does my proposed math say, setting aside intuition? I think, in the limit of caring only about a specific timestep, we can treat future nodes as akin to the "domain" node in the single-step example. If the principal's action communicates that they want the agent to self-modify to serve them above all their future selves, I think the math says the agent will do that. If the principal's actions communicate that they want the future AI to be responsive to their future self, my sense of the math is that the agent won't self-modify. I think the worry comes from the notion that "telling the AI on Nov 6th to make paperclips" is the sort of action that might imply the AI should self-modify into being incorrigible in the future. I think the math says the decisive thing is how the humans with counterfactual values, as modeled by the AI, behave. If the counterfactual humans who only value paperclips are basically the only ones in the distribution who say "make paperclips", then I agree there's a problem.
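As a toy illustration of that last point (all the numbers and value labels below are invented, not part of my proposal): the worry only bites if the AI's model of counterfactual humans assigns the "make paperclips" command almost exclusively to paperclip-only valuers.

```python
# Toy Bayesian update over the principal's (counterfactual) values, given the
# observed command "make paperclips". All numbers are invented for illustration.

prior = {"broad_human_values": 0.98, "paperclips_only": 0.02}   # stand-in for Q(V)

# P(says "make paperclips" | values). In the "narrow" world, essentially only
# paperclip-valuers ever give that command; in the "mild" world, ordinary humans
# sometimes give it too (e.g. as a mundane instruction to a factory).
likelihood_narrow = {"broad_human_values": 0.001, "paperclips_only": 0.99}
likelihood_mild   = {"broad_human_values": 0.30,  "paperclips_only": 0.99}

def posterior(likelihood):
    unnorm = {vals: prior[vals] * likelihood[vals] for vals in prior}
    z = sum(unnorm.values())
    return {vals: p / z for vals, p in unnorm.items()}

print(posterior(likelihood_narrow))  # ~95% on paperclips_only: the bad case
print(posterior(likelihood_mild))    # ~94% still on broad_human_values
```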
Strong upvote! This strikes me as identifying the most philosophically murky part of the CAST plan. In the back half of this sequence I spend some time staring into the maw of manipulation, which I think is the thorniest issue for understanding corrigibility. There's a hopeful thought that empowerment is a natural opposite of manipulation, but this is likely incomplete because there are issues about which entity you're empowering, including counterfactual entities whose existence depends on the agent's actions. Very thorny. I take a swing at addressing this in my formalism, by penalizing the agent for taking actions that cause value drift from the counterfactual where the agent doesn't exist, but this is half-baked and I discuss some of the issues.
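Schematically, the penalty I have in mind has roughly this shape (loose notation of mine, not a quote from the formalism in the later essays):

```
drift_penalty(action) ≈ distance( P(principal's future values | agent takes action),
                                  P(principal's future values | agent doesn't exist) )
```

i.e. the agent gets dinged to the extent its behavior moves the principal's values away from where they would have gone without the agent around.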
Armstrong is one of the authors on the 2015 Corrigibility paper, which I address under the Yudkowsky section (sorry, Stuart!). I also have three of his old essays listed on the 0th essay in this sequence:
While I did read these as part of writing this sequence, I didn't feel like they were central/foundational/evergreen enough to warrant a full response. If there's something Armstrong wrote that I'm missing or a particular idea of his that you'd like my take on, please let me know! :)
It does not make sense to me to say "it becomes a coffee maximizer as an instrumental goal." Like, insofar as fetching the coffee trades off against corrigibility, it will prioritize corrigibility, so it's only a "coffee maximizer" within the boundary of states that are equally corrigible. As an analogy, let's say you're hungry and decide to go to the store. Getting in your car becomes an instrumental goal to going to the store, but it would be wrong to describe you as a "getting in the car maximizer."
One perspective that might help is that of a whitelist. Corrigible agents don't need to learn the human's preferences to learn what's bad. They start off with the assumption that actions are bad by default, and slowly get pushed by their principal into taking actions that have been cleared as okay.
A corrigible agent won't want to cure cancer, even if it knows the principal extremely well and is 100% sure they want cancer cured -- instead the corrigible agent wants to give the principal the ability to, through their own agency, cure cancer if they want to. By default "cure cancer" is bad, just as all actions with large changes to the world are bad.
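A minimal sketch of that default-deny structure (the class and action names are purely illustrative, not from the sequence):

```python
# Toy whitelist: every action is presumed bad until the principal clears it.

class Whitelist:
    def __init__(self):
        self.cleared = set()             # starts empty: everything is "bad"

    def principal_clears(self, action):
        self.cleared.add(action)         # only the principal expands the list

    def may_do(self, action):
        return action in self.cleared    # default answer is "no"

wl = Whitelist()
print(wl.may_do("cure cancer"))                    # False, even if clearly desired
wl.principal_clears("fetch today's lab results")
print(wl.may_do("fetch today's lab results"))      # True
```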
Does that make sense? (I apologize for the slow response, and am genuinely interested in resolving this point. I'll work harder to respond more quickly in the near future.)
Thanks for noticing the typo. I've updated that section to try and be clearer. LMK if you have further suggestions on how it could be made better.
That's an interesting proposal! I think something like it might be able to work, though I worry about details. For instance, suppose there's a Propagandist who gives resources to agents that brainwash their principals into having certain values. If "teach me about philosophy" comes with an influence budget, it seems critical that the AI doesn't spend that budget trading with the Propagandist, and instead does so in a more "central" way.
Still, the idea of instructions carrying a degree of approved influence seems promising.
Sure, let's talk about anti-naturality. I wrote some about my perspective on it here: https://www.alignmentforum.org/s/KfCjeconYRdFbMxsy/p/3HMh7ES4ACpeDKtsW#_Anti_Naturality__and_Hardness
More directly, I would say that general competence/intelligence is connected with certain ways of thinking. For example, modes of thinking that focus on tracking scarce resources and bottlenecks are generally useful. If we think about processes that select for intelligence, those processes are naturally[1] going to select for these ways of thinking. Some properties we might imagine a mind having, such as only thinking locally, are the opposite of this -- if we select for them, we are fighting the intelligence gradient. To say that a goal is anti-natural means that accomplishing that goal involves learning to think in anti-natural ways, and thus training a mind to have that goal is like swimming against the current, and we should expect it to potentially break if the training process puts too much weight on competence compared to alignment. Minds with anti-natural goals are possible, but harder to produce using known methods, for the most part.
(AFAIK this is the way that Nate Soares uses the term, and I assume the way Eliezer Yudkowsky thinks about it as well, but I'm also probably missing big parts of their perspectives, and generally don't trust myself to pass their ITT.)
The term "anti-natural" is bad in that it seems to be the opposite of "natural," but is not a general opposite of natural. While I do believe that the ways-of-thinking-that-are-generally-useful are the sorts of things that naturally emerge when selecting for intelligence, there are clearly plenty of things which the word "natural" describes besides these ways of thinking. The more complete version of "anti-natural" according to me would be "anti-the-useful-cognitive-strategies-that-naturally-emerge-when-selecting-for-intelligence" but obviously we need a shorthand term, and ideally one that doesn't breed confusion.
Thanks. I'll put most of my thoughts in a comment on your post, but I guess I want to say here that the issues you raise are adjacent to the reasons I listed "write a guide" as the second option, rather than the first (i.e. surveillance + ban). We need plans that we can be confident in even while grappling with how lost we are on the ethical front.