I wrote this post imagining "strategy-stealing assumption" as something you would assume for the purpose of an argument. For example, I might want to justify an AI alignment scheme by arguing "Under a strategy-stealing assumption, this AI would result in an OK outcome." The post was motivated by trying to write up another argument where I wanted to use this assumption, spending a bit of time trying to think through what the assumption was, and deciding it was likely to be of independent interest. (Although that other argument hasn't yet appeared in print.)
I'd be happy to have a better name for the research goal of making it so that this kind of assumption is true. I agree this isn't great. (And then I would probably be able to use that name in the description of this assumption as well.)
(See also the concept of "decoupled RL" from some DeepMind folks.)
Now that I understand "corrigible" isn't synonymous with “satisfying my short-term preferences-on-reflection”, “corrigibility is relatively easy to learn” doesn't seem enough to imply these things.
I agree that you still need the AI to be trying to do the right thing (even though we don't e.g. have any clear definition of "the right thing"), and that seems like the main way that you are going to fail.
As I understand it, the original motivation for corrigibility_MIRI was to make sure that someone can always physically press the shutdown button, and the AI would shut off. But if a corrigible_Paul AI thinks (correctly or incorrectly) that my preferences-on-reflection (or "true" preferences) are to let the AI keep running, it will act against my (actual physical) attempts to shut down the AI, and therefore it's not corrigible_MIRI.
Note that "corrigible" is not synonymous with "satisfying my short-term preferences-on-reflection" (that's why I said: "our short-term preferences, including (amongst others) our preference for the agent to be corrigible.")
I'm just saying that when we talk about concepts like "remain in control" or "become better informed" or "shut down," those all need to be taken as concepts-on-reflection. We're not satisfying current-Paul's judgment of "did I remain in control?"; we're satisfying the on-reflection notion of "did I remain in control?"
Whether an act-based agent is corrigible depends on our preferences-on-reflection (this is why the corrigibility post says that act-based agents "can be corrigible"). It may be that our preferences-on-reflection are for an agent to not be corrigible. It seems to me that for robustness reasons we may want to enforce corrigibility in all cases, even if it's not what we'd prefer-on-reflection.
That said, even without any special measures, saying "corrigibility is relatively easy to learn" is still an important argument about the behavior of our agents, since it hopefully means that either (i) our agents will behave corrigibly, (ii) our agents will do something better than behaving corrigibly, according to our preferences-on-reflection, or (iii) our agents are making a predictable mistake in optimizing our preferences-on-reflection (which might be ruled out by them simply being smart enough and understanding the kinds of argument we are currently making).
By "corrigible" I think we mean "corrigible by X" with the X implicit. It could be "corrigible by some particular physical human."
(In that post I did use narrow in the way we are currently using short-term, contrary to my claim in the grandparent. Sorry for the confusion this caused.)
Like I mentioned above, I mostly think of narrow value learning as a substitute for imitation learning or approval-direction, realistically to be used as a distillation step rather than as your whole AI. In particular, an agent trained with narrow value learning is probably not aligned+competitive in a way that might allow you to apply this kind of strategy-stealing argument.
In concrete approval-directed agents I'm talking about a different design; it's not related to narrow value learning.
I don't use narrow and short-term interchangeably. I've only ever used "narrow" in the context of value learning, in order to make this particular distinction between two different goals you might have when doing value learning.
One of us just misunderstood (1); I don't think there is any difference.
I mean preferences about what happens over the near future, but the way I rank "what happens in the near future" will likely be based on its consequences (further in the future, in other possible worlds, etc.). So I took (1) to be basically equivalent to (2).
"Terminal preferences over the near future" is not a thing I often think about and I didn't realize it was a candidate interpretation (normally when I write about short-term preferences I'm writing about things like control, knowledge, and resource acquisition).
By "short" I mean short in sense (1) and (2). "Short" doesn't imply anything about senses (3), (4), (5), or (6) (and "short" and "long" don't seem like good words to describe those axes, though I'll keep using them in this comment for consistency).
By "preferences-on-reflection" I mean long in sense (3) and neither in sense (6). There is a hypothesis that "humans with AI help" is a reasonable way to capture preferences-on-reflection, but they aren't defined to be the same. I don't use understandable and evaluable in this way.
I think (4) and (5) are independent axes. (4) just sounds like "is your AI good at optimizing," not a statement about what it's optimizing. In the discussion with Eliezer I'm arguing against it being linked to any of these other axes. (5) is a distinction about two senses in which an AI can be "optimizing my short-term preferences-on-reflection."
When discussing perfect estimations of preferences-on-reflection, I don't think the short vs. long distinction is that important. "Short" is mostly important when talking about ways in which an AI can fall short of perfectly estimating preferences-on-reflection.
Assuming my interpretation is correct, my confusion is that you say we shouldn't expect a situation where "the user-on-reflection might be happy with the level of corrigibility, but the user themselves might be unhappy" (I take you to be talking about sense (3) from above). It seems like the user-on-reflection and the current user would disagree about many things (that is the whole point of reflection), so if the AI acts in accordance with the intentions of the user-on-reflection, the current user is likely to end up unhappy.
I introduced the term "preferences-on-reflection" in the previous comment to make a particular distinction. It's probably better to say something like "actual preferences" (though this is also likely to be misinterpreted). The important property is that I'd rather have an AI that satisfies my actual preferences than any other kind of AI. We could also say "better by my lights" or something else.
There's a hypothesis that "what I'd say after some particular idealized process of reflection" is a reasonable way to capture "actual preferences," but I think that's up for debate. For example, it could fail if me-on-reflection is selfish and has values opposed to current-me. It could certainly fail for any particular process of reflection, so it might just happen that no process of reflection satisfies it.
The claim I usually make is that "what I'd say after some particular idealized process of reflection" describes the best mechanism we can hope to find for capturing "actual preferences," because whatever else we might do to capture "actual preferences" can just be absorbed into that process of reflection.
"Actual preferences" is a pretty important concept here, and I don't think we could get around the need for it. I'm not sure if there is disagreement about this concept or just about the term being used for it.
All three of these corrigible AIs deal with much narrower preferences than "acquire flexible influence that I can use to get what I want". The narrow value learner post for example says:
Imitation learning, approval-direction, and narrow value learning are not intended to exceed the overseer's capabilities. These are three candidates for the distillation step in iterated distillation and amplification.
The AI we actually deploy, which I'm discussing in the OP, is produced by imitating (or learning the values of, or maximizing the approval of) an even smarter AI---whose valuations of resources reflect everything that unaligned AIs know about which resources will be helpful.
Corrigibility is about short-term preferences-on-reflection. I see how this is confusing. Note that the article doesn't make sense at all when interpreted in the other way. For example, the user can't even tell whether they are in control of the situation, so what does it mean to talk about their preference to be in control of the situation if these aren't supposed to be preferences-on-reflection? (Similarly for "preference to be well-informed" and so on.) The desiderata discussed in the original corrigibility post seem basically the same as the user not being able to tell what resources will help them achieve their long-term goals, but still wanting the AI to accumulate those resources.
I also think the act-based agents post is correct if "preferences" means preferences-on-reflection. It's just that the three approaches listed at the top are limited to the capabilities of the overseer. I think that distinguishing between preferences-as-elicited and preferences-on-reflection is the most important thing to disambiguate here. I usually use "preference" to mean preference-on-idealized-reflection (or whatever "actual preference" should mean, acknowledging that we don't have a real ground truth definition), which I think is the more typical usage. I'd be fine with suggestions for disambiguation.
If there's somewhere else I've equivocated in the way you suggest, then I'm happy to correct it. It seems like a thing I might have done in a way that introduces an error. I'd be surprised if it hides an important problem (I think the big problems in my proposal are lurking in other places, not here), and I think that in the corrigibility post I have these concepts straight.
One thing you might have in mind is the following kind of comment:
If on average we are unhappy with the level of corrigibility of a benign act-based agent, then by construction it is mistaken about our short-term preferences.
That is, you might be concerned: "the user-on-reflection might be happy with the level of corrigibility, but the user themselves might be unhappy." I'm saying that you shouldn't expect this to happen, if the AI is well-calibrated and has enough of an understanding of humans to understand e.g. this discussion we are currently having---if it decides not to be corrigible, we should expect it to be right on average.
This seems too glib, if "long-term preferences" are in some sense the "right" preferences, e.g., if under reflective equilibrium we would wish that we currently put a lot more weight on long-term preferences. Even if we only give unaligned AIs a one-time advantage (which I'm not sure about), that could still cause us to lose much of the potential value of the universe.
To be clear, I am worried about people not understanding or caring about the long-term future, and AI giving them new opportunities to mess it up.
I'm particularly concerned about things like people giving their resources to some unaligned AI that seemed like a good idea at the time, rather than simply opting out of competition so that unaligned AIs might represent a larger share of future-influencers. This is another failure of strategy-stealing that probably belongs in the post---even if we understand alignment, there may be plenty of people not trying to solve alignment and instead doing something else, and the values generated by that "something else" will get a natural boost.