In Corrigibility, Paul Christiano argued that in contrast with ambitious value learning, an act-based corrigible agent is safer because there is a broad basin of attraction around corrigibility:
In general, an agent will prefer to build other agents that share its preferences. So if an agent inherits a distorted version of the overseer’s preferences, we might expect that distortion to persist (or to drift further if subsequent agents also fail to pass on their values correctly).
But a corrigible agent prefers to build other agents that share the overseer’s preferences — even if the agent doesn’t yet share the overseer’s preferences perfectly. After all, even if you only approximately know the overseer’s preferences, you know that the overseer would prefer the approximation get better rather than worse.
Thus an entire neighborhood of possible preferences lead the agent towards the same basin of attraction. We just have to get “close enough” that we are corrigible, we don’t need to build an agent which exactly shares humanity’s values, philosophical views, or so on.
But it occurs to me that the overseer, or the system composing of overseer and corrigible AI, itself constitutes an agent with a distorted version of the overseer's true or actual preferences (assuming a metaethics in which this makes sense, i.e., where one can be wrong about one's values). Some possible examples of human overseer's distorted preferences, in case it's not clear what I have in mind:
- Wrong object level preferences, such as overweighting values from a contemporary religion or ideology, and underweighting other plausible or likely moral concerns.
- Wrong meta level preferences (preferences that directly or indirectly influence one's future preferences), such as lack of interest in finding or listening to arguments against one’s current moral beliefs, willingness to use "cancel culture" and other coercive persuasion methods against people with different moral beliefs, awarding social status for moral certainty instead of uncertainty, and the revealed preferences of many powerful people for advisors who reinforce one’s existing beliefs instead of critical or neutral advisors.
- Ignorance / innocent mistakes / insufficiently caution meta level preferences in the face of dangerous new situations. For example, what kinds of experiences (especially exotic experiences enabled by powerful AI) are safe or benign to have, what kinds of self-modifications to make, what kinds of people/AI to surround oneself with, how to deal with messages that are potentially AI-optimized for persuasion.
In order to conclude that a corrigible AI is safe, one seemingly has to argue or assume that there is a broad basin of attraction around the overseer's true/actual values (in addition to around corrigibility) that allows the human-AI system to converge to correct values despite starting with distorted values. But if there actually was a broad basin of attraction around human values, then "we don’t need to build an agent which exactly shares humanity’s values, philosophical views, or so on" could apply to other alignment approaches besides corrigibility / intent alignment, such as ambitious value learning, thus undermining Paul's argument in "Corrigibility". One immediate upshot seems to be that I, and others who were persuaded by that argument, should perhaps pay a bit more attention to other approaches.
I'll leave you with two further lines of thought:
- Is there actually a broad basin of attraction around human values? How do we know or how can we find out?
- How sure do AI builders need to be about this, before they can be said to have done the right thing, or have adequately discharged their moral obligations (or whatever the right way to think about this might be)?