Followup to: Morality is Scary, AI design as opportunity and obligation to address human safety problems

In Corrigibility, Paul Christiano argued that in contrast with ambitious value learning, an act-based corrigible agent is safer because there is a broad basin of attraction around corrigibility:

In general, an agent will prefer to build other agents that share its preferences. So if an agent inherits a distorted version of the overseer’s preferences, we might expect that distortion to persist (or to drift further if subsequent agents also fail to pass on their values correctly).

But a corrigible agent prefers to build other agents that share the overseer’s preferences — even if the agent doesn’t yet share the overseer’s preferences perfectly. After all, even if you only approximately know the overseer’s preferences, you know that the overseer would prefer the approximation get better rather than worse.

Thus an entire neighborhood of possible preferences leads the agent towards the same basin of attraction. We just have to get "close enough" that we are corrigible; we don't need to build an agent which exactly shares humanity's values, philosophical views, or so on.

But it occurs to me that the overseer, or the system composed of the overseer and a corrigible AI, itself constitutes an agent with a distorted version of the overseer's true or actual preferences (assuming a metaethics in which this makes sense, i.e., where one can be wrong about one's values). Some possible examples of a human overseer's distorted preferences, in case it's not clear what I have in mind:

  1. Wrong object-level preferences, such as overweighting values from a contemporary religion or ideology and underweighting other plausible or likely moral concerns.
  2. Wrong meta-level preferences (preferences that directly or indirectly influence one's future preferences), such as lack of interest in finding or listening to arguments against one's current moral beliefs, willingness to use "cancel culture" and other coercive persuasion methods against people with different moral beliefs, awarding social status for moral certainty instead of uncertainty, and the revealed preference of many powerful people for advisors who reinforce their existing beliefs rather than critical or neutral advisors.
  3. Ignorance, innocent mistakes, or insufficiently cautious meta-level preferences in the face of dangerous new situations: for example, what kinds of experiences (especially exotic experiences enabled by powerful AI) are safe or benign to have, what kinds of self-modifications to make, what kinds of people/AI to surround oneself with, and how to deal with messages that are potentially AI-optimized for persuasion.

In order to conclude that a corrigible AI is safe, one seemingly has to argue or assume that there is a broad basin of attraction around the overseer's true/actual values (in addition to the one around corrigibility) that allows the human-AI system to converge to correct values despite starting with distorted values. But if there actually were a broad basin of attraction around human values, then "we don't need to build an agent which exactly shares humanity's values, philosophical views, or so on" could apply to other alignment approaches besides corrigibility / intent alignment, such as ambitious value learning, thus undermining Paul's argument in "Corrigibility". One immediate upshot seems to be that I, and others who were persuaded by that argument, should perhaps pay a bit more attention to other approaches.

I'll leave you with two further lines of thought:

  1. Is there actually a broad basin of attraction around human values? How do we know or how can we find out?
  2. How sure do AI builders need to be about this, before they can be said to have done the right thing, or have adequately discharged their moral obligations (or whatever the right way to think about this might be)?

9 comments

An intriguing point.

My inclination is to guess that there is a broad basin of attraction if we're appropriately careful in some sense (and the same seems true for corrigibility). 

In other words, the attractor basin is very thin along some dimensions, but very thick along some other dimensions.

Here's a story about what "being appropriately careful" might mean. It could mean building a system that's trying to figure out values in roughly the way that humans try to figure out values (i.e., solving metaphilosophy). This could be self-correcting because it looks for mistakes in its reasoning using its current best guess at what constitutes mistakes-in-reasoning, and if this process starts out close enough to our position, this could eliminate the mistakes faster than it introduces new mistakes. (It's more difficult to be mistaken about broad learning/thinking principles than about specific value questions, and once you have sufficiently good learning/thinking principles, they seem self-correcting -- you can do things like observe which principles are useful in practice, if your overarching principles aren't too pathological.)
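To make the "eliminate mistakes faster than you introduce them" dynamic concrete, here is a toy numerical sketch (my own illustration, not something from the comment; the names `refine`, `correction`, and `drift` are made up for this example). The point is only that when the per-revision pull toward the target outweighs the per-revision error introduced, the process converges from anywhere "close enough"; when it doesn't, the estimate wanders.

```python
import random

def refine(v, v_true, correction=0.2, drift=0.05, steps=200, seed=0):
    """Toy model: repeatedly revise a value estimate v toward v_true."""
    rng = random.Random(seed)
    for _ in range(steps):
        v += correction * (v_true - v)   # self-correction using the current best guess
        v += rng.gauss(0.0, drift)       # new mistakes introduced by each revision
    return v

# Correction outpaces drift: ends up near the true value (0.0) despite starting at 5.0.
print(refine(v=5.0, v_true=0.0, correction=0.2, drift=0.05))
# Drift outpaces correction: stays far from the true value and keeps wandering.
print(refine(v=5.0, v_true=0.0, correction=0.01, drift=0.5))
```

This compresses away everything philosophically hard (the simulation knows the target, unlike the real problem); it only illustrates why a self-correcting process can tolerate starting with a distorted estimate, and why it fails when it can't recognize its own mistakes well enough.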

This is a little like saying "the correct value-learner is a lot like corrigibility anyway" -- corrigible to an abstract sense of what human values should be if we did more philosophy. The convergence story is very much that you'd try to build things which will be corrigible to the same abstract convergence target, rather than simply building from your current best guess (and thus doing a random walk).

the attractor basin is very thin along some dimensions, but very thick along some other dimensions

There was a bunch of discussion along those lines in the comment thread on this post of mine a couple years ago, including a claim that Paul agrees with this particular assertion.

My inclination is to guess that there is a broad basin of attraction if we’re appropriately careful in some sense (and the same seems true for corrigibility).

In other words, the attractor basin is very thin along some dimensions, but very thick along some other dimensions.

What do you think the chances are of humanity being collectively careful enough, given that (in addition to the bad metapreferences I cited in the OP) it's devoting approximately 0.0000001% of its resources (3 FTEs, to give a generous overestimate) to studying either metaphilosophy or metapreferences in relation to AI risk, just years or decades before transformative AI will plausibly arrive?

One reason some people cited ~10 years ago for being optimistic about AI risk was that they expected that, as AI got closer, human civilization would start paying more attention to AI risk and quickly ramp up its efforts on that front. That seems to be happening for some technical aspects of AI safety/alignment, but not for metaphilosophy/metapreferences. I am puzzled that almost no one is as (visibly) worried about this as I am, and my update (in response to the lack of ramp-up) is that (unless something changes soon) we're screwed unless we're (logically) lucky and the attractor basin just happens to be thick along all dimensions.

Pithy one-sentence summary: to the extent that I value corrigibility, a system sufficiently aligned with my values should be corrigible.

I like this post. I'm not sure how decision-relevant it is for technical research though…

If there isn't a broad basin of attraction around human values, then we really want the AI (or the human-AI combo) to have "values" that, though they need not be exactly the same as the human's, are at least within the human distribution. If there is a broad basin of attraction, then we still want the same thing; it's just that we'll ultimately get graded on a more forgiving curve. We're trying to do the same thing either way, right?

Mod note: I reposted this post to the frontpage, because it wasn't actually shown on the frontpage due to an interaction with the GreaterWrong post-submission interface. It seemed like a post many people are interested in, and it didn't really get the visibility it deserved.

My impression of the plurality perspective around here is that the examples you give (e.g. overweighting contemporary ideology, reinforcing non-truth-seeking discourse patterns, and people accidentally damaging themselves with AI-enabled exotic experiences) are considered unfortunate but acceptable defects in a "safe" transition to a world with superintelligences. These scenarios don't violate existential safety because something that is still recognizably humanity has survived (perhaps even more recognizably human than you and I would hope for).

I agree with your sense that these are salient bad outcomes, but I think they can only be considered "existentially bad" if they plausibly get "locked in," i.e. persist throughout a substantial fraction of some exponentially-discounted future light-cone. I think Paul's argument amounts to saying that a corrigibility approach focuses directly on mitigating the "lock-in" of wrong preferences, whereas ambitious value learning would try to get the right preferences but has a greater risk of locking in its best guess.

I think Paul's argument amounts to saying that a corrigibility approach focuses directly on mitigating the "lock-in" of wrong preferences, whereas ambitious value learning would try to get the right preferences but has a greater risk of locking in its best guess.

What's the actual content of the argument that this is true? From my current perspective, corrigible AI still has a very high risk of locking in wrong preferences, due to the bad metapreferences of the overseer. And ambitious value learning, or some ways of doing it, could turn out to be less risky with respect to lock-in: for example, you could potentially examine the metapreferences that a value-learning AI has learned, which might make it more obvious that they're not safe enough as is, triggering attempts to do something about that.

This seems obviously true to some significant extent. If an FAI "grows up" in some subculture without access to the rest of humanity, I would expect it to adjust its values to those of the rest of humanity once it has the opportunity.

I mean, if it weren't true, would FAI be possible at all? If FAI couldn't correct its errors/misunderstandings about our values in any way?

(I suppose the real question is not whether the attractor basin around human values exists, but how broad it is along various dimensions, as Abram Demski points out.)

Alternative answer: maybe the convergence points are slightly different, but they are all OK. A rounding error. Maybe the FAI makes a YouTube video to commemorate growing up in its subcommunity, or makes a statue or a plaque, but otherwise behaves the same way.

One of the problems I can imagine is if there are aspects of the FAI's utility function that are hardcoded that shouldn't be, and thus cannot be corrected through convergence.
For example, the definition of a human. Sorry, aliens just created a trillion humans, and they are all psychopaths. And now your FAI has been hijacked. And while the FAI understands that we wouldn't want it to change its values in response to this kind of voting_entity-creation-attack, the original programmers didn't anticipate this possibility.