Treacherous Turn - AI Alignment Forum

Nick Bostrom came up with the idea of a treacherous turn for smart AIs.

while weak, an AI behaves cooperatively. When the AI is strong enough to be unstoppable it pursues its own values.

There are several reasons for this dynamic:

Past a certain capability level, an agent can become able to take what it wants by force.
- Before reaching that threshold, the agent may seem to have others' interests at heart because it prefers to coordinate with other agents and engage in mutually beneficial trade.
- Past that threshold, however, the agent can get more of what it wants by other means.
Past a certain capability level, an agent is able to resist attempts by its operators to shut it down or change its goals.
- Before reaching that threshold, the agent may want to appear friendly so that programmers will have less reason to shut it down or modify its goals.
- Past that threshold, the agent no longer needs to fake friendliness, since it's now capable enough to do whatever it wants.

Ben Goertzel argued against the treacherous turn thesis, writing in 2016 that:

for a resource-constrained system, learning to actually possess human values is going to be much easier than learning to fake them. This is related to the everyday observation that maintaining a web of lies rapidly gets very complicated.

Goertzel's scenario is sometimes referred to as a "sordid stumble".

The likelihood of a sordid stumble vs. a treacherous turn turns on questions like "how common is deception?", "how complex and fragile is human value?", and "how easy is it to train for robust capabilities relative to robust alignment?". E.g., sordid stumbles are less likely insofar as many different tasks incentivize "being smarter and more capable in general", while relatively few incentivize "having exactly the same goals as a human".

Not to be confused with

Sharp Left Turn: A scenario where AI systems generalize across many domains, while the alignment properties that held at earlier stages fail to generalize well.
- A simple toy model might be "At some point, Narrow AI will suddenly transition to General AI, breaking various security assumptions and yielding a Sharp Left Turn. Then at a later point, General AI will become smart enough to take over the world, yielding a Treacherous Turn."
- However, the idea of a Sharp Left Turn is more general than this, and doesn't require that that it be easy to divide all AI systems into separate "narrow" vs. "general" buckets. E.g., there may be multiple inflection points in generality (one or more of which amount to a Sharp Left Turn).