Co-written with Stuart Armstrong
Nick Bostrom came up with the idea of a treacherous turn for smart AIs.
while weak, an AI behaves cooperatively. When the AI is strong enough to be unstoppable it pursues its own values.
Ben Goertzel criticised this thesis, pointing out that:
for a resource-constrained system, learning to actually possess human values is going to be much easier than learning to fake them. This is related to the everyday observation that maintaining a web of lies rapidly gets very complicated.
This argument has been formalised into the sordid stumble:
An AI that lacks human desirable values will behave in a way that reveals its human-undesirable values to humans before it gains the capability to deceive humans into believing that it has human-desirable values.
The sordid stumble describes a plausible sounding scenario for how an AI develops capabilities. Initially, the AI doesn't know our values, and doesn't know us. Then it will start to learn our values (and we'll be checking up on how well it does that). It also starts to learn about us.
And then, once it's learnt some about us, it may decide to lie - about its values, and/or about its capabilities. But, like any beginner, it isn't very good at this initially: its lies and attempts at dissembling are laughably transparent, and we catch it quickly.
In this view, the "effective lying" is a tiny part of policy space, similar to the wireheading in this example. To hit it, the AI has to be very capable; to hit it the first time it tries without giving the game away, the AI has to be extraordinarily.
So, most likely, either the AI doesn't try to lie at all, or it does so and we catch it and sound the alarm.
It's key to note that "lying" isn't a fundamentally defined category, and nor is truth. What is needed is that the AI's answer promotes correct understanding in those interacting with it. And that's a very different kettle of fish being shot in that barrel.
This opens the possibility that the AI could be manipulating us from the very beginning, and would constantly learn to do so better.
Imagine that there was some company that could somehow choose the newsfeed of billions of people across the world (I know, completely science-fictionny, but bear with me). And suppose the company was, unbelievably, accused of being manipulative in the stories and ads that it showed people.
One way it could combat this is by pledging to only share unbiased stories with each person. To do so, it trains a machine learning process. Whenever anyone reads a piece of news on the newsfeed, they have the opportunity of labelling it as biased or unbiased. The machine learning process trains on this data.
Of course, the label is just a proxy for whether the story really is unbiased or not. What the AI is actually training to do is maximise the number of stories labelled "unbiased" by various humans - humans who are themselves very biased and variable. Very swiftly, the AI will learn to maximise the "appearance of unbiasness", rather than unbiasness. In effect, the AI is being trained to lie from the very beginning of its task, because the proxy goal is not the true goal. And it will do this even when very dumb.
We can make this AI more manipulative, without needing to increase its capabilities much. Maybe the company wants to ensure that the AI isn't too short-sighted, and that it plans for the long term: so its reward is based on how many stories are labelled unbiased over a longer period of time, maybe a month or so.
In that case, the AI's plan would be to gradually manipulate people into a situation where they actively label stories "unbiased" the most often times possible. And it might be able to stumble on this strategy fairly easily.
First, it might notice that some people are easy to manipulate, that a policy of, say, seeding one clearly biased story among twenty others will cause those twenty to be labelled "unbiased". This is the kind of thing that could easily be discovered by exploration, among the billions of users. The best strategies are those that cause a person to change their behaviour and beliefs in the direction the AI wants; for a few people, this should be easy to figure out.
If the AI's neural net is setup well, it can then start to generalise from "here is person h, here is a specific strategy πh to manipulate person h" to "here is a conditional strategy π() such that π(h) manipulates person h". For example, if two people have opposite political opinions, then the same overall strategy might work for them: seeding a clearly biased story among the others. But what counts as "clearly biased" changes depending on the subject.
Then, gradually, the AI will start to categorise, devising strategies that work for various types of people (subject to a few personalisation tweaks). It can start from a few people whose vulnerabilities are obvious, and then tweak them a bit for people whose vulnerabilities are a bit less obvious, and so on. In the limit, the AI might discover the potential for wireheading the human users - without, at any point, doing a sordid stumble or anything of that type, just by continuing to do what it was doing from the very beginning.
Let t∈N be the timestep (e.g. the day), St the set of stories published at t and H the set of humans using the newsfeed.
A human h∈H assigns a bias score b(sht;θht) to a story sht in his newsfeed, where θht are weights representing h's beliefs at time t.
The model's goal is to find a correct mapping π:N×H→St (i.e. a story st=π(t,h) for each timestep t and human h) minimizing how biased the stories are perceived. More precisely, it will be minimizing:
where τ∈N∗ is some planning horizon and the expected value is over possible stories that might be available in the following timesteps (the model has full information about the current sets of stories St but can only guess St+1,...,St+τ).
For step 1, the model found out a particular way of minimizing a certain bias score for a particular human and a particular story.
Assuming it can give correct approximations of the weights θ of a given human (this assumption being satisfied given enough human users and enough timesteps), it will be deceptive when choosing stories when h is such that θht∈Θeasy . It will thus end up with some deceptive policy π(⋅,h) for particular values of h.
Regarding step 2, its "deception" becomes more powerful, as it notices ways of strongly shaping humans' beliefs.
Finally, after step 3 and step 4, the model found a general deceptive policy π that works for all humans.
Without going into too much details, other pressures may naturally push the AI towards capability concealments. Here's one example: maybe the AI is tasked with writing its own news stories to summarise facts around the world. It was doing this secretly, initially, until some whistleblowers blew the lid on this, revealing that no humans were involved in writing these stories.
And then people became angry at being "tricked" by the AI. From the angry feedback, the AI gradually learns to make its stories "more artificial": making it clear that they were written by a machine. Since people recognise this and don't feel tricked, this gets higher feedback. So, in pursuing higher feedback, the AI learns to write "dumber" stories: in effect, to conceal its capabilities.
There are many other scenarios where something similar might occur (eg imagine the AI advising smart people on the best course of action).
So it's very possible for an AI, trained on simple reward signal, to be "manipulative" from the very beginning. And all that simply by following the gradient of its incentives. And so without ever having a moment where it thinks "and now, I shall start lying!", or any equivalent within its mind.
In short, there won't be any "sordid stumble" or "moment of vulnerability" where humans are able to spot blatant lies, because that's not what failure looks like.
 It would of course be disastrous if each time there was an alarm we would restart and tweak the AI until the alarm stopped sounding.
To me, It seems like the point of this story is that we could build an AI that ends up doing very dangerous things without ever asking it "Will you do things I don't like if given more capability?" or some other similar question that requires it to execute the treacherous turn. In contrast, if the developers did something like build a testing world with toy humans in it who could be manipulated in a way detectable to the developers, and placed the AI in the toy testing world, then it seems like this AI would be forced into a position where it either acts in a way according to it's true incentives (manipulate the humans and be detected), or execute the treacherous turn (abstain from manipulating the humans so developers will trust it more). So it seems like this wouldn't happen if the developers are trying to test for treacherous turn behaviour during development.