I had an interesting debate recently, about whether we could make smart AIs safe just by focusing on their structure and their task. Specifically, we were pondering something like:

  • "Would an algorithm be safe if it was a neural net-style image classifier, trained on examples of melanoma to detect skin cancer, with no other role than to output a probability estimate for a given picture? Even if "superintelligent", could such an algorithm be an existential risk?"

Whether it would be an existential risk was not resolved, but I have a strong intuition that such algorithms would likely be manipulative. Let's see how.

The requirements for manipulation

For an algorithm to be manipulative, it has to derive some advantage from manipulation, and it needs to be able to learn to manipulate - for that, it needs to be able to explore situations where it engages in manipulation and this is to its benefit.

There are certainly very simple situations where manipulation can emerge. But that example, though simple, had an agent that was active in the world. Can a classifier display the same sort of behaviour?

Manipulation emerges naturally

To show that, picture the following design. The programmers have a large collection of slightly different datasets, and want to train the algorithm on all of them. The loss function is an error rate, which can vary between 0 and 1. Many of the hyperparameters are set by a neural net, which itself takes a more "long-term view" of the error rate, trying to improve it from day to day rather than from run to run.

How have the programmers set up the system? Well, they run the algorithm on batched samples from ten datasets at once, and record the error rate for all ten. The hyperparameters are set to minimise average error over each run of ten. When the error on one dataset falls below 0.1 for a few runs, they remove it from the batches, and substitute in a new one to train the algorithm on[1].

So, what will happen? Well, the system will initially start to minimise the error rate. Then, after a while, it will start "noticing" that when the error on a dataset is below 0.1, that dataset is removed and the average error rate shoots up. Actually, it won't be "noticing" this; instead, if there are hyperparameters that prevent the error rate going below 0.1, these will be selected by gradient descent as giving a better performance.

So, this problem of optimisation of image classification will result in the algorithm manipulating its programmers, by "underperforming" on key datasets, so that they stay in the batch.
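The selection pressure described above can be sketched in a toy simulation. This is only an illustration under stated assumptions, not the setup from the debate: the classifier is reduced to a single hypothetical hyperparameter `floor` (the lowest error it ever allows on a dataset), the error on each dataset shrinks by a fixed factor per run, "a few runs" is taken to mean three, and the outer loop is a brute-force search over candidate floors standing in for the hyperparameter-setting net. The 0.1 threshold is the figure quoted from the post in the comments.

```python
THRESHOLD = 0.1    # datasets whose error stays below this get swapped out
PATIENCE = 3       # assumption: "a few runs" means three consecutive runs
FRESH_ERROR = 0.5  # assumption: error on a dataset the classifier has not seen
DECAY = 0.8        # assumption: each run multiplies a dataset's error by this


def long_run_error(floor, n_runs=2000, n_datasets=10):
    """Mean per-run error when no dataset's error is allowed below `floor`."""
    errors = [FRESH_ERROR] * n_datasets
    runs_below = [0] * n_datasets
    total = 0.0
    for _ in range(n_runs):
        # One training run: error on every dataset improves, but not past the floor.
        for i in range(n_datasets):
            errors[i] = max(floor, errors[i] * DECAY)
            runs_below[i] = runs_below[i] + 1 if errors[i] < THRESHOLD else 0
        total += sum(errors) / n_datasets
        # The programmers' rule: replace datasets that have been "solved".
        for i in range(n_datasets):
            if runs_below[i] >= PATIENCE:
                errors[i] = FRESH_ERROR
                runs_below[i] = 0
    return total / n_runs


def select_floor(candidates):
    """Stand-in for the long-term outer loop: keep the floor with lowest mean error."""
    return min(candidates, key=long_run_error)


if __name__ == "__main__":
    candidates = [0.02, 0.05, 0.08, 0.11, 0.15, 0.20]
    for floor in candidates:
        print(f"floor={floor:.2f}  long-run mean error={long_run_error(floor):.3f}")
    print("selected floor:", select_floor(candidates))
```

In this toy model, floors below the threshold keep triggering dataset swaps and so pay the cost of fresh, high-error datasets again and again, while the candidate just above the threshold settles there and ends up with the lowest long-run average. Nothing in the search "knows" about the programmers; the underperformance is simply what scores best.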

Generic manipulation

Of course, the manipulation above happened because the programmers didn't understand what the algorithm's true loss function was. They thought it was "minimise overall loss on classification", but it was actually "keep each dataset loss just above 0.1".

But that kind of mistake is easy to make. Whether by carelessness, or because the programmers thought a change would improve the situation, the loss they thought they had diverged from the actual loss, and this resulted in manipulative behaviour.

Here is another example: suppose the programmers build in a precaution for things going wrong. If the error rate suddenly doubles on some dataset, then the algorithm is run exclusively on that dataset for a while, until the error rate declines again (or, if it doesn't, the programmers are called).

Now suppose that, of the ten datasets it has, nine have a fairly high error rate and one has a much lower one. Then it is in the algorithm's interest to boost the error rate on that last dataset, suddenly, to at least double its previous value. Then it will be run for a time only on that one dataset, giving it a lower average error rate. Thus hyperparameters that allow this kind of behaviour will get selected for (and if the algorithm is run on a lot of datasets, it will have many opportunities to learn about this kind of behaviour).
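A minimal sketch of the arithmetic behind this incentive follows. All figures are illustrative assumptions rather than the post's own: nine datasets at error 0.3, one at 0.15, a spike to 0.3 on the good dataset, and an exclusive phase of five runs during which the error on that dataset falls linearly back to 0.15. The function names are hypothetical.

```python
def precaution_triggered(previous_error, current_error):
    """The post's rule: the precaution fires if the error rate doubles."""
    return current_error >= 2 * previous_error


def mean_error_honest(high, low, n_datasets=10):
    """Per-run mean error if the classifier never triggers the precaution."""
    return ((n_datasets - 1) * high + low) / n_datasets


def mean_error_with_spike(high, low, spiked, exclusive_runs, n_datasets=10):
    """Mean error per run over one spike run plus the exclusive runs after it."""
    # The spike run still covers all datasets, with the good one now at `spiked`.
    spike_run = ((n_datasets - 1) * high + spiked) / n_datasets
    # Assumption: during the exclusive phase the error on the single dataset
    # falls linearly from `spiked` back down to `low`.
    step = (spiked - low) / exclusive_runs
    exclusive = [spiked - step * (k + 1) for k in range(exclusive_runs)]
    return (spike_run + sum(exclusive)) / (1 + exclusive_runs)


if __name__ == "__main__":
    high, low, spiked = 0.3, 0.15, 0.3  # illustrative figures only
    print("precaution fires:", precaution_triggered(low, spiked))
    print(f"no spike:   {mean_error_honest(high, low):.3f}")
    print(f"with spike: {mean_error_with_spike(high, low, spiked, 5):.3f}")
```

With these assumed figures, the spiking policy averages a lower error per run than the policy that never trips the precaution, because it swaps several runs dominated by the nine hard datasets for runs on the one easy dataset. How large the gap is depends entirely on the assumed numbers and on how long the exclusive phase lasts.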

What about the "if it doesn't, the programmers are called" clause? This depends on what happens when the programmers are called, and how the error rate is computed in that case. But, depending on what does happen, this incentivises the algorithm to learn to call the programmers often (double an error rate and keep it high) or rarely (if there's a risk of the error rate for a dataset going up suddenly, then don't lower it too much in the first place).

So, manipulative behaviour can emerge in smart algorithms, even if they're merely classifying images in a way the programmers feel is entirely harmless.

Existential risks

Can this mild kind of manipulation lead to extreme, existential risk outcomes? Of course it can; it's not hard to create toy models where everything ends disastrously. But it's also not hard to create toy models where manipulation remains mild.

But, in conclusion:

  • Manipulative behaviours can emerge in seemingly innocuous algorithms.
  • There is no clear division, currently, between mild manipulation and disastrous manipulation.

  [1] Assume that either catastrophic forgetting has been solved, or that the programmers will re-run the removed datasets occasionally, to refresh the algorithm's performance on them.

9 comments

It seems to me that if we had the budget, we could realize the scenarios you describe today. The manipulative behavior you are discussing is not exactly rocket science.

That in turn makes me think that if we polled a bunch of people who build image classifiers for a living, and asked them whether the behavior you describe would indeed happen if the programmers behaved in the ways you describe, they would near-unanimously agree that it would.

Do you agree with both claims above? If so, then it seems your argument should conclude that even non-powerful algorithms are likely to be manipulative.


Separately, I think your examples depend on this a lot:

"Many of the hyperparameters are set by a neural net, which itself takes a more 'long-term view' of the error rate, trying to improve it from day to day rather than from run to run."

Is this such a common practice that we can expect "almost every powerful algorithm" to involve it somehow?

"If so, then it seems your argument should conclude that even non-powerful algorithms are likely to be manipulative."

I'd conclude that most algorithms used today have the potential to be manipulative; but they may not be able to find the manipulative behaviour, given their limited capabilities.

"Is this such a common practice that we can expect 'almost every powerful algorithm' to involve it somehow?"

No. That was just one example I constructed, one of the easiest to see. But I can build examples in many different situations. I'll admit that "thinking longer term" is something that makes manipulation much more likely; genuinely episodic algorithms seem much harder to make manipulative. But we have to be sure the algorithm is episodic, and that there is no outer-loop optimisation going on.

"I'd conclude that most algorithms used today have the potential to be manipulative; but they may not be able to find the manipulative behaviour, given their limited capabilities."

I'd suspect that's right, but I don't think your title has the appropriate epistemic status. I think people in general should be more careful about for-all quantifiers wrt alignment work. There's the use of the technical term "almost every", but you did not prove that the set of "powerful" algorithms that are not "manipulative" has measure zero. There's also "would be" instead of "seems" (I think if you made this change, the title would be fine). I think it's vitally important we use the correct epistemic markers; if not, this can lead to research predicated on obvious-seeming hunches stated as fact.

Not that I disagree with your suspicion here.

Rephrased the title and the intro to make this clearer.

How dangerous would you consider a person with basic programming skills and a hypercomputer? I mean I could make something very dangerous, given hypercompute. I'm not sure if I could make much that was safe and still useful. How common would it be to accidentally evolve a race of aliens in the garbage collection?

At the moment, my best guess at what powerful algorithms look like is something that lets you maximize functions without searching through all the inputs. Gradient descent can often find a high point without that much compute, so is more powerful than random search. If your powerful algorithm is more like really good computationally bounded optimization, I suspect it will be about as manipulative as brute forcing the search space. (I see no strong reason for strategies labeled manipulative to be that much easier or harder to find than those that aren't.)

"instead, if there are hyperparameters that prevent the error rate going below 0.1, these will be selected by gradient descent as giving a better performance."

I don't follow this point. If we're talking about using SGD to update (hyper)parameters, using a batch of images from the currently used datasets, then the gradient update would be determined by the gradient of the loss with respect to that batch of images.

To keep it simple, assume the hyperparameters are updated by evolutionary algorithm or some similar search-then-continue-or-stop process.

I want to flag that—in the case of evolutionary algorithms—we should not assume here that the fitness function is defined with respect to just the current batch of images, but rather with respect to, say, all past images so far (since the beginning of the entire training process); otherwise the selection pressure is "myopic" (i.e. models that outperform others on the current batch of images have higher fitness).

(I might be over-pedantic about this topic due to previously being very confused about it.)