Plausibly, almost every powerful algorithm would be manipulative

[-]Daniel Kokotajlo6y30

It seems to me that if we had the budget, we could realize the scenarios you describe today. The manipulative behavior you are discussing is not exactly rocket science.

That in turn makes me think that if we polled a bunch of people who build image classifiers for a living, and asked them whether the behavior you describe would indeed happen if the programmers behaved in the ways you describe, they would near-unanimously agree that it would.

Do you agree with both claims above? If so, then it seems your argument should conclude that even non-powerful algorithms are likely to be manipulative.

Separately, I think your examples depend on this a lot:

> Many of the hyperparameters are set by a neural net, which itself takes a more "long-term view" of the error rate, trying to improve it from day to day rather than from run to run.

Is this such a common practice that we can expect "almost every powerful algorithm" to involve it somehow?

[-]Stuart_Armstrong6y10

> If so, then it seems your argument should conclude that even non-powerful algorithms are likely to be manipulative.

I'd conclude that most algorithms used today have the potential to be manipulative; but they may not be able to find the manipulative behaviour, given their limited capabilities.

> Is this such a common practice that we can expect "almost every powerful algorithm" to involve it somehow?

No. That was just one example I constructed, one of the easiest to see. But I can build examples in many different situations. I'll admit that "thinking longer term" is something that makes manipulation much more likely; genuinely episodic algorithms seem much harder to make manipulative. But we have to be sure the algorithm is episodic, and that there is no outer-loop optimisation going on.

[-]TurnTrout6y40

> I'd conclude that most algorithms used today have the potential to be manipulative; but they may not be able to find the manipulative behaviour, given their limited capabilities.

I'd suspect that's right, but I don't think your title has the appropriate epistemic status. I think people in general should be more careful about for-all quantifiers with respect to alignment work. The title uses the technical term "almost every", but you did not prove that the set of "powerful" algorithms that are not "manipulative" has measure zero. There's also "would be" instead of "seems" (I think if you made this change, the title would be fine). I think it's vitally important that we use the correct epistemic markers; if we don't, this can lead to research predicated on obvious-seeming hunches stated as fact.

Not that I disagree with your suspicion here.

[-]Stuart_Armstrong6y10

Rephrased the title and the intro to make this clearer.

[-]romeostevensit6y20

Feels relevant to https://www.alignmentforum.org/posts/CHSRhSKcrSmQWnD6A/towards-an-intentional-research-agenda

[-]Donald Hobson6y10

How dangerous would you consider a person with basic programming skills and a hypercomputer? I mean I could make something very dangerous, given hypercompute. I'm not sure if I could make much that was safe and still useful. How common would it be to accidentally evolve a race of aliens in the garbage collection?

At the moment, my best guess at what powerful algorithms look like is something that lets you maximize functions without searching through all the inputs. Gradient descent can often find a high point without that much compute, so it is more powerful than random search. If your powerful algorithm is more like really good computationally bounded optimization, I suspect it will be about as manipulative as brute forcing the search space. (I see no strong reason for strategies labeled manipulative to be that much easier or harder to find than those that aren't.)
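As a rough illustration of the "gradient methods beat random search on the same budget" point, here is a minimal sketch. Everything in it is an assumption made for the example: the 20-dimensional concave objective, the `TARGET` location of the maximum, the sampling box, the step size, and the budget of 200 evaluations. It says nothing about manipulation as such; it only shows the difference in search power.

```python
import random

DIM = 20
TARGET = [1.0] * DIM  # hypothetical location of the maximum


def objective(x):
    """Concave toy objective; its maximum value is 0, reached at TARGET."""
    return -sum((xi - ti) ** 2 for xi, ti in zip(x, TARGET))


def gradient(x):
    """Analytic gradient of the toy objective."""
    return [-2.0 * (xi - ti) for xi, ti in zip(x, TARGET)]


def random_search(budget, rng):
    """Evaluate `budget` uniformly random points and keep the best value."""
    best = float("-inf")
    for _ in range(budget):
        candidate = [rng.uniform(-5.0, 5.0) for _ in range(DIM)]
        best = max(best, objective(candidate))
    return best


def gradient_ascent(budget, rng, step_size=0.1):
    """Take `budget` gradient steps from a random starting point."""
    x = [rng.uniform(-5.0, 5.0) for _ in range(DIM)]
    for _ in range(budget):
        g = gradient(x)
        x = [xi + step_size * gi for xi, gi in zip(x, g)]
    return objective(x)


rng = random.Random(0)
rs_value = random_search(200, rng)
ga_value = gradient_ascent(200, rng)
print("random search best value:  ", rs_value)
print("gradient ascent final value:", ga_value)
```

With the same number of evaluations, the gradient-based search ends far closer to the maximum than random sampling does, and the gap widens as `DIM` grows.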

[-]Ofer6y10

> instead, if there are hyperparameters that prevent the error rate going below 0.1, these will be selected by gradient descent as giving a better performance.

I don't follow this point. If we're talking about using SGD to update (hyper)parameters, using a batch of images from the currently used datasets, then the gradient update would be determined by the gradient of the loss with respect to that batch of images.
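To make the point concrete, here is a minimal sketch of a single SGD step, assuming a one-dimensional linear model with mean squared error. The model, the names `batch_gradient` and `sgd_step`, the learning rate, and the three-point batch are all hypothetical stand-ins, not the setup from the post. The thing to notice is that the update is a function of the current parameters and the current batch only; nothing about error rates across runs or days enters it.

```python
def batch_loss(w, b, batch):
    """Mean squared error of the model y = w * x + b on the batch."""
    return sum((w * x + b - y) ** 2 for x, y in batch) / len(batch)


def batch_gradient(w, b, batch):
    """Gradient of batch_loss with respect to (w, b), for this batch only."""
    n = len(batch)
    grad_w = sum(2.0 * (w * x + b - y) * x for x, y in batch) / n
    grad_b = sum(2.0 * (w * x + b - y) for x, y in batch) / n
    return grad_w, grad_b


def sgd_step(w, b, batch, lr=0.05):
    """One SGD update; depends only on (w, b) and the current batch."""
    grad_w, grad_b = batch_gradient(w, b, batch)
    return w - lr * grad_w, b - lr * grad_b


# Hypothetical batch drawn from y = 2x + 1.
batch = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
w, b = 0.0, 0.0
loss_before = batch_loss(w, b, batch)
w, b = sgd_step(w, b, batch)
loss_after = batch_loss(w, b, batch)
print(loss_before, loss_after)
```

Under this sketch, a step can only move the parameters in the direction that lowers the loss on the batch at hand, which is why a pressure to keep the error above 0.1 would have to come from some other, outer process.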

[-]Stuart_Armstrong6y10

To keep it simple, assume the hyperparameters are updated by evolutionary algorithm or some similar search-then-continue-or-stop process.

[-]Ofer6y10

I want to flag that, in the case of evolutionary algorithms, we should not assume here that the fitness function is defined with respect to just the current batch of images, but rather with respect to, say, all past images so far (since the beginning of the entire training process); otherwise the selection pressure is "myopic" (i.e. models that outperform others on the current batch of images have higher fitness).

(I might be over-pedantic about this topic due to previously being very confused about it.)
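A minimal sketch of the distinction, assuming each candidate model is summarised by a list of per-batch error rates. The candidate names `A` and `B`, the error values, and the two fitness functions are hypothetical, chosen only so that the two definitions of fitness disagree.

```python
def myopic_fitness(errors, t):
    """Fitness from the current batch only (higher is better)."""
    return -errors[t]


def cumulative_fitness(errors, t):
    """Fitness from the mean error over all batches seen so far."""
    history = errors[: t + 1]
    return -sum(history) / len(history)


def select(population, fitness, t):
    """Return the name of the candidate with the highest fitness at step t."""
    return max(population, key=lambda name: fitness(population[name], t))


# Hypothetical per-batch error rates for two candidate models.
population = {
    "A": [0.30, 0.30, 0.05],  # poor early on, best on the latest batch
    "B": [0.10, 0.10, 0.12],  # consistently good, slightly worse on the latest batch
}

t = 2  # index of the current batch
myopic_winner = select(population, myopic_fitness, t)
cumulative_winner = select(population, cumulative_fitness, t)
print("myopic selection picks:    ", myopic_winner)
print("cumulative selection picks:", cumulative_winner)
```

The myopic rule picks `A` and the cumulative rule picks `B`, so which fitness definition is in force determines whether selection can reward anything beyond performance on the batch currently in front of the model.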

Assume that either catastrophic forgetting has been solved, or that they'll re-run the removed datasets occasionally, to refresh the algorithm's performance on that dataset.

AI ALIGNMENT FORUM