Tessellating Hills: a toy model for demons in imperfect search

by DaemonicSigil, 20th Feb 2020


If you haven't already, take a look at this post by johnswentworth to understand what this is all about: https://www.lesswrong.com/posts/KnPN7ett8RszE79PH/demons-in-imperfect-search

The short version is that while systems that use perfect search, such as AIXI, have many safety problems, a whole new set of problems arises when we start creating systems that are not perfect searchers. Patterns can form that exploit the imperfect nature of the search function to perpetuate themselves. johnswentworth refers to such patterns as "demons".

After reading that post I decided to see if I could observe demon formation in a simple model: gradient descent on a not-too-complicated mathematical function. It turns out that even in this very simplistic case, demon formation can happen. Hopefully this post will give people an example of demon formation where the mechanism is simple and easy to visualize.

Model

The function we try to minimize using gradient descent is called the loss function. Here it is:

$$L(x) = -x_0 + \epsilon \sum_{j=1}^{16} x_j \cdot \mathrm{splotch}_j(x)$$

Let me explain what some of the parts of this loss mean. Each function $\mathrm{splotch}_j(x)$ is periodic with period 2π in every component of $x$. I decided in this case to make my splotch functions out of a few randomly chosen sine waves added together.
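As a rough illustration of "a few randomly chosen sine waves added together", here is a minimal sketch of one way to build such a function. It is not taken from the post's code: the helper name `make_splotch`, the number of waves, the Gaussian amplitudes, and the restriction of frequencies to {-1, 0, 1} are all assumptions. The only property it relies on is that integer frequencies make each wave 2π-periodic in every component.

```python
import numpy as np

def make_splotch(dim, n_waves, rng):
    """Return a function R^dim -> R built from n_waves random sine waves.

    Hypothetical helper: wave count, amplitudes and frequency range are
    illustrative choices, not the values used in the original experiment.
    """
    # Integer frequency vectors make every wave 2*pi-periodic in each component.
    freqs = rng.integers(-1, 2, size=(n_waves, dim))      # entries in {-1, 0, 1}
    phases = rng.uniform(0.0, 2.0 * np.pi, size=n_waves)  # random phase per wave
    amps = rng.normal(size=n_waves)                       # random amplitude per wave

    def splotch(x):
        return float(np.sum(amps * np.sin(freqs @ x + phases)))

    return splotch

rng = np.random.default_rng(0)
splotch = make_splotch(dim=17, n_waves=5, rng=rng)

# Periodicity check: shifting one component by 2*pi should not change the value.
x = rng.normal(size=17)
shifted = x.copy()
shifted[3] += 2.0 * np.pi
print(abs(splotch(x) - splotch(shifted)) < 1e-9)
```

Any family of functions with period 2π in each coordinate would serve the same purpose; sine waves with integer frequencies are just the simplest choice.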

$\epsilon$ is chosen to be a small number, so in any local region the sum $\epsilon \sum_j x_j \cdot \mathrm{splotch}_j(x)$ will look approximately periodic: a bunch of hills repeating over and over again with period 2π across the landscape. But over large enough distances, the relative weightings of the various splotches do change. Travel a distance of 20π in the $x_j$ direction, and $\mathrm{splotch}_j$ will be a larger component of the repeating pattern than it was before. This allows for selection effects.
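The change in weighting can be made explicit with a short calculation. This assumes the loss has the form $L(x) = -x_0 + \epsilon \sum_j x_j \cdot \mathrm{splotch}_j(x)$, with each splotch weighted by its own coordinate, and writes $e_j$ for the unit vector along $x_j$ with $j \geq 1$ (notation introduced here for illustration). Because every splotch is 2π-periodic, shifting by $2\pi k$ along $e_j$ leaves all splotch values unchanged and only alters the weight on $\mathrm{splotch}_j$:

$$
\begin{aligned}
L(x + 2\pi k\, e_j) &= -x_0 + \epsilon \sum_{i} \left(x_i + 2\pi k\, \delta_{ij}\right) \mathrm{splotch}_i(x + 2\pi k\, e_j) \\
&= -x_0 + \epsilon \sum_{i} x_i \cdot \mathrm{splotch}_i(x) + 2\pi k\, \epsilon \cdot \mathrm{splotch}_j(x) \\
&= L(x) + 2\pi k\, \epsilon \cdot \mathrm{splotch}_j(x).
\end{aligned}
$$

With $k = 10$, a distance of 20π, the coefficient on $\mathrm{splotch}_j$ grows by $20\pi\epsilon$ while the landscape is otherwise unchanged.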

The $-x_0$ term means that the vector $x$ mainly wants to increase its $x_0$ component. But the splotch functions can also direct its motion. A splotch function might have a kind of ridge that directs some of the motion into other components. If $\mathrm{splotch}_j$ tends to direct motion in such a way that $x_j$ increases, then it will be selected for, becoming stronger and stronger as time goes on.
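To make the two sources of motion concrete, here is a minimal sketch of the loss and its analytic gradient. It assumes the loss has the form $L(x) = -x_0 + \epsilon \sum_{j=1}^{16} x_j \cdot \mathrm{splotch}_j(x)$ with sine-wave splotches. The constants `EPS` and `N_WAVES`, the frequency range, and all function names are hypothetical and not taken from the post's repository.

```python
import numpy as np

DIM = 17      # x_0 plus x_1..x_16 (assumed from the coordinates plotted in the figures)
EPS = 0.01    # hypothetical value for the small constant epsilon
N_WAVES = 5   # hypothetical number of sine waves per splotch

rng = np.random.default_rng(0)
# One bundle of random sine waves per splotch j = 1..16.
FREQS = rng.integers(-1, 2, size=(DIM - 1, N_WAVES, DIM))       # integer => 2*pi-periodic
PHASES = rng.uniform(0.0, 2.0 * np.pi, size=(DIM - 1, N_WAVES))
AMPS = rng.normal(size=(DIM - 1, N_WAVES))

def splotches(x):
    """Values of splotch_1..splotch_16 at x, shape (16,)."""
    return np.sum(AMPS * np.sin(FREQS @ x + PHASES), axis=1)

def splotch_grads(x):
    """Row j-1 holds the gradient of splotch_j with respect to x, shape (16, DIM)."""
    cos_terms = AMPS * np.cos(FREQS @ x + PHASES)     # shape (16, N_WAVES)
    return np.einsum("jw,jwd->jd", cos_terms, FREQS)

def loss(x):
    return -x[0] + EPS * np.dot(x[1:], splotches(x))

def grad_loss(x):
    g = EPS * (x[1:] @ splotch_grads(x))  # ridges: EPS * sum_j x_j * grad(splotch_j)
    g[0] -= 1.0                           # constant pull from the -x_0 term
    g[1:] += EPS * splotches(x)           # from differentiating the weight x_j itself
    return g

# Sanity check: compare against a central finite-difference gradient.
x = rng.normal(size=DIM)
h = 1e-6
numeric = np.array([(loss(x + h * e) - loss(x - h * e)) / (2 * h) for e in np.eye(DIM)])
print(np.max(np.abs(numeric - grad_loss(x))))
```

The first term of `grad_loss` scales with the weights $x_j$, which is why a splotch whose ridges push $x_j$ upward gets a stronger say in the direction of motion over time.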

Results

I used ordinary gradient descent, with a constant step size, and with a bit of random noise added in. Figure 1 shows the value of $x_0$ as a function of time, while Figure 2 shows the values of $x_1, x_2, \ldots, x_{16}$ as a function of time.
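For readers who want to experiment, the update rule might look something like the sketch below. The post does not say how the noise was added, so the Gaussian noise applied directly to the position, the step size, the noise scale, and the function names are all assumptions. The demo uses only the $-x_0$ term of the loss, so it shows the steady increase in $x_0$ and the random wandering of the other coordinates, not demon formation.

```python
import numpy as np

def noisy_gradient_descent(grad_fn, x_init, step_size, noise_scale, n_steps, rng):
    """Constant-step gradient descent with Gaussian noise added to each update.

    Returns an array of shape (n_steps + 1, dim) holding every iterate.
    Adding the noise to the position (not the gradient) is an assumption.
    """
    x = np.array(x_init, dtype=float)
    history = np.empty((n_steps + 1, x.size))
    history[0] = x
    for t in range(n_steps):
        noise = noise_scale * rng.normal(size=x.size)
        x = x - step_size * grad_fn(x) + noise
        history[t + 1] = x
    return history

def grad_linear(x):
    """Gradient of the -x_0 term alone: a constant pull along x_0."""
    g = np.zeros_like(x)
    g[0] = -1.0
    return g

rng = np.random.default_rng(0)
history = noisy_gradient_descent(
    grad_linear, np.zeros(17), step_size=0.1, noise_scale=0.01, n_steps=1000, rng=rng
)
print(history[-1, 0])                   # final x_0, driven upward by the constant gradient
print(np.abs(history[-1, 1:]).max())    # other coordinates only random-walk
```

Passing the gradient of the full loss in place of `grad_linear` gives the setup described here; `history[:, 0]` and `history[:, 1:]` then correspond to the quantities plotted in the two figures.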

Fig 1: $x_0$ as a function of time.

Fig 2: $x_1, x_2, \ldots, x_{16}$ as a function of time.

There are three phases to the evolution. In the first, from roughly 0 to 2500 steps, $x_0$ increases steadily and the other coordinates wander around more or less randomly. In the second, between 2500 and 4000 steps (though slowing down after 3500), a self-reinforcing combination of splotches (a "demon") takes hold and amplifies itself drastically, feeding off the large $x_0$ gradient. In the final phase, which starts at 4000 steps and likely continues indefinitely, the demon has become so strong that the search gets stuck in a local valley and further progress stops.

Having seen demons arise in such a simple situation, I wonder how commonly the same thing happens in the training of deep neural networks. Anyway, hopefully this is a useful model for people who want to understand the mechanisms behind the whole "demons in imperfect search" thing more clearly. It definitely helped me, at least.

Update: The code is now up here: https://github.com/DaemonicSigil/tessellating-hills
