Comments

Hypothesis: Unlike the language models before it, and ignoring context length issues, GPT-3's primary limitation is that its output mirrors the distribution it was trained on. Without further intervention, it will write things that are no more coherent than what the average person could put together. By conditioning it on output from smart people, GPT-3 can be switched into a mode where it outputs smart text.
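As a rough illustration of the kind of conditioning I mean, here's a sketch using the 2020-era openai completion client (the prompt text and sampling parameters are placeholders I made up):

import openai  # assumes the 2020-era openai client and an API key in the environment

# Hypothetical "smart" prefix: prepend high-quality writing so the model continues
# in that register rather than in the average register of its training corpus.
smart_prefix = (
    "The following is an essay by a careful, technically literate writer.\n\n"
    "On the limits of statistical language models:\n"
)
completion = openai.Completion.create(
    engine="davinci",  # base GPT-3
    prompt=smart_prefix + "The key point is that",
    max_tokens=200,
    temperature=0.7,
)
print(completion.choices[0].text)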

According to Gwern, it fails the Parity Task.

Hmm, the inherent 1d nature of the visualization kinda makes it difficult to check for selection effects. I'm not convinced that's actually what's going on here. 1725 is special because the ridges of the splotch function are exactly orthogonal to x0. The odds of this happening probably go down exponentially with dimensionality. Furthermore, with more dakka, one sees that the optimization rate drops dramatically after ~15000 time steps, and may or may not do so again later. So I don't think this proves selection effects are in play. An alternative hypothesis is simply that the process gets snagged by the first non-orthogonal ridge it encounters, without any serious selection effects coming into play.

Now this is one of the more interesting things I've come across.

I fiddled around with the code a bit and was able to reproduce the phenomenon with DIMS = 1, making visualisation possible:

Behold!

Here's the code I used to make the plot:

import torch
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d

DIMS = 1   # number of dimensions that xn has
WSUM = 5    # number of waves added together to make a splotch
EPSILON = 0.10 # rate at which xn controls splotch strength
TRAIN_TIME = 5000 # number of iterations to train for
LEARN_RATE = 0.2   # learning rate
MESH_DENSITY = 100 # number of points to plot in 3d mesh (if applicable)

torch.random.manual_seed(1729)

# knlist and k0list are integers, so the splotch functions are periodic
knlist = torch.randint(-2, 3, (DIMS, WSUM, DIMS)) # wavenumbers : list (controlling dim, wave id, k component)
k0list = torch.randint(-2, 3, (DIMS, WSUM))       # the x0 component of wavenumber : list (controlling dim, wave id)
slist = torch.randn((DIMS, WSUM))                # sin coefficients for a particular wave : list(controlling dim, wave id)
clist = torch.randn((DIMS, WSUM))                # cos coefficients for a particular wave : list (controlling dim, wave id)

# initialize x0, xn
x0 = torch.zeros(1, requires_grad=True)
xn = torch.zeros(DIMS, requires_grad=True)

# numpy arrays for plotting:
x0_hist = np.zeros((TRAIN_TIME,))
xn_hist = np.zeros((TRAIN_TIME, DIMS))
loss_hist = np.zeros(TRAIN_TIME,)


def model(xn, x0):
    # loss = EPSILON * sum_n xn[n] * splotch_n(x0, xn) - x0, where each splotch_n
    # is a sum of WSUM sinusoids with integer wavenumbers
    wavesum = torch.sum(knlist*xn, dim=2) + k0list*x0
    splotch_n = torch.sum(
            (slist*torch.sin(wavesum)) + (clist*torch.cos(wavesum)),
            dim=1)
    foreground_loss = EPSILON * torch.sum(xn * splotch_n)
    return foreground_loss - x0

# train:
for t in range(TRAIN_TIME):

    print(t)
    loss = model(xn,x0)
    loss.backward()
    with torch.no_grad():
        # constant step size gradient descent, with some noise thrown in
        vlen = torch.sqrt(x0.grad*x0.grad + torch.sum(xn.grad*xn.grad))
        x0 -= LEARN_RATE*(x0.grad/vlen + torch.randn(1)/np.sqrt(1.+DIMS))
        xn -= LEARN_RATE*(xn.grad/vlen + torch.randn(DIMS)/np.sqrt(1.+DIMS))
    x0.grad.zero_()
    xn.grad.zero_()
    x0_hist[t] = x0.detach().numpy()
    xn_hist[t] = xn.detach().numpy()
    loss_hist[t] = loss.detach().numpy()

plt.plot(x0_hist)
plt.xlabel('number of steps')
plt.ylabel('x0')
plt.show()
for d in range(DIMS):
    plt.plot(xn_hist[:,d])
plt.xlabel('number of training steps')
plt.ylabel('xn')
plt.show()

fig = plt.figure()
ax = plt.axes(projection='3d')
ax.plot3D(x0_hist,xn_hist[:,0],loss_hist)

#plot loss landscape
if DIMS == 1:
    x0_range = np.linspace(np.min(x0_hist),np.max(x0_hist),MESH_DENSITY)
    xn_range = np.linspace(np.min(xn_hist),np.max(xn_hist),MESH_DENSITY)
    x,y = np.meshgrid(x0_range,xn_range)
    z = np.zeros((MESH_DENSITY,MESH_DENSITY))
    with torch.no_grad():
        # use fresh names so we don't clobber the trained x0, xn tensors
        for i, x0v in enumerate(x0_range):
            for j, xnv in enumerate(xn_range):
                z[j,i] = model(torch.tensor(xnv), torch.tensor(x0v)).numpy()
    ax.plot_surface(x,y,z,color='orange',alpha=0.3)
ax.set_title("loss")
plt.show()
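For reference, the loss implemented by model above is

$$\mathcal{L}(x_0, x) = \epsilon \sum_{n=1}^{\mathrm{DIMS}} x_n \sum_{w=1}^{\mathrm{WSUM}} \Big[ s_{nw}\sin\big(k_{nw}\cdot x + k^0_{nw}\, x_0\big) + c_{nw}\cos\big(k_{nw}\cdot x + k^0_{nw}\, x_0\big) \Big] - x_0$$

where $k_{nw}$ is the row knlist[n, w, :], $k^0_{nw}$ is the entry k0list[n, w], and $s_{nw}$, $c_{nw}$ are the sine and cosine coefficients.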

My impression is that people working on self-driving cars are incredibly safety-conscious, because the risks are very salient.

Safety-conscious people working on self-driving cars don't program their cars to not take evasive action after detecting that a collision is imminent.

(It's notable to me that this doesn't already happen, given the insane hype around AI.)

I think it already has. (It was for extra care, not drugs, but it's a clear-cut case of a misspecified objective function leading to suboptimal decisions for a multitude of individuals.) I'll note, perhaps unfairly, that the fact that this study was not salient enough to come to your attention, even with a culture-war signal boost, is evidence that it would need to be a Chernobyl-level event to do so.

My worry is less that we wouldn't survive an AI-Chernobyl and more that we won't get an AI-Chernobyl at all.

I think that this is where there's a difference in models. Even in a non-FOOM scenario, I'm having a hard time envisioning a world where the gap in capabilities between AI-Chernobyl and globally catastrophic UFAI is that large. I used Chernobyl as an example because it scared the public and the industry into making things very safe, and it had a lot going for it to make that happen. Radiation is invisible and hurts you by either killing you instantly, making your skin fall off, or giving you cancer and birth defects. The disaster was also extremely expensive, with total costs on the order of 10^11 USD.

If a defective AI system manages to do something that instils the same level of fear into researchers and the public as Chernobyl did, I would expect that we were on the cusp of building systems that we couldn't control at all.

If I'm right and the gap between those two events is small, then there's a significant risk that nothing will happen in that window. We'll get plenty of warnings that won't be sufficient to instil the necessary level of caution into the community, and later down the road we'll find ourselves in a situation we can't recover from.

I agree that ML often does this, but only in situations where the results don't immediately matter. I'd find it much more compelling to see examples where the "random fix" caused actual bad consequences in the real world.

[...]

Perhaps people are optimizing for "making pretty pictures" instead of "negative log likelihood". I wouldn't be surprised if for many applications of GANs, diversity of images is not actually that important, and what you really want is that the few images you do generate look really good. In that case, it makes complete sense to push primarily on GANs, and while you try to address mode collapse, when faced with a tradeoff you choose GANs over VAEs anyway.

This is fair. However, the point of the example is more that mode dropping and bad NLL were not noticed when people started optimizing GANs for image quality. As far as I can tell, it took a while for individuals to notice, longer for it to become common knowledge, and even more time for anyone to do anything about it. Even now, the "solutions" are hacks that don't completely resolve the issue.

There was a large window of time when a practitioner could implement a GAN expecting it to cover all the modes. If there were a world where failing to cover all the modes of the distribution led to large negative consequences, the failure would probably have gone unnoticed until it was too late.

Here's a real example. This is the NTSB crash report for the Uber autonomous vehicle that killed a pedestrian. Someone should probably do an in-depth analysis of the whole thing, but for now I'll draw your attention to section 1.6.2, Hazard Avoidance and Emergency Braking. In it they say:

When the system detects an emergency situation, it initiates action suppression. This is a one-second period during which the ADS suppresses planned braking while the (1) system verifies the nature of the detected hazard and calculates an alternative path, or (2) vehicle operator takes control of the vehicle. ATG stated that it implemented action suppression process due to the concerns of the developmental ADS identifying false alarms—detection of a hazardous situation when none exists—causing the vehicle to engage in unnecessary extreme maneuvers.

[...]

if the collision cannot be avoided with the application of the maximum allowed braking, the system is designed to provide an auditory warning to the vehicle operator while simultaneously initiating gradual vehicle slowdown. In such circumstance, ADS would not apply the maximum braking to only mitigate the collision.

This strikes me as a "random fix" where the core issue was that the system did not have sufficient discriminatory power to tell apart a safe situation from an unsafe situation. Instead of properly solving this problem, the researchers put in a hack.
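To make the structure of the hack concrete, here's a rough sketch of the decision logic as the report describes it (the function, its arguments, and the return values are hypothetical, not ATG's actual implementation):

# Hypothetical sketch of the braking policy described in section 1.6.2.
ACTION_SUPPRESSION_WINDOW_S = 1.0  # the one-second suppression period from the report

def braking_decision(emergency_detected: bool,
                     hazard_confirmed_after_recheck: bool,
                     operator_took_control: bool,
                     avoidable_with_max_allowed_braking: bool) -> str:
    if not emergency_detected:
        return "continue normal operation"
    # "Action suppression": planned braking is withheld for one second while the
    # system re-verifies the hazard / plans an alternative path, or the operator
    # takes over.
    if operator_took_control or not hazard_confirmed_after_recheck:
        return "no automatic braking (suppressed for %.0f s)" % ACTION_SUPPRESSION_WINDOW_S
    if avoidable_with_max_allowed_braking:
        return "apply maximum allowed braking"
    # Collision unavoidable even with the maximum allowed braking: warn the operator
    # and slow down gradually, rather than braking hard to mitigate the impact.
    return "auditory warning + gradual slowdown"

print(braking_decision(True, True, False, False))  # -> "auditory warning + gradual slowdown"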

Suppose that we had extremely compelling evidence that any AI system run with > X amount of compute would definitely kill us all. Do you expect that problem to get swept under the rug?

I agree that we shouldn't be worried about situations where there is a clear threat. But that's not quite the class of failures that I'm worried about. Fairness, bias, and adversarial examples are all closer to what I'm getting at. The general pattern is that ML researchers hack together a system that works, but has some problems they're unaware of. Later, the problems are discovered and the reaction is to hack together a solution. This is pretty much the opposite of the safety mindset EY was talking about. It leaves room for catastrophe in the initial window when the problem goes undetected, and indefinitely afterwards if the hack is insufficient to deal with the issue.

More specifically, I'm worried about a situation where at some point during grad student descent someone says, "That's funny..." then goes on to publish their work. Later, someone else deploys their idea plus 3 orders of magnitude more computing power and we all die. That, or we don't all die. Instead we resolve the issue with a hack. Then a couple of bumps in computing power and capabilities later we all die.

The above comes across as both paranoid and far-fetched, and I'm not sure the AI community will take on the required level of caution to prevent it unless we get an AI equivalent of Chernobyl before we get UFAI. Nuclear reactor design is the only domain I know of where people are close to sufficiently paranoid.

A likely crux is that I think that the ML community will actually solve the problems, as opposed to applying a bandaid fix that doesn't scale. I don't know why there are different underlying intuitions here.

I'd be interested to hear a bit more about your position on this.

I'm going to argue for the "applying bandaid fixes that don't scale" position for a second. To me, it seems that there's a strong culture in ML of "apply random fixes until something looks like it works" and then just rolling with whatever comes out of that algorithm.

I'll draw attention to image modelling to illustrate what I'm pointing at. Up until about 2014, the main metric for evaluating an image model's quality was the Bayesian negative log likelihood. As far as I can tell, this goes all the way back to at least "To Recognize Shapes, First Learn to Generate Images", where the CD algorithm acts to (approximately) maximize the log likelihood of the data. This can be seen in the VAE paper and also the original GAN paper. However, after GANs became popular, the log likelihood metric seemed to have gone out the window. The GANs made really compelling images, and due to the difficulty of evaluating NLL, people invented new metrics: IS (Inception Score) and FID (Fréchet Inception Distance) were used to assess the quality of the generated images. I might be wrong, but I think it took a while after that for people to realize that SOTA GANs were getting terrible NLLs compared to SOTA VAEs, even though the VAEs generated images that were significantly blurrier/noisier. It also became obvious that GANs were dropping modes of the distribution, effectively failing to model entire classes of images.
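As a concrete example of the kind of check that was missing, here's a rough sketch of a mode-coverage diagnostic (the generator G, the classifier, and the class/sample counts are placeholders of mine, not any particular paper's procedure):

import torch
from collections import Counter

def mode_coverage(G, classifier, num_classes, n_samples=10000, z_dim=128, batch=100):
    # Generate samples, label them with a classifier trained on the real data,
    # and check whether any class never shows up among the generated samples.
    counts = Counter()
    with torch.no_grad():
        for _ in range(n_samples // batch):
            z = torch.randn(batch, z_dim)
            labels = classifier(G(z)).argmax(dim=1)
            counts.update(labels.tolist())
    missing = [c for c in range(num_classes) if counts[c] == 0]
    return counts, missing  # non-empty 'missing' = dropped modes (by this proxy)

# e.g. for an MNIST GAN: counts, missing = mode_coverage(G, digit_classifier, 10)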

As far as I can tell, there's been a lot of work to get GANs to model all image modes. The most salient and recent would be DeepMind's PresGAN, where they clearly show the issue and how PresGAN solves it in Figure 1. However, looking at Table 5, there's still a huge gap in NLL between PresGAN and VAEs. It seems to me that most of the attempts to solve this issue are very similar to "bandaid fixes that don't scale", in the sense that they mostly feel like hacks. None of them really address the gap in likelihood between VAEs and GANs.

I'm worried that a similar story could happen with AI safety. A problem arises and gets swept under the rug for a bit. Later, it's rediscovered and becomes common knowledge. Then, instead of solving it before moving forward, we see massive increases in capabilities. Simultaneously, the problem is at most addressed with hacks that don't really solve the problem, or solve it just enough to prevent the increase in capabilities from becoming obviously unjustified.

Does anyone know if double descent happens when you look at the posterior predictive rather than just the output of SGD? I wouldn't be too surprised if it does, but before we start talking about the Bayesian perspective, I'd like to see evidence that this isn't just an artifact of using optimization instead of integration.

I wonder if this is a neural network thing, an SGD thing, or both? I would love to see what happens when you swap out SGD for something like HMC, NUTS, or (if we're resource-constrained) ATMC. If we still see the same effects, then that tells us this is because of the distribution of functions that neural networks represent, since we're effectively drawing samples from an approximation to the posterior. Otherwise, it would mean that SGD plays a role.
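Here's a minimal sketch of the kind of experiment I mean, using Pyro's NUTS on a tiny regression net (the data, priors, widths, and sample counts are all placeholders; the real experiment would sweep the hidden width and plot the test error of the posterior predictive against it):

import torch
import pyro
import pyro.distributions as dist
from pyro.infer import MCMC, NUTS, Predictive

def bnn_model(X, y=None, hidden=16):
    # Small Bayesian MLP with standard normal priors on all weights;
    # 'hidden' is the width you'd sweep to look for double descent.
    d_in = X.shape[1]
    w1 = pyro.sample("w1", dist.Normal(torch.zeros(d_in, hidden), 1.).to_event(2))
    b1 = pyro.sample("b1", dist.Normal(torch.zeros(hidden), 1.).to_event(1))
    w2 = pyro.sample("w2", dist.Normal(torch.zeros(hidden, 1), 1.).to_event(2))
    b2 = pyro.sample("b2", dist.Normal(torch.zeros(1), 1.).to_event(1))
    mean = (torch.relu(X @ w1 + b1) @ w2 + b2).squeeze(-1)
    with pyro.plate("data", X.shape[0]):
        pyro.sample("obs", dist.Normal(mean, 0.1), obs=y)

# Toy data; swap in whatever setup shows double descent under SGD.
X_train, y_train = torch.randn(100, 5), torch.randn(100)
X_test = torch.randn(20, 5)

mcmc = MCMC(NUTS(bnn_model), num_samples=300, warmup_steps=300)
mcmc.run(X_train, y_train)

# Posterior predictive: average over posterior samples instead of using the
# single function that SGD happens to converge to.
predictive = Predictive(bnn_model, posterior_samples=mcmc.get_samples())
y_pred = predictive(X_test)["obs"].mean(0)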

what exactly are the magical inductive biases of modern ML that make interpolation work so well?

Are you aware of this work and the papers they cite?

From the abstract:

We prove that the binary classifiers of bit strings generated by random wide deep neural networks with ReLU activation function are biased towards simple functions. The simplicity is captured by the following two properties. For any given input bit string, the average Hamming distance of the closest input bit string with a different classification is at least √(n / (2π log n)), where n is the length of the string. Moreover, if the bits of the initial string are flipped randomly, the average number of flips required to change the classification grows linearly with n. These results are confirmed by numerical experiments on deep neural networks with two hidden layers, and settle the conjecture stating that random deep neural networks are biased towards simple functions. This conjecture was proposed and numerically explored in [Valle Pérez et al., ICLR 2019] to explain the unreasonably good generalization properties of deep learning algorithms. The probability distribution of the functions generated by random deep neural networks is a good choice for the prior probability distribution in the PAC-Bayesian generalization bounds. Our results constitute a fundamental step forward in the characterization of this distribution, therefore contributing to the understanding of the generalization properties of deep learning algorithms.

I would field the hypothesis that large volumes of neural network space are devoted to functions that are similar to functions with low K-complexity, and small volumes of NN-space are devoted to functions that are similar to high K-complexity functions, leading to a Solomonoff-like prior over functions.
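One cheap way to poke at this empirically, in the spirit of the Valle Pérez et al. experiments (a sketch; the architecture, default PyTorch initialisation, and sample counts are arbitrary choices of mine): sample random ReLU nets on short bit strings and look at how unevenly the induced distribution over Boolean functions is concentrated.

import torch
from collections import Counter

N_BITS = 7       # input bit-string length (2^7 = 128 possible inputs)
HIDDEN = 40      # width of the two hidden layers
N_NETS = 20000   # number of random networks to sample

# Enumerate all 2^N_BITS input bit strings as a (128, 7) float tensor.
inputs = torch.tensor(
    [[(i >> b) & 1 for b in range(N_BITS)] for i in range(2**N_BITS)],
    dtype=torch.float32)

counts = Counter()
with torch.no_grad():
    for _ in range(N_NETS):
        # Fresh random 2-hidden-layer ReLU net (default init as a stand-in for the
        # Gaussian init used in the papers); threshold the output to get a Boolean
        # function over all 128 inputs.
        net = torch.nn.Sequential(
            torch.nn.Linear(N_BITS, HIDDEN), torch.nn.ReLU(),
            torch.nn.Linear(HIDDEN, HIDDEN), torch.nn.ReLU(),
            torch.nn.Linear(HIDDEN, 1))
        bits = (net(inputs).squeeze(-1) > 0).tolist()
        counts[tuple(bits)] += 1

# Under a uniform prior over the 2^128 Boolean functions, 20000 samples would
# essentially never repeat; a heavy head here is the simplicity bias in action.
print("distinct functions sampled:", len(counts))
print("most common:", counts.most_common(5))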
