# All of FactorialCode's Comments + Replies

Hypothesis: Unlike the language models before it, and ignoring context length issues, GPT-3's primary limitation is that its output mirrors the distribution it was trained on. Without further intervention, it will write things that are no more coherent than what the average person could put together. By conditioning it on output from smart people, GPT-3 can be switched into a mode where it outputs smart text.

According to Gwern, it fails the Parity Task.
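For reference, the parity task is usually taken to mean deciding whether a bit string contains an even or odd number of 1s. Here is a minimal sketch of the ground truth that a model's answers could be scored against, assuming that reading. The names `parity` and `make_prompt` and the prompt layout are illustrative, not Gwern's actual setup.

```python
def parity(bits: str) -> str:
    """Return 'even' or 'odd' for the number of 1s in a bit string."""
    if set(bits) - {"0", "1"}:
        raise ValueError("expected only '0' and '1' characters")
    return "odd" if bits.count("1") % 2 else "even"


def make_prompt(bits: str) -> str:
    # Hypothetical prompt layout: space-separated bits, then a slot for the answer.
    return f"Bits: {' '.join(bits)}\nParity:"
```

The interesting part of the task is that the answer depends on every bit at once, so a single flipped bit anywhere in the input changes the label.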

Hmm, the inherent 1d nature of the visualization kinda makes it difficult to check for selection effects. I'm not convinced that's actually what's going on here. 1725 is special because the ridges of the splotch function are exactly orthogonal to x0. The odds of this happening probably go down exponentially with dimensionality. Furthermore, with more dakka, one sees that the optimization rate drops dramatically after ~15000 time steps, and may or may not do so again later. So I don't think this proves selection effects are in play. An alternative hypothesi

...

Now this is one of the more interesting things I've come across.

I fiddled around with the code a bit and was able to reproduce the phenomenon with DIMS = 1, making visualisation possible:

Behold!

Here's the code I used to make the plot:

import torch
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d

DIMS = 1   # number of dimensions that xn has
WSUM = 5    # number of waves added together to make a splotch
EPSILON = 0.10 # rate at which xn controls splotch strength
TRAIN_TIME = 5000 # number of iterations to train for
LEARN
...
DaemonicSigil (4y, 1 karma)
That's very cool, thanks for making it. At first I was worried that this meant that my model didn't rely on selection effects. Then I tried a few different random seeds, and some, like 1725, didn't show demon-like behaviour. So I think we're still good.

My impression is that people working on self-driving cars are incredibly safety-conscious, because the risks are very salient.

Safety conscious people working on self driving cars don't program their cars to not take evasive action after detecting that a collision is imminent.

(It's notable to me that this doesn't already happen, given the insane hype around AI.)

I think it already has. (It was for extra care, not drugs, but it's a clear-cut case of a misspecified objective function leading to suboptimal decisions for a multitude of individuals.) I'll n

...
Rohin Shah (4y, 2 karma)
I agree that Tesla does not seem very safety conscious (but it's notable that they are still safer than human drivers in terms of fatalities per mile, if I remember correctly?) Huh, what do you know. Faced with an actual example, I'm realizing that what I actually expect would cause people to take it more seriously is a) the belief that AGI is near and b) an example where the AI algorithm "deliberately" causes a problem (i.e. "with full knowledge" that the thing it was doing was not what we wanted). I think most deep RL researchers already believe that reward hacking is a thing (which is what that study shows). Tangential, but that makes it less likely that I read it; I try to completely ignore anything with the term "racial bias" in its title unless it's directly pertinent to me. (Being about AI isn't enough to make it pertinent to me.)

My worry is less that we wouldn't survive AI-Chernobyl as much as it is that we won't get an AI-Chernobyl.

I think that this is where there's a difference in models. Even in a non-FOOM scenario I'm having a hard time envisioning a world where the gap in capabilities between AI-Chernobyl and global catastrophic UFAI is that large. I used Chernobyl as an example because it scared the public and the industry into making things very safe. It had a lot going for it to make that happen. Radiation is invisible and hurts you by either killing you instantly, making

...
Rohin Shah (4y, 3 karma)
My impression is that people working on self-driving cars are incredibly safety-conscious, because the risks are very salient. I don't think AI-Chernobyl has to be a Chernobyl level disaster, just something that makes the risks salient. E.g. perhaps an elder care AI robot pretends that all of its patients are fine in order to preserve its existence, and this leads to a death and is then discovered. If hospitals let AI algorithms make decisions about drugs according to complicated reward functions, I would expect this to happen with current capabilities. (It's notable to me that this doesn't already happen, given the insane hype around AI.)

I agree that ML often does this, but only in situations where the results don't immediately matter. I'd find it much more compelling to see examples where the "random fix" caused actual bad consequences in the real world.

[...]

Perhaps people are optimizing for "making pretty pictures" instead of "negative log likelihood". I wouldn't be surprised if for many applications of GANs, diversity of images is not actually that important, and what you really want is that the few images you do generate look really good. In that case, it makes complete sense to p

...
johnswentworth (4y, 1 karma)
Important thing to remember is that Rohin is explicitly talking about a non-foom scenario, so the assumption is that humanity would survive AI-Chernobyl.

A likely crux is that I think that the ML community will actually solve the problems, as opposed to applying a bandaid fix that doesn't scale. I don't know why there are different underlying intuitions here.

I'd be interested to hear a bit more about your position on this.

I'm going to argue for the "applying bandaid fixes that don't scale" position for a second. To me, it seems that there's a strong culture in ML of "apply random fixes until something looks like it works" and then just rolling with whatever comes out of that algorithm.

I'll draw attention

...
To me, it seems that there's a strong culture in ML of "apply random fixes until something looks like it works" and then just rolling with whatever comes out of that algorithm.

I agree that ML often does this, but only in situations where the results don't immediately matter. I'd find it much more compelling to see examples where the "random fix" caused actual bad consequences in the real world.

I'll draw attention to image modelling to illustrate what I'm pointing at. [...] It also became obvious that GANs were d
...

Does anyone know if double descent happens when you look at the posterior predictive rather than just the output of SGD? I wouldn't be too surprised if it does, but before we start talking about the Bayesian perspective, I'd like to see evidence that this isn't just an artifact of using optimization instead of integration.

I wonder if this is a neural network thing, an SGD thing, or a both thing? I would love to see what happens when you swap out SGD for something like HMC or NUTS, or ATMC if we're resource constrained. If we still see the same effects, then that tells us that this is because of the distribution of functions that neural networks represent, since we're effectively drawing samples from an approximation to the posterior. Otherwise, it would mean that SGD plays a role.
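A rough sketch of what that comparison could look like on a toy problem: fit a tiny tanh network once by optimization, then sample its weights and average predictions. Everything here is an assumption made for illustration: the network size `H`, the Gaussian prior and noise scales, and especially the sampler, which is plain random-walk Metropolis standing in for HMC/NUTS/ATMC (much less efficient, but short enough to read).

```python
import numpy as np

H = 8  # hidden units (illustrative choice)


def predict(theta, x):
    """One-hidden-layer tanh net: f(x) = sum_j c_j * tanh(a_j * x + b_j)."""
    a, b, c = theta[:H], theta[H:2 * H], theta[2 * H:]
    return np.tanh(np.outer(x, a) + b) @ c


def log_posterior(theta, x, y, noise_std=0.2, prior_std=2.0):
    """Unnormalized log posterior: Gaussian likelihood plus Gaussian prior."""
    resid = y - predict(theta, x)
    return (-0.5 * np.sum(resid ** 2) / noise_std ** 2
            - 0.5 * np.sum(theta ** 2) / prior_std ** 2)


def grad_log_posterior(theta, x, y, noise_std=0.2, prior_std=2.0):
    """Analytic gradient of log_posterior with respect to theta."""
    a, b, c = theta[:H], theta[H:2 * H], theta[2 * H:]
    h = np.tanh(np.outer(x, a) + b)          # (n, H) hidden activations
    r = (y - h @ c) / noise_std ** 2         # (n,) scaled residuals
    back = np.outer(r, c) * (1.0 - h ** 2)   # (n, H) signal through tanh
    grad = np.concatenate([x @ back, back.sum(axis=0), h.T @ r])
    return grad - theta / prior_std ** 2


def fit_point_estimate(theta0, x, y, steps=500, lr=1e-2):
    """Gradient ascent with step halving, so the objective never decreases."""
    theta, best = theta0.copy(), log_posterior(theta0, x, y)
    for _ in range(steps):
        g, step = grad_log_posterior(theta, x, y), lr
        while step > 1e-12:
            cand = theta + step * g
            val = log_posterior(cand, x, y)
            if val > best:
                theta, best = cand, val
                break
            step *= 0.5
    return theta


def metropolis(theta0, x, y, n_samples=4000, step=0.02, seed=0):
    """Random-walk Metropolis; the first half of the chain is dropped as burn-in."""
    rng = np.random.default_rng(seed)
    theta, logp = theta0.copy(), log_posterior(theta0, x, y)
    kept = []
    for i in range(n_samples):
        prop = theta + step * rng.normal(size=theta.shape)
        logp_prop = log_posterior(prop, x, y)
        if np.log(rng.uniform()) < logp_prop - logp:
            theta, logp = prop, logp_prop
        if i >= n_samples // 2:
            kept.append(theta.copy())
    return np.array(kept)


def posterior_predictive_mean(samples, x):
    """Average the network's predictions over sampled weight vectors."""
    return np.mean([predict(t, x) for t in samples], axis=0)
```

To run the experiment in spirit, one would sweep the model size, and for each size compare the test error of `predict(fit_point_estimate(...), x_test)` against `posterior_predictive_mean(metropolis(...), x_test)`. A chain this short will not mix well on larger models, so any conclusion drawn from it would need a proper sampler.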

what exactly are the magical inductive biases of modern ML that make interpolation work so wel

...
Evan Hubinger (4y, 2 karma)
Neither, actually—it's more general than that. Belkin et al. show that it happens even for simple models like decision trees. Also see here for an example with polynomial regression. Yeah, I am. I definitely think that stuff is good, though ideally I want something more than just “approximately K-complexity.”
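For anyone who wants to poke at the polynomial regression case directly, here is a minimal sketch, not the linked example itself. It assumes a Legendre polynomial basis, a sine target, and a minimum-norm least-squares fit via the pseudoinverse; the function name `error_by_degree` and all default values are illustrative.

```python
import numpy as np
from numpy.polynomial import legendre


def error_by_degree(degrees, n_train=20, n_test=200, noise_std=0.1, seed=0):
    """Test MSE of a pseudoinverse polynomial fit for each degree in `degrees`."""
    rng = np.random.default_rng(seed)

    def target(x):
        return np.sin(np.pi * x)

    x_train = rng.uniform(-1.0, 1.0, n_train)
    y_train = target(x_train) + noise_std * rng.normal(size=n_train)
    x_test = np.linspace(-1.0, 1.0, n_test)

    errors = {}
    for d in degrees:
        phi_train = legendre.legvander(x_train, d)   # (n_train, d + 1) features
        # pinv gives the ordinary least-squares fit when d + 1 <= n_train and
        # the minimum-norm interpolating fit when d + 1 > n_train
        # (up to pinv's singular-value cutoff).
        w = np.linalg.pinv(phi_train) @ y_train
        pred = legendre.legvander(x_test, d) @ w
        errors[d] = float(np.mean((pred - target(x_test)) ** 2))
    return errors
```

Calling `error_by_degree(range(1, 200))` and plotting the result against degree is the experiment. The interpolation threshold sits at `d + 1 == n_train`; whether and how strongly the error comes back down past it depends on the basis, the noise level, and the seed, so it is worth trying several.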

Deriving bounds on the generalization error might seem pointless when it's easy to do this by just holding out a validation set. I think the main value is in providing a test of purported theories: your 'explanation' for why neural networks generalize ought to be able to produce non-trivial bounds on their generalization error.

I think there's more value to the exercise than just that. It may be less useful in the IID case with lots of data, where having a "validation set" makes sense, but there are many non-IID time series problems where effectively your

...
interstice (4y, 2 karma)
Not sure if I agree regarding the real-world usefulness. For the non-IID case, PAC-Bayes bounds fail, and to re-instate them you'd need assumptions about how quickly the distribution changes, but then it's plausible that you could get high probability bounds based on the most recent performance. For small datasets, the PAC-Bayes bounds suffer because they scale as $\sqrt{\mathrm{KL}/N}$. (I may edit the post to be clearer about this) Agreed that analyzing how the bounds change under different conditions could be insightful though. Ultimately I suspect that effective bounds will require powerful ways to extract 'the signal from the noise', and examining the signal will likely be useful for understanding if a model has truly learned what it is supposed to.
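To put rough numbers on the small-dataset point, here is a sketch using one standard form of the bound (McAllester's, in the version tightened by Maurer, for losses in [0, 1]): with probability at least 1 - δ, the true risk exceeds the empirical risk by at most sqrt((KL(Q‖P) + ln(2√N/δ)) / (2N)). The function name `pac_bayes_gap` and the example values are illustrative; the post under discussion may use a different variant.

```python
import math


def pac_bayes_gap(kl: float, n: int, delta: float = 0.05) -> float:
    """Upper bound on (true risk - empirical risk) for a loss in [0, 1].

    kl    : KL(Q || P) between posterior and prior, in nats
    n     : number of training examples
    delta : allowed failure probability of the bound
    """
    if n <= 0 or kl < 0 or not 0.0 < delta < 1.0:
        raise ValueError("need n > 0, kl >= 0 and 0 < delta < 1")
    return math.sqrt((kl + math.log(2.0 * math.sqrt(n) / delta)) / (2.0 * n))
```

With `kl=100`, the gap at `n=100` is about 0.73, which says almost nothing for a loss bounded by 1, while at `n=100_000` it drops to roughly 0.02. The bound only becomes informative once N is large relative to the KL term.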

I'll take a crack at this.

To a first order approximation, something is a "big deal" to an agent if it causes a "large" swing in its expected utility.
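As a toy illustration of that definition, assuming the agent's beliefs are a discrete distribution over outcomes with known utilities. The names `expected_utility`, `swing`, and `is_big_deal`, and the `threshold` standing in for "large", are all hypothetical.

```python
def expected_utility(outcomes):
    """Expected utility of a list of (probability, utility) pairs."""
    total_p = sum(p for p, _ in outcomes)
    if abs(total_p - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(p * u for p, u in outcomes)


def swing(before, after):
    """Absolute change in expected utility caused by an event."""
    return abs(expected_utility(after) - expected_utility(before))


def is_big_deal(before, after, threshold):
    # 'threshold' is whatever the agent counts as a "large" swing.
    return swing(before, after) >= threshold
```

For example, if an event moves the agent from a 50% chance of a utility-10 outcome to a 10% chance, expected utility falls from 5 to 1, a swing of 4. Whether that counts as a big deal depends entirely on the scale the agent's other decisions operate at.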

do you think any reasonable extension of these kinds of ideas could get what we want?

Conditional on avoiding Goodhart, I think you could probably get something that looks a lot like a diamond maximiser. It might not be perfect: the situation with the "most diamond" might not be the maximum of its utility function, but I would expect the maximum of its utility function will still contain a very large amount of diamond. For instance, depending on the representation, and the way the programmers baked in the utility function, it might have a quirk in its

...

Do you think we could build a diamond maximizer using those ideas, though?

They're definitely not sufficient, almost certainly. A full fledged diamond maximizer would need far more machinery, if only to do the maximization and properly learn the representation.

The concern here is that the representation has to cleanly demarcate what we think of as diamonds.

I think this touches on a related concern, namely Goodharting. If we even slightly misspecify the utility function at the boundary and the AI optimizes in an unrestrained fashion, we'll end up wit

...
Alex Turner (4y, 2 karma)
Clarification: I meant (but inadequately expressed) "do you think any reasonable extension of these kinds of ideas could get what we want?" Obviously, it would be a quite unfair demand for rigor to demand whether we can do the thing right now. Thanks for the great reply. I think the remaining disagreement might boil down to the expected difficulty of avoiding Goodhart here. I do agree that using representations is a way around this issue, and it isn't the representation learning approach's job to simultaneously deal with Goodharting.

I'm personally far more optimistic about ontology identification. Work in representation learning, blog posts such as OpenAI's sentiment neuron, and style transfer, all indicate that it's at least possible to point at human level concepts in a subset of world models. Figuring out how to refine these learned representations to further correspond with our intuitions, and figuring out how to rebind those concepts to representations in more advanced ontologies are both areas that are neglected, but they're both problems that don't seem fundamentally intractabl

...
Alex Turner (4y, 2 karma)
I wasn't aware of that work, thanks for linking! It's true that we don't have to specify the representation; instead, we can learn it. Do you think we could build a diamond maximizer using those ideas, though? The concern here is that the representation has to cleanly demarcate what we think of as diamonds, if we want the optimal policy to entail actually maximizing diamonds in the real world. This problem tastes like it has a bit of that 'fundamentally intractable' flavor.

Under this view, alignment isn’t a property of reward functions: it’s a property of a reward function in an environment. This problem is much, much harder: we now have the joint task of designing a reward function such that the best way of stringing together favorable observations lines up with what we want. This task requires thinking about how the world is structured, how the agent interacts with us, the agent’s possibilities at the beginning, how the agent’s learning algorithm affects things…

I think there are ways of doing this that don't involve exp

...
Alex Turner (4y, 2 karma)
I feel somewhat pessimistic about doing this robustly enough to scale to AGI. From an earlier comment of mine:

I think this is a good sign: this paper goes over many of the ideas that the RatSphere has discussed for years, and DeepMind is giving those ideas publicity. It also brings up preliminary solutions, of which "Model Based Rewards" seems to go farthest in the right direction. (Although even the paper admits the idea's been around since 2011.)

However, the paper is still phrasing things in terms of additive reward functions, which don't really naturally capture many kinds of preferences (such as those over possible worlds). I also feel that the causal influence

...
Tom Everitt (4y, 4 karma)
Thanks for the Dewey reference, we'll add it.