leogao

Working on alignment at EleutherAI

Implications of automated ontology identification

I agree that there will be cases where we have ontological crises where it's not clear what the answer is, i.e. whether the mirrored dog counts as "healthy". However, I feel like the thing I'm pointing at is that there is some sort of closure of any given set of training examples where, under some fairly weak assumptions, we can know that everything in this expanded set is "definitely not going too far". As a trivial example, anything that is a direct logical consequence of anything in the training set would be part of the closure. I expect any ELK solution to look something like that. This corresponds directly to the case where the ontology identification process converges to some set smaller than the set of all cases.

Implications of automated ontology identification

My understanding of the argument: if we can always come up with a conservative reporter (one that answers yes *only* when the true answer is yes), and this reporter can label at least one additional data point that we couldn't label before, then we can use the newly expanded dataset to pick a new reporter, feed this process back into itself ad infinitum to label more and more data, and the fixed point of iterating this process is a perfect oracle. This would imply the ability to solve arbitrary model splintering problems, which would either require a perfect understanding of human value extrapolation baked into the process somehow, or imply that such an automated ontology identification process is not possible.
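The loop in this argument can be sketched in a few lines. Everything below is a toy of my own construction (names like `train_conservative_reporter` and `toy_trainer` are hypothetical), just to make the fixed-point structure concrete:

```python
# Toy sketch of the bootstrapping loop: a conservative reporter never says
# "yes" incorrectly, so any new "yes" answers it gives on unlabeled points
# can be trusted and folded back into the training set.

def bootstrap_labels(labeled, unlabeled, train_conservative_reporter):
    """Iterate reporter training until no new point can be labeled."""
    labeled = dict(labeled)            # {example: True} -- trusted yes-labels
    unlabeled = set(unlabeled)
    while True:
        reporter = train_conservative_reporter(labeled)
        newly_labeled = {x for x in unlabeled if reporter(x)}
        if not newly_labeled:          # fixed point reached
            return labeled
        for x in newly_labeled:
            labeled[x] = True
        unlabeled -= newly_labeled

# Hypothetical demo: the true answer is "x < 7", and each round the trained
# reporter can conservatively label exactly one more point than before.
def toy_trainer(labeled):
    frontier = max(labeled)
    return lambda x: x == frontier + 1 and x < 7

final = bootstrap_labels({0: True}, range(1, 10), toy_trainer)
```

The claim being examined is that the fixed point of this loop, run with a real reporter-training procedure, would be a perfect oracle, and the worry is about what that would have to imply.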

Personally, I think that it's possible that we don't really need a lot of understanding of human values beyond what we can extract from the easy set to figure out "when to stop." There seems to be a pretty big set where it's totally unambiguous what it means for the diamond to be there, and yet humans are unable to label it, and to me this seems like the main set of things we're worried about with ELK.

Should we rely on the speed prior for safety?

A GLUT can have constant time complexity if implemented as a hash table, which makes it a lot less clear that metalearning can be faster.
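For example (my own toy, not from the post): a GLUT stored as a hash table answers any query in a single average-O(1) lookup, no matter how many entries it has.

```python
# A giant lookup table (GLUT) as a hash table: answering a query is one
# average-O(1) dict lookup, independent of the number of entries.
glut = {
    ("state_a", "obs_1"): "action_x",
    ("state_a", "obs_2"): "action_y",
    ("state_b", "obs_1"): "action_z",
}

def glut_policy(state, obs):
    return glut[(state, obs)]
```

So the GLUT pays an enormous space cost, but not an obvious time cost relative to a program that computes its answers.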

Understanding Gradient Hacking

From a zoomed-out perspective, the model is not modifying the loss landscape. This frame, however, does not give us a useful way of thinking about how gradient hacking might occur and how to avoid it.

I think that the main value of the frame is to separate out the different potential ways gradient hacking can occur. I've noticed that in discussions without this distinction, it's very easy to equivocate between the types, which leads to frustrating conversations where people fundamentally disagree without realizing it (e.g. one person might be talking about some mins of the base objective being dangerous, while the other is talking about the model actually breaking SGD and avoiding mins entirely). Illusion of transparency and all that.

For instance, when you mention that "[continuously failing] gradient hacking will in general lead to worse performance on the base-objective" -- this is only true for *nonconvergent* gradient hackers! I can see where you're coming from; most mesaobjectives have mins that don't really line up with the base objective, and so a lot of the time moving towards a min of the mesaobjective hurts performance on the base objective. In some sense, convergent gradient hackers are those whose mins happen to line up with the mins of the base objective. However, even though in some intuitive sense convergent mesaobjectives are very sparse in the space of all mesaobjectives, it's very likely that convergent mesaobjectives are the ones we think are actually likely to exist in the real world (as a data point, convergent gradient hackers are what Evan considers "gradient hacking").

In fact, I'd argue that coupling, assuming for the sake of argument that it were possible, would imply a convergent gradient hacker (the reverse implication is not necessarily true). A rough and not very rigorous sketch of the argument: if the model remains coupled across training, then at any point in training, any epsilon-sized change which hurts the mesaobjective must also hurt the base objective; thus the inner product of the mesa gradient and base gradient must always be nonnegative; thus if SGD converges, it must converge to a min or saddle point of the mesaobjective too (whereas the hallmark of nonconvergent gradient hackers is that they manage to find a min of the mesaobjective somewhere the base gradient is *nonzero*).
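A compact way to write down the middle step of that sketch (notation mine: $\theta$ for the parameters, $L_b$ for the base objective, $L_m$ for the mesaobjective):

```latex
\underbrace{\text{every } \delta\theta \text{ that hurts } L_m \text{ also hurts } L_b}_{\text{coupling}}
\;\Longrightarrow\;
\big\langle \nabla_\theta L_m(\theta),\; \nabla_\theta L_b(\theta) \big\rangle \;\ge\; 0
\quad \text{everywhere along the training trajectory.}
```

Combined with SGD converging on the base objective, the claim is that the limit point must then be a critical point of $L_m$ as well.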

A positive case for how we might succeed at prosaic AI alignment

My attempt at a one sentence summary of the core intuition behind this proposal: if you can be sure your model isn’t optimizing for deceiving you, you can relatively easily tell if it’s trying to optimize for something you don’t want by just observing whether your model seems to be trying to do something obviously different from what you want during training, because it's much harder to slip under the radar by getting really lucky than by intentionally trying to.

Meta learning to gradient hack

Here's a hand-crafted way of doing gradient protection in this case I can think of: since these models are blocks of linear->bn(affine)->relu, if you make the beta in the affine slightly negative, you can completely zero out the output of that block, and then the rest of the model can only learn a constant function. You can also get around L2: just set, e.g., gamma to 0.00001 and beta to -0.01; this lets you have both really small parameter magnitudes and still saturate the relu. As this model is trained on the base objective, it should converge to the constant function f(x) = 0.
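A minimal numpy sketch of that construction (my own toy, not the post's code): with gamma = 1e-5 and beta = -0.01, every pre-activation sits just below zero, so the relu saturates and the block outputs exactly zero, while both affine parameters stay tiny under L2.

```python
import numpy as np

def block(x, w, gamma=1e-5, beta=-0.01, eps=1e-5):
    """linear -> batchnorm(affine) -> relu, with the affine set to saturate."""
    h = x @ w                                        # linear
    h = (h - h.mean(0)) / np.sqrt(h.var(0) + eps)    # batchnorm: ~zero mean, unit var
    h = gamma * h + beta                             # affine: |gamma * h| << |beta|
    return np.maximum(h, 0.0)                        # relu: all inputs < 0, so output is 0

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 8))
w = rng.standard_normal((8, 8))
out = block(x, w)   # all zeros -> no gradient flows back through the relu
```

Because the normalized activations have roughly unit scale, |gamma * h| is on the order of 1e-5, far smaller than |beta| = 0.01, so the pre-activations are negative for any realistic input batch.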

I think it would be really useful to look at the output after each relu, as well as the affine weights in the bns, to see if this is happening. If that doesn't show all-zeros anywhere, we can also look at the magnitude of the gradient and the eigenvalues of the Hessian (these models are small so it should be cheap to compute) in both the meta-optimized and non-meta-optimized models, to get an idea of whether the model is at a critical point and whether it's a min or saddle.
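As a sketch of the kind of diagnostic I mean (hypothetical code; finite differences are fine at this scale): compute the gradient norm to check for a critical point, then the signs of the Hessian's eigenvalues to distinguish a min (all nonnegative) from a saddle (mixed signs).

```python
import numpy as np

def numerical_grad(loss, theta, h=1e-5):
    """Central-difference gradient of a scalar loss."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = h
        g[i] = (loss(theta + e) - loss(theta - e)) / (2 * h)
    return g

def numerical_hessian(loss, theta, h=1e-4):
    """Central-difference Hessian, built column by column from the gradient."""
    n = theta.size
    H = np.zeros((n, n))
    for i in range(n):
        e = np.zeros(n)
        e[i] = h
        H[:, i] = (numerical_grad(loss, theta + e) - numerical_grad(loss, theta - e)) / (2 * h)
    return H

# Demo on a known saddle: f(x, y) = x^2 - y^2 has a saddle point at the origin.
loss = lambda t: t[0] ** 2 - t[1] ** 2
theta = np.zeros(2)
grad_norm = np.linalg.norm(numerical_grad(loss, theta))      # ~0: critical point
eigs = np.linalg.eigvalsh(numerical_hessian(loss, theta))    # mixed signs: saddle
```

For the real (meta-optimized) models you'd swap in the actual loss over the flattened parameter vector; at a few dozen parameters this is still cheap.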

I think that if the relu saturation thing is really what's happening, then it's a pretty easy-to-deal-with case of gradient hacking. With relu saturation you can either pass information through (and subject the model to gradients) or pass no information through (and protect the model). This is pretty useless in practice, because it means you can only protect parts of the model you aren't using for anything in training. Things get a bit more interesting if you use sigmoids or something else with distinct saturation regions, since then you can pass a single bit of information through per neuron while still protecting, though nobody uses sigmoid-like activations anymore (and the model can't just learn one, since that would then be changeable by the gradient).

Obstacles to gradient hacking

The ultimate goal of this project is to exhibit a handcrafted set of model weights for a reasonably noncontrived model architecture which, when tuned with SGD, results in some malicious subnetwork in the model learning a mesaobjective, specified via some section of the model weights, that is completely different from the base objective, without the entire mesaoptimizer falling apart in the process. We haven't succeeded at this goal yet, but I would say this goal very much is to exhibit gradient hacking.

I don't think redundancy will work. Suppose you have some continuous-everywhere, differentiable-countably-almost-everywhere combining function g(x, y) that takes the outputs x and y from the two redundant copies of the protected network and outputs some kind of combined output. (If you're allowed functions that don't meet the "continuous everywhere, differentiable countably-almost everywhere" requirement, you might as well just skip the whole redundancy thing and just use a staircase.) Since this function prevents any gradients to x and y when they are equal, it must be that at all points where x = y, ∂g/∂x = ∂g/∂y = 0. There should also exist at least some a ≠ b where g(a, a) ≠ g(b, b), since otherwise g no longer depends on the pair of redundant networks at all, which means those networks can't actually affect what the model does, which defeats the whole point of this in the first place.

Let us then define h(x) = g(x, x). Then h'(x) = ∂g/∂x(x, x) + ∂g/∂y(x, x) = 0 for all x. This implies that h is a constant function. Therefore, there do not exist a, b where g(a, a) ≠ g(b, b). This is a contradiction, and therefore such a g cannot exist.
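In symbols (writing the combining function as g(x, y) and its diagonal as h(x) = g(x, x)):

```latex
\partial_x g\big|_{x=y} \;=\; \partial_y g\big|_{x=y} \;=\; 0
\;\Longrightarrow\;
h'(x) \;=\; \partial_x g(x, x) + \partial_y g(x, x) \;=\; 0 \ \ \text{for all } x
\;\Longrightarrow\;
h \equiv \text{const},
```

which contradicts the requirement that g(a, a) ≠ g(b, b) for some a ≠ b.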

Call for research on evaluating alignment (funding + advice available)

I think this is something I and many others at EleutherAI would be very interested in working on, since it seems like something that we'd have a uniquely big comparative advantage at.

One very relevant piece of infrastructure we've built is our evaluation framework, which we use for all of our evaluation since it makes it really easy to evaluate your task on GPT-2/3/Neo/NeoX/J etc. We also have a bunch of other useful LM related resources, like intermediate checkpoints for GPT-J-6B that we are looking to use in our interpretability work, for example. I've also thought about building some infrastructure to make it easier to coordinate the building of handmade benchmarks—this is currently on the back burner but if this would be helpful for anyone I'd definitely get it going again.

If anyone reading this is interested in collaborating, please DM me or drop by the #prosaic-alignment channel in the EleutherAI discord.

Looking forward to seeing the survey results!

By the way, if you're an alignment researcher and compute is your bottleneck, please send me a DM. EleutherAI already has a lot of compute resources (as well as a great community for discussing alignment and ML!), and we're very interested in providing compute for alignment researchers with minimal bureaucracy required.