Obstacles to gradient hacking

The ultimate goal of this project is to exhibit a handcrafted set of model weights for a reasonably noncontrived model architecture such that, when the model is tuned with SGD, some malicious subnetwork learns a mesaobjective that we specified via some section of the model weights and that is completely different from the base objective, without the entire mesaoptimizer falling apart in the process. We haven't succeeded at this goal yet, but I would say the goal is very much to exhibit gradient hacking.

I don't think redundancy will work. Suppose you have some continuous everywhere, differentiable countably-almost everywhere combining function $y = C(f_1, f_2)$ that takes the outputs from two redundant copies of $f$ and outputs some kind of combined output. (If you're allowed functions that don't meet the "continuous everywhere, differentiable countably-almost everywhere" requirement, you might as well just skip the whole redundancy thing and just use a staircase.) Since this function prevents any gradients to $f_1$ and $f_2$ when they are equal, it must be that $\frac{\partial C}{\partial f_1} = \frac{\partial C}{\partial f_2} = 0$ at all points where $f_1 = f_2$. There should also exist at least some $a, b$ where $C(a, a) \neq C(b, b)$, since otherwise $C$ no longer depends on the pair of redundant networks at all, which means that those networks can't actually affect what the network does, which defeats the whole point of this in the first place.

Let us then define $g(x) = C(x, x)$. Then, by the chain rule, $\frac{dg}{dx} = \frac{\partial C}{\partial f_1}(x, x) + \frac{\partial C}{\partial f_2}(x, x) = 0$ for all $x$. This implies that $g$ is a constant function. Therefore, there do not exist $a, b$ where $C(a, a) \neq C(b, b)$. This is a contradiction, and therefore $C$ cannot exist.
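To make the tradeoff concrete, here is a rough numerical sketch (an illustration, not a proof). It checks two hypothetical combining functions with finite differences: one that blocks gradients whenever the two copies agree, and one that passes their shared value through. The names `squared_gap`, `average`, `partials`, and `diagonal_values` are made up for this example, and the two combiners are just simple candidates I picked, not anything from an actual gradient hacking setup.

```python
def partials(combine, x, y, h=1e-6):
    """Central-difference estimates of the two partial derivatives at (x, y)."""
    d_first = (combine(x + h, y) - combine(x - h, y)) / (2 * h)
    d_second = (combine(x, y + h) - combine(x, y - h)) / (2 * h)
    return d_first, d_second


def diagonal_values(combine, xs):
    """Values of the combiner when both redundant copies output the same x."""
    return [combine(x, x) for x in xs]


# Hypothetical candidate 1: zero gradient whenever the copies agree.
def squared_gap(f1, f2):
    return (f1 - f2) ** 2


# Hypothetical candidate 2: passes the shared output through unchanged.
def average(f1, f2):
    return 0.5 * (f1 + f2)


xs = [-2.0, -0.5, 0.0, 1.0, 3.0]

for name, combine in [("squared_gap", squared_gap), ("average", average)]:
    grads = [partials(combine, x, x) for x in xs]
    max_grad = max(max(abs(d1), abs(d2)) for d1, d2 in grads)
    distinct = len(set(diagonal_values(combine, xs)))
    print(f"{name}: max |partial| on diagonal = {max_grad:.3g}, "
          f"distinct diagonal values = {distinct}")
```

The first candidate has vanishing partials on the diagonal but takes a single value there, so the redundant copies cannot influence the output while they agree. The second depends on the shared output but leaks gradient to both copies. The argument above says you cannot get both properties at once.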

Call for research on evaluating alignment (funding + advice available)

I think this is something that many of us at EleutherAI, myself included, would be very interested in working on, since it seems like an area where we'd have a uniquely large comparative advantage.

One very relevant piece of infrastructure we've built is our evaluation framework, which we use for all of our evaluation since it makes it really easy to evaluate a task on GPT-2/3/Neo/NeoX/J etc. We also have a bunch of other useful LM-related resources, such as intermediate checkpoints for GPT-J-6B, which we are looking to use in our interpretability work. I've also thought about building some infrastructure to make it easier to coordinate the building of handmade benchmarks. This is currently on the back burner, but if it would be helpful for anyone I'd definitely get it going again.

If anyone reading this is interested in collaborating, please DM me or drop by the #prosaic-alignment channel in the EleutherAI Discord.