Yeah I think things like this are reasonable. I think that these are maybe too hard and high-level for a lot of the things I care about--I'm really interested in questions like "how much less reliable is the model about repeating names when the names are 100 tokens in the past instead of 50", which are much simpler and lower level.
Yeah I wrote an interface like this for personal use, maybe I should release it publicly.
I expect that people will find it pretty obvious that RLHF leads to somewhat misaligned systems, if they are widely used by the public. Like, I think that most ML researchers agree that the Facebook Newsfeed algorithm is optimizing for clicks in a way people are somewhat unhappy about, and this is based substantially on their personal experience with it; inasmuch as we’re interacting a lot with sort-of-smart ML systems, I think we’ll notice their slight misalignment. And so I do think that this will make AI takeover risk more obvious.
Examples of small AI catastrophes will also probably make takeover risk more obvious.
I guess another example of this phenomenon is that a bunch of people are more worried about AI takeover than they were five years ago, because they’ve seen more examples of ML systems being really smart, even though they wouldn’t have said five years ago that ML systems could never solve those problems. Seeing the ways things happen is often pretty persuasive to people.
Are any of these ancient discussions available anywhere?
In hindsight this is obviously closely related to what paul was saying here: https://ai-alignment.com/mundane-solutions-to-exotic-problems-395bad49fbe7
Another way of saying some of this: Suppose your model can gradient hack. Then it can probably also make useful-for-capabilities suggestions about what its parameters should be changed to. Therefore a competitive alignment scheme needs to be robust to a training procedure where your model gets to pick new parameters for itself. And so competitive alignment schemes are definitely completely fucked if the model wants to gradient hack.
[epistemic status: speculative]
A lot of the time, we consider our models to be functions from parameters and inputs to outputs, and we imagine training the parameters with SGD. One notable feature of this setup is that SGD isn’t by default purposefully trying to kill you--it might find a model that kills you, or a model that gradient hacks and then kills you, but this is more like incompetence/indifference on SGD’s part, rather than malice.
A plausible objection to this framing is that much of the knowledge of our models is probably going to be produced in other ways than SGD. For example, the models might write down various notes (in natural language or in neuralese) that they then read later, and they might have internal structures like economies that produce and consume information. Does this introduce new alignment problems?
Here’s a way I’ve been thinking about this recently. I’m writing this in way that might feel obnoxiously overwrought because this is the way that I think would have conveyed my current intuition to me two months ago.
In SGD, we update our weights by something like:
weights <- weights + alpha * (d loss/d weights)
You might think that this is fundamental. But actually it’s just a special case of the more general life rule:
do something that seems like a good idea, based on the best available estimate of what's a good idea
Imagine that you want a good language model, and you randomly initialize it and now you’re sitting at your computer unsure what to do next, because (as a totally random model) it’s totally useless to you. What’s a good idea to do?
Your best idea is to slightly perturb the model based on a local estimate of which direction in parameter space will reduce log loss, evaluated on a random batch. So you take that derivative and update your weights with it.
And then you do this over and over again, basically because you don’t have any better ideas for what to do.
(I think that the “we’re just trying to do something that seems good based on our best guess” perspective is also an intuition that I was missing for why adversarial training is good, or why using a more conservative oversight process during adversarial training might be good. We’re just trying to make an informed guess based on our available estimate about what parameter update will maximize goodness, and we think that the gradient of a conservative loss evaluated on adversarially chosen data might be a useful contributor to that best-guess parameter update.)
If you were training an AGI instead of a dumb LM, the situation would be different, because when you have a partially trained AGI, your sense of what’s a good idea might have materially changed, because the partially trained AGI can contribute to your effort to estimate what’s a good idea to do next.
The simplest case of this is that you use your AGI when you’re calculating the loss for an answer. This is the basic idea behind IDA/debate--during training, your sense of how good an answer is gets better for the obvious reason. And if your model can deceive you at this point, you’ll maybe pick bad losses, then get bad gradient estimates, then maybe eventually get paperclipped.
But we can also do more complicated things than estimating a loss. For example, at some point our AI might know something about what changes to its parameters would lead to outcomes that we liked better, or that would lead to more paperclips. At this point, we need to be able to access the model’s opinions about parameter changes that will lead to good outcomes for us (or else we’re not competitive), and we also want to avoid the model using its knowledge of parameter changes that would lead to paperclips to gradient hack.
(Just to rephrase/reiterate a point in that last paragraph that I hadn’t thought of before: If your model is able to gradient hack, then it’s apparently better than SGD at at least one aspect of guessing what parameter changes will lead to what long-term outcomes. And so for competitiveness reasons, you might need your weight update rule to be “ask the model for its favorite new choice of weights (perhaps giving it the loss gradient as an input that it can use as it wishes) and use its suggestion”. And if the AI is misaligned at this point, you are obviously completely fucked re gradient hacking. So it is kind of missing the point to argue about whether the model practically is able to gradient hack just via choosing actions.)
And so the hope for competitive alignment has to go via an inductive property--you’re using all the model’s relevant knowledge to update your model at every step, and so everything is going as fast as possible and maintaining alignment.
And this setup is basically the same for any other mechanism via which your AI might influence its future behavior, including writing notes-to-self or having some global memory bank or whatever.
And so in conclusion:
Something I think I’ve been historically wrong about:A bunch of the prosaic alignment ideas (eg adversarial training, IDA, debate) now feel to me like things that people will obviously do the simple versions of by default. Like, when we’re training systems to answer questions, of course we’ll use our current versions of systems to help us evaluate, why would we not do that? We’ll be used to using these systems to answer questions that we have, and so it will be totally obvious that we should use them to help us evaluate our new system.Similarly with debate--adversarial setups are pretty obvious and easy.In this frame, the contributions from Paul and Geoffrey feel more like “they tried to systematically think through the natural limits of the things people will do” than “they thought of an approach that non-alignment-obsessed people would never have thought of or used”.It’s still not obvious whether people will actually use these techniques to their limits, but it would be surprising if they weren’t used at all.
Take an EfficientNet model with >= 99% accuracy on MNIST digit classification. What is the largest possible change in the probability assigned to some class between two images, which differ only in the least significant bit of a single pixel? Prove your answer before 2023.
You aren't counting the fact that you can pretty easily bound this based on the fact that image models are Lipschitz, right? Like, you can just ignore the ReLUs and you'll get an upper bound by looking at the weight matrices. And I believe there are techniques that let you get tighter bounds than this.
Am I correct that you wouldn't find a bound acceptable, you specifically want the exact maximum?