I’m looking for alignment techniques that are indefinitely scalable and that work in any situation we can dream up. That means I spend time thinking about “exotic” problems — like AI systems reasoning about their own training process or about humanity’s far future.

Yet I’m very optimistic about finding practical alignment strategies that are relatively simple and usable today. I expect the results of my research to look mundane and even boring.

These two stances may appear to be in tension. If I’m worrying about all kinds of exotic considerations, how can the end result be something mundane?

I’m not too concerned. These failures seem exotic because they involve exotic capabilities. I’m looking for a mundane algorithm that trains the model to use whatever capabilities it has for good purposes. So it will lead to exotic countermeasures if and when the model has exotic capabilities, i.e. just in time to avoid exotic failures (by which time they’ll no longer feel exotic).

I want to illustrate this phenomenon with a few examples.

Example 1: Gradient hacking

Gradient hacking is a possible failure mode where a powerful misaligned model behaves in a way that is apparently benign but biases the gradient of the loss to point in an unfavorable direction, eventually trapping optimization at a bad local minimum. This feels exotic today because it requires the model to have an incredible amount of introspective access and control, as well as strategic awareness over long time horizons. (“Forward Pass” by Andrej Karpathy explores some similar ideas.)

Normally I imagine using SGD to optimize a “behavioral” objective: to update the parameters θ, you use them to make a randomized decision y on an input x, then you calculate a reward r(x, y), then you compute an unbiased estimate of the gradient of the expected reward E[r(x, y)] with respect to θ.
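One way to make this behavioral objective concrete is a score-function (REINFORCE-style) estimator. The post does not name a particular estimator, so treat this as a minimal sketch under assumptions: the two-action logistic policy, the scalar θ, and the names `policy_prob`, `score_function_estimate`, and `exact_gradient` are all hypothetical and chosen only for illustration.

```python
import math
import random

def policy_prob(theta, x):
    """Probability of choosing y = 1 on input x (assumed logistic policy)."""
    return 1.0 / (1.0 + math.exp(-theta * x))

def score_function_estimate(theta, x, reward, rng):
    """Single-sample unbiased estimate of d/dtheta E[r(x, y)]."""
    p = policy_prob(theta, x)
    y = 1 if rng.random() < p else 0   # randomized decision y
    r = reward(x, y)                   # reward r(x, y)
    grad_log_prob = (y - p) * x        # d/dtheta log pi_theta(y | x)
    return r * grad_log_prob

def exact_gradient(theta, x, reward):
    """Closed-form gradient for this two-action policy, for comparison."""
    p = policy_prob(theta, x)
    return p * (1.0 - p) * x * (reward(x, 1) - reward(x, 0))

if __name__ == "__main__":
    rng = random.Random(0)
    reward = lambda x, y: 1.0 if y == 1 else 0.0  # hypothetical reward
    samples = [score_function_estimate(0.5, 1.0, reward, rng) for _ in range(20000)]
    print(sum(samples) / len(samples), exact_gradient(0.5, 1.0, reward))
```

Averaging many single-sample estimates should approach the closed-form gradient, which is what "unbiased" means here. Note that the estimator only ever looks at behavior (the sampled y and its reward), never at the parameters directly.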

Gradient hacking is troubling because it suggests that if we optimize a purely behavioral objective we could get a bad outcome even if the reward function r(x, y) perfectly captures the quality of the output y.

Rather than fixing this problem in a specific way, I’d like to rely on competitiveness. That is, assume that (throughout training) we have a question-answering policy Q → A which reflects everything that our learned model “knows” about the world.

Then it seems like the best way to compute the system’s loss should be by asking some question like “What should the loss be for the current parameters?”

Our usual behavioral loss then is a great heuristic for computing the loss (which we can justify on principled Bayesian grounds if we want). In mundane situations it’s not clear we have anything further to say about what the loss should be, so “What should the loss be?” is almost the same as hard-coding the heuristic.

But when gradient hacking occurs, we are no longer in a mundane situation. By hypothesis, our learned model is able to reason introspectively about its own parameters and the structure of the loss landscape! It is thinking about how to change its behavior in order to affect the loss, change the optimization trajectory, and ultimately disempower humans.

That means that a competitive question-answerer also is able to answer questions about its own parameters and the structure of the loss landscape. It is able to talk about how assigning a loss could mess up the optimization trajectory and ultimately disempower humans. If we incorporate all that knowledge into our assessment of “What should the loss be for the current parameters?” then it will assign a low loss to some of the actions that the gradient-hacker tried to sabotage. That is, the malicious AI can outsmart the behavioral heuristic, but it can’t outsmart an overseer who knows everything that it knows.
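A toy picture of this argument, and emphatically not the algorithm the post has in mind: suppose the parameters are a single number, the loss we would endorse is a simple bowl around some good parameters, and the gradient hacker adds a "bump" of bad behavior that traps descent on the behavioral loss. If the overseer knows about the bump (because the model does), it can assign the loss without it. Every function and constant below (`THETA_STAR`, `sabotage`, `overseer_loss`, the bump shape) is a hypothetical stand-in.

```python
import math

THETA_STAR = 2.0  # hypothetical "much better" parameters

def intended_loss(theta):
    """The loss we would endorse if we knew everything the model knows."""
    return (theta - THETA_STAR) ** 2

def sabotage(theta, height=5.0, center=1.0, width=0.2):
    """Extra loss the gradient hacker induces by behaving badly near `center`."""
    return height * math.exp(-((theta - center) ** 2) / (2 * width ** 2))

def behavioral_loss(theta):
    """What a purely behavioral heuristic sees: intended loss plus the sabotage."""
    return intended_loss(theta) + sabotage(theta)

def overseer_loss(theta):
    """Assumes the overseer knows about the sabotage and discounts it exactly."""
    return behavioral_loss(theta) - sabotage(theta)

def descend(loss, theta=0.0, lr=0.01, steps=2000, eps=1e-5):
    """Plain gradient descent using a central-difference gradient."""
    for _ in range(steps):
        grad = (loss(theta + eps) - loss(theta - eps)) / (2 * eps)
        theta -= lr * grad
    return theta

if __name__ == "__main__":
    print("behavioral:", descend(behavioral_loss))
    print("overseer:  ", descend(overseer_loss))
```

In this toy, descent on the behavioral loss stalls before the bump, while descent on the overseer's loss reaches `THETA_STAR`. The hard part, which the sketch assumes away, is getting an overseer that actually knows what the model knows.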

It’s not clear if this works (and my strong guess is that we’ll ultimately end up with an algorithm that looks quite different from this). But overall I think that if gradient hacking actually came up, it’s more likely than not that the core difficulty in avoiding it would be the same version of epistemic competitiveness that we need in more mundane situations. The only real change to our algorithm we need to make is applying a competitive overseer to the right question.

You might worry that the knowledge a model has about its own parameters is qualitatively different than the knowledge it has about the world. But I don’t think so — they are both produced by transfer from the task of “get a low training loss,” combined with a bunch of computation. It’s possible that after we have an epistemically competitive solution we’ll see that it doesn’t apply to a model’s introspective knowledge. If that happens then we may indeed need some more exotic solution that talks about introspection per se, but I personally doubt it.

Example 2: Far future forecasts and inaccessible influence

Suppose that there’s some domain that our AI can understand but we cannot. For vividness, you can imagine humans without microscopes and an AI that has a working model of the microscopic world. For simplicity, imagine that the world factors nicely into an accessible part and an inaccessible part, and that there are influences in both directions but to us the (inaccessible) → (accessible) influences just look like noise. (If you don’t know about the mechanics of biology, then events like “getting cancer” can just look like noise.)
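To make "just look like noise" concrete, here is a small sketch under assumptions. It models only the (inaccessible) → (accessible) direction, uses a deterministic logistic map as a stand-in for the inaccessible dynamics, and compares a predictor that sees only the accessible state with one that also sees the hidden state. All names and constants are hypothetical.

```python
def step_hidden(h):
    """Inaccessible dynamics: deterministic, but erratic if you cannot observe h."""
    return 4.0 * h * (1.0 - h)

def simulate(steps=2000, a0=0.0, h0=0.3):
    """Return (a, h, a_next) triples for an accessible state a driven by hidden h."""
    a, h = a0, h0
    history = []
    for _ in range(steps):
        a_next = 0.9 * a + (h - 0.5)   # inaccessible -> accessible influence
        history.append((a, h, a_next))
        a, h = a_next, step_hidden(h)
    return history

def predict_accessible_only(a, h):
    """Observer without access to h: the hidden term is treated as zero-mean noise."""
    return 0.9 * a

def predict_with_inaccessible(a, h):
    """Observer (e.g. the AI) that also models the hidden state."""
    return 0.9 * a + (h - 0.5)

def mean_squared_error(history, predict):
    errors = [(predict(a, h) - a_next) ** 2 for a, h, a_next in history]
    return sum(errors) / len(errors)

if __name__ == "__main__":
    history = simulate()
    print("accessible only:  ", mean_squared_error(history, predict_accessible_only))
    print("with inaccessible:", mean_squared_error(history, predict_with_inaccessible))
```

The observer limited to the accessible part is left with an irreducible residual, while the observer that models the hidden state predicts the same events exactly. Nothing in the world is actually random here; the "noise" is purely a fact about what the observer can see.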

One of my concerns is that an AI that understands the inaccessible part may be able to cause trouble in the very long term. Even if we’re able to prevent the AI from doing anything overtly bad in the accessible part for the next few years (like killing humans), it may still be that humans aren’t really in control of the situation because the real power is in the inaccessible part of the world, and that power will eventually assert itself.

So if we can only observe the accessible part of the world, then we would need to look very far ahead to avoid the problem. Basically we’d need to look far enough ahead that the future humans have done science and learned about the previously-inaccessible parts of the world. This kind of lookahead feels particularly exotic because it involves extreme generalization. Training typically involves feedback loops over seconds, but now we are talking about a model that makes good predictions about events that literally occur generations from now.

That kind of extreme lookahead sounds exotic, but in some cases my current alignment plans require it.

As in the last example, I think this situation is only exotic because our model had exotic capabilities. We’re in this situation because we are trying to handle a malicious AI that makes plans that have no observable consequences for many generations, building up power in the inaccessible part of the world that it predicts will eventually interfere with humans. That AI is already apparently able to generalize well to extremely long time horizons — if it wasn’t, we’d have no problem at all.

So once again, to avoid this problem we really just need epistemically competitive question answering, i.e. we need to learn a Q → A function that generalizes as well to novel questions (e.g. about the future) as the underlying model’s knowledge. And that’s a pretty mundane problem that’s core to the idea of epistemic competitiveness that we need to solve for other reasons.

If the malicious model is able to generalize from its experiences to make plans that unfold over generations, and the Q → A model generalizes as well as the malicious model, then our Q → A system can also answer questions about what happens over generations. That only sounds exotic because we assumed some exotic capabilities. Nothing about the solution or training needs to deal with ultra-long horizons specifically.


In my work I don’t shy away from exotic problems (I often find them useful as extreme cases to illustrate some principle). At the same time, I’m aiming for mundane solutions and optimistic about finding them.

I think those positions are consistent because my intermediate goal is to ensure that the oversight process is able to leverage all of the capabilities developed by the model — so if the model develops exotic capabilities which pose exotic challenges, then we get an exotic oversight process automatically.

Mundane solutions to exotic problems was originally published in AI Alignment on Medium.

FYI: I would find it useful if you said somewhere what 'epistemic competitiveness' means and linked to it when using the term.

I assume the right pointer is ascription universality.

Your link is broken.

For reference, the first post in Paul's ascription universality sequence can be found here (also Adam has a summary here).

Sorry about that. I corrected it but it was indeed the first link you gave.

Planned summary for the Alignment Newsletter:

The author’s goal is to find “mundane” or simple algorithms that solve even “exotic” problems in AI alignment. Why should we expect this is possible? If an AI system is using powerful, exotic capabilities to evade detection, shouldn’t we need powerful, exotic algorithms to fight that? The key idea here is that we can instead have a mundane algorithm that leverages the exotic capabilities of the AI system to produce an exotic oversight process. For example, we could imagine that a mundane algorithm could be used to create a question-answerer that knows everything the model knows. We could then address <@gradient hacking@>(@Gradient hacking@) by asking the question “what should the loss be?” In this case, our model has an exotic capability: very strong introspective access to its own reasoning and the training process that modifies it. (This is what is needed to successfully hack gradients). As a result, our question answerer should be able to leverage this capability to assign high loss (low reward) to cases where our AI system tries to hack gradients, even if our normal hardcoded loss would not do so.

leverage this capability to assign high loss (low reward) to cases where our AI system tries to hack gradients

This seems slightly confusing/unclear: I'm not imagining penalizing the model for trying to hack the gradients, I'm imagining changing the loss in a way that blocks the attempted gradient hacking. E.g. the model knows that parameters θ are in the direction of more aligned models, and it could hijack the training process by ensuring that θ gets a high loss. So it tries to behave badly when its own parameters are θ, trying to prevent gradient descent from converging to much better parameters θ*. But then the overseer knows that it wants training to move towards parameters θ, in order to ultimately reach much better parameters θ*, so it assigns θ a low loss (rather than being fooled by the behavioral heuristic giving them a high loss).

In this case no model is getting penalized for anything, and in particular we aren't trying to stop gradient descent from reaching parameters that try to gradient hack. We are just stepping in when the malicious model tries to "outsmart" the behavioral heuristic in order to manipulate training.

Ah, I see, that makes sense. I had in fact misunderstood what you were saying here.

That being said, why not penalize the model when it engages in gradient hacking, so that it becomes less likely to do it in the future?

It's not as clear to me that epistemic competitiveness + a normal training loop suffices to penalize gradient hacking. Would need to think about it more. Whereas I feel reasonably comfortable saying that if the model is able to alter its behavior because it believes that it will make SGD work worse, then a competitive overseer is able to use the same information to make SGD work better. (Though I can't be that comfortable about anything given how shaky the abstractions are etc., and mostly expect to revisit with a clearer sense of what epistemic competitiveness means.)