Will transparency help catch deception? Perhaps not

Matthew Barnett

In the last few months, some AI safety researchers have signaled what Wei Dai called a "wave of optimism" for using transparency tools to catch deceptive behavior in our machine learning models. By a transparency tool, I mean any technique that allows us to peek inside a model to get a better idea of what goes on inside of it. This post outlines one reason to be pessimistic about the usefulness of using transparency tools to catch advanced deception.

First, I construct a model for why we need transparency tools in the first place. I imagine the training process as a game played by two adversaries: a deceptive model, and an overseer trying to catch any deceptive reasoning. The issue for the overseer is that the deceptive model is opaque, which means that the overseer must use some tools to interpret its thoughts and behavior.

While by design, the deceptive model and overseer are comparably intelligent, the nature of using transparency tools enables a "weakest link" dynamic, implying that the deceptive model can exploit weaknesses in the transparency procedure to fool the overseer. In particular, deception can be hidden if the deceptive model has a good understanding of the deployment environment that isn't revealed while using transparency tools.

In the end, I consider whether this weakness implies that transparency tools act as unnecessary middlemen for catching deception, and use an analogy to GANs to argue that training an overseer to detect deception end-to-end might work better.

The model

Our goal is to take a neural network and detect whether there are any inputs to the network that will cause it to act unacceptably. By assumption, we are concerned about deceptive alignment, a scenario where a misaligned model understands its training process, and will therefore act aligned during training while plotting something else for later.

Since we want to detect deceptive thoughts, we are interested in opening up the model and seeing what goes on inside. We have access to an overseer that interprets the model and spends time thinking whether the model has inputs which cause it to act catastrophically, which will allow us to penalize the model accordingly. Crucially, this overseer is not able to understand the model weights by themselves: instead it relies on a transparency procedure that converts the model into a relatable abstraction, which it then uses to determine whether the model is deceptive.

Formally, we can imagine the dynamic as the following. We are training a model, $m$ which comes from a class of models $M$ . A transparency function maps models in $M$ to an abstract description of what the model does, $T : M \to D$ . Descriptions can be construed very broadly, and they can include natural language descriptions of the model's decision process, charts and graphs, or even pseudocode. This description is fed to an overseer $O : D \to {True, False}$ , which detects whether the description implies that the model will act deceptively in the deployment environment.

We can think of the transparency function as making a tradeoff between two variables: brevity of the description, and faithfulness of the description to the model $m$ . If the description is too long, or complex, then there is no point of using a transparency function in the first place, since it doesn't actually make the model any more transparent. On the other hand, if the transparency function produces descriptions that are too short, then another problem occurs: the overseer will miss information that allows it to deduce the deceptive capabilities of $m$ .

Achieving a good balance of these variables will determine whether we should use transparency tools in the first place. A good transparency scheme will ideally provide a concise description of how the model makes decisions, while ensuring that nothing important is left out. We will now explore an example of this dynamic in an image classifier, and I will suggest that striking a perfect balance could be difficult.

A deceptive image classifier

Suppose we are training a neural network to recognize cars using some advanced neural network design. This neural network is smarter than your average convolutional neural network. In particular, rather than imagining it only as a set of neural layers with edge detectors, we also imagine that it has adaptive computational layers that can change depending on the gradient updates. At the extreme end, the training process could even nudge the neural network into arbitrary computational graphs, which allows for if-statements, explicit memory, and loops.

Allowing the model to adapt its structure during the gradient updates is very important, as it enables the model to be more robust to different domain distributions. For illustration of this effect, consider if the gradient updates produce the following schematic for the model.

This design allows the model to be robust to different type of lighting, which demonstrates the utility of the adaptive approach. However, there's also a dark side to this adaptivity. While normally we think of models having only a choice over their input/output behavior, a sufficiently competent model could also understand the whole training procedure, which means that it could use gradient hacking to route a path through the space of computational graphs.

Since the deceptive model can ask questions about what setting they are in, they can exploit this functionality by inserting an if-statement that checks whether they are currently inside the training environment. During training, we can expect that the model will act safely. During deployment, however, the switch will be activated and the model will pursue its own devious plans.

That's where the transparency tools and overseer come in.

The transparency function has the job of creating a relatable description of the model that would allow the overseer to catch any attempted deception. Of course, the transparency function can't just rely on using the input/output behavior of the model during training to produce a faithful description. Doing this would create a misleading impression, as it wouldn't allow the overseer to detect the dangers involved in a distribution shift.

Instead, the transparency function must convert the pile of matrices, and thus the internal structure of the model, into a relatable abstraction. But now this allows for the following troubling dynamic.

If the deceptive model understands how the transparency tools work, then it can modify its own architecture so that the transparency function leaves out important information. In the above example, the deceptive model might realize that the deployment environment will occur at night, and therefore use this if-statement as its signal to perform a treacherous turn. Further, since the deceptive model understands the transparency tool, it knows how to place if-statements that will be glossed over.

The transparency tools are intended to highlight decisions like this, but by assumption, the transparency tools are less 'intelligent' than the deceptive model. Therefore, it may choose to omit this if-statement in its description of the model, and focus on information it believes is more critical.

In other words, not realizing that the nighttime-check was the signal for a treacherous turn, the transparency tool glosses over this if-statement, and doesn't provide it to the overseer.

This ends in catastrophe once the deception is successful. The overseer is comparably intelligent to the deceptive model, but since the if-statement above was omitted in the transparency description, the overseer was not able to deduce the dangerous off-distribution behavior.

Therefore, the deceptive model successfully fooled the overseer, even though both players in the game were comparably intelligent.

Why are we assuming that the transparency tools are less intelligent than the deceptive model?

An immediate objection is simply, "Yes but a good transparency tool would not have omitted the if-statement." Ultimately, I think this is a weak objection.

The first thing to note is that is that the transparency function wasn't totally dumb. Understanding the nature of the deployment environment is hard. The transparency function was just doing its job by trading off brevity and faithfulness. To make this point more compelling, we can consider that the true computational graph was really a large decision tree, with a huge branching factor for the various types of situations the model could be in.

In my example, the transparency tool is really just doing its best job of pruning the irrelevant parts of the tree. If you didn't know the important detail that the deployment environment was at night, then checking whether there is dark or dim lighting could simply be seen as a side feature that allowed the model to be more robust, and something like that can be omitted.

More importantly, if we allow the transparency tools to be more intelligent than the deceptive model, we must question the role that the overseer is playing in this dynamic. For instance, if the transparency model was exceptionally competent, then all it needs to do in its description is either say, "This model is deceptive" or say, "This model is not deceptive." Then, the overseer's job is trivial.

This suggests that decomposing safety-checking into two components: a transparency tool, and an overseer that interprets the results of the tool, is a bad model. If the transparency tools are too weak, then the deceptive model will exploit this weakness to fool the overseer, even if the overseer is very intelligent. On the other hand, if the tools are very strong, then the overseer plays no role at all -- suggesting that there was never any need to create interpretability tools, since a deception-checker trained end-to-end would have worked fine.

To better highlight why I think this casts doubt on transparency tools, I offer an analogy with GANs.

A fictional suggestion for training GANs

GANs are trained via an adversarial game between two neural networks. One neural network -- the generator -- is trained to produce a counterfeit sample when it is given some latent noise, and the other -- the discriminator -- is trained to detect whether a sample is counterfeit or from the real distribution.

Suppose someone were to propose the following way to help the discriminator: rather than training it end-to-end to help detect counterfeit samples, we instead make its input equal to the output of a transparency tool that gives it insight into how the current sample looks. The transparency tool could, for instance, highlight if it detects any weird shadows or bad lighting, and then feed this information to the discriminator to produce a final determination.

"Since the transparency tool focuses on the weaknesses of the generator, this will allow the discriminator to more easily detect counterfeit samples," the person argues.

But what the person failed to consider was that all this tool did was relax the job of the generator, since now all it needs to do is get past the weird-shadow and bad-lighting detector stage. It doesn't even matter how competent the discriminator is, since it's only given information from the transparency tool.

The person replies, "OK fine, we can give the discriminator the output from the tool, and the original sample. That way it can use the tool as a bonus to help it make the determination."

But now, the discriminator is operating over a larger dimension input-space than it would have otherwise, which makes its job harder. If it was simply trained end-to-end, then it could have developed a weird-shadow detector and bad-lighting detector on its own, if that were appropriate. But now it has extraneous information, which may or may not be useful for actually detecting counterfeit samples.

Conclusion

Contra Betteridge's law of headlines, I still think that transparency tools can be useful for catching deception. Also, this argument does not imply that transparency tools are useless for tasks other than catching deception. The biggest weakness with my argument is that the alternative I offered was training an overseer end-to-end, which is not possible if we are assuming a human is the overseer.

However, I don't think that this weakness is fatal to my argument: in order for transparency tools to work well, they must be strong enough that their weaknesses can't be exploited by a deceptive model. But we just argued above that if the tools were that strong, they would already be indistinguishable from a deception checker trained end-to-end.

Finally, I suspect people will object to my argument by saying that I am taking a very narrow approach to transparency tools, and that a model trained end-to-end to detect deception still counts as a transparency tool. I have little to say about this objection other than, I'm not arguing against that type of tool.

I agree that once you have a fully deceptively aligned model that's plotting against you and trying to figure out how to fool your transparency tools, there's not much hope left for you to be able to detect and remedy that issue. Importantly, however, that's not my model for how transparency could help you catch deception. Rather, the idea is that by using your transparency tools + overseer to guide your training process, you can prevent your training process from ever entering the regime where the model is trying to trick the transparency tools. This is especially important in the context of gradient hacking (as I mention in that post), since you have to inspect the entire training process if you want to know whether any gradient hacking occurred.

Put another way: once you're playing the game where I can hand you any model and then you have to figure out whether it's deceptive or not, you've already lost. Instead, you want to be in the regime where your training process is constructed so as to steer clear of situations in which your model might become deceptive in the first place. I see enforcing something like corrigibility or myopia as a way of doing this: before your model can become deceptive, it has to become non-myopic, which means if you can detect your model starting to become non-myopic, then you can prevent it from ever becoming deceptive.

From "Relaxed adversarial training for inner alignment:"

[For the recursion problem,] the recursion doesn't actually have to work for all $M \in M$ —rather, we only need our acceptability guarantee to hold along any path through model space that could actually be traversed by our training procedure.

Thus, it's fine if our procedure fails if given a sufficiently unacceptable model, as long as our procedure can prevent such models from ever arising. For example, if we always used an old version of $M$ that we were confident in for implementing $Amp (M)$ , then we would just need it to be the case that $M$ couldn't get to the point where it would be able to deceive $Amp (M_{old})$ in-between oversight.

Rather, the idea is that by using your transparency tools + overseer to guide your training process, you can prevent your training process from ever entering the regime where your model is trying to trick you. This is especially important in the context of gradient hacking (as I mention in that post)

Indeed. I re-read the post and I noticed that I hadn't realized how much of your reasoning applied directly as a reply to my argument.

Put another way: once you're playing the game where I can hand you any model and then you have to figure out whether it's deceptive or not, you've already lost.

For the record, I think this is probably false (unless the agent is much smarter than you). The analogy I had with GANs was an example of how this isn't necessarily true. If you just allowed an overseer to inspect the model, and trained that overseer end-to-end to detect deception, it would be very good at doing so. The main point of this post was not to give general pessimism about detecting deception, but to offer the idea that having a "tool" (as I defined it) for the overseer to use provides little more than a crutch, decreasing its effectiveness.

Now it is probably also true that monitoring the model's path through model-space during training would greatly enhance the safety guarantee. But I wouldn't go as far as saying that the alternative is doomed.

+1 and a frame that might be helpful is figuring out when adversarial training happens anyway and why. I think this happens with human management and might be illustrative.

My summary for the Alignment Newsletter:

Recent posts have been optimistic about using transparency tools to detect deceptive behavior. This post argues that we may not want to use transparency tools, because then the deceptive model can simply adapt to fool the transparency tools. Instead, we need something more like an end-to-end trained deception checker that's about as smart as the deceptive model, so that the deceptive model can't fool it.

My opinion:

In a comment, Evan Hubinger makes a point I agree with: the transparency tools don't need to be able to detect all deception; they just need to prevent the model from developing deception. If deception gets added slowly (i.e. the model doesn't "suddenly" become perfectly deceptive), then this can be way easier than detecting deception in arbitrary models, and could be done by tools.

It seems like you're now putting most of your hope on training a deception checker end-to-end, but that seems highly non-trivial to me. Can you talk about how you hope to do that? For example what training data would you use and why do you think the deception checker will be able to correctly generalize from the training examples to the real models that the it will be applied to?