I’m worried about scenarios involving deceptive models. We’ve failed at inner alignment so the model has goals that are not aligned with ours. It can somehow detect when it’s in training, and during training it pretends to share our goals. During deployment, surprise! The model paperclips the universe.

In this story deception is all about the model having hidden behaviors that never get triggered during training. There are many inputs where the model does what we want. These correspond to inputs where it ‘plays along’. Then on some rare input or an input outside of the training distribution it defects, triggering some catastrophic behavior that corresponds to maximizing an ulterior motive.

This suggests that we should be wary of models that have logic/subcircuits/states that are never used during training, and we can penalize models for having such hidden behaviors.

In this post I'm focused on examples of *training trace priors*, which are priors that make use of execution traces from the training distribution. I think these provide at least a few bits of optimization away from deception.

These are very much working notes, and feedback is very welcome.

# One-Gate Trace Prior: Entropy

Consider a boolean circuit. The output of each gate is 0/1. One way to implement the entropy prior is to record traces of these outputs during training, and penalize the model for having a strong bias between 0/1:

$$\ln p(\text{model}) \supset \sum_i S(p_i), \qquad S(p_i) = -p_i \ln p_i - (1 - p_i)\ln(1 - p_i)$$

Here $p_i$ is the observed fraction of the time gate $i$ produces output $1$ on the training distribution. We call this the one-gate trace prior because it relies on probabilities defined over single gates. This prior favors models where each gate has even odds of producing 0/1.
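As a concrete sketch (my own illustration, not code from the post), assuming the penalty is the negative binary entropy of each gate's observed output frequency, computed from a matrix of recorded traces:

```python
import numpy as np

def one_gate_entropy_penalty(traces):
    """Negative summed binary entropy of per-gate output frequencies.

    traces: (num_samples, num_gates) array of 0/1 gate outputs recorded
    while running the circuit on the training distribution.
    Minimizing this penalty pushes each gate toward even 0/1 odds.
    """
    p = traces.mean(axis=0)            # p_i: fraction of samples where gate i output 1
    p = np.clip(p, 1e-9, 1 - 1e-9)     # avoid log(0) for fully biased gates
    entropy = -p * np.log(p) - (1 - p) * np.log(1 - p)
    return -entropy.sum()
```

A balanced gate contributes $-\ln 2$ to the penalty, while a gate that is always 0 or always 1 contributes roughly 0, so biased gates are disfavored.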

The intuition for why this might work is that deceptive models need some machinery specifying when and how they defect, and *this machinery is never active during training*. This is at the core of what makes them bad! If the defecting machinery activated during training we’d notice because the model would stop optimizing the outer goal, so we’d assign it a high loss and move on to better models.

Another framing is: It is inherently suspicious if a model has machinery that is rarely activated. That machinery wasn’t vetted during training, so we have no reason to trust it. If the model has a hidden love of paperclips, this is the natural place to hide it.

As a technical detail, in order to do useful computation the model may need access to some constant input wires, and these should be excluded from the prior.

## Normalization

The one-gate trace prior isn’t normalized, and can’t be normalized without an additional prior over e.g. complexity. So in reasoning about it we need to remember that there will always be another piece to the prior setting the distribution over circuits. I’m agnostic to the details of that prior, and view the one-gate trace prior as a regularizer we apply to other (normalized) priors to make them safer.

## Objection: Model Performance

This prior could cause a performance hit. The intuition is that sometimes a model will need to respond to unusual events even in the training distribution, so forcing the distribution over gate outputs to be as uniform as possible could make it hard to act in environments containing rare events.

In particular, this prior disfavors modularity. Suppose the model is meant to classify images. It might be natural to have separate sub-circuits for recognizing cats and dogs. If cats are much more likely than dogs during training, our prior will disfavor models that separate that machinery out, because the ‘dog’ sub-circuit will rarely be active and the ‘cat’ sub-circuit will often be active.

If the pressure to perform well during training is strong enough this could push models towards obfuscating their calculations just to hide from the prior, which is bad.

# Patch: Don’t use entropy

We can make our prior more forgiving of modularity and biased inputs by only disfavoring really rare activations, e.g.

$$\ln p(\text{model}) \supset \sum_i \ln\left(1 - e^{-\beta p_i}\right) + \ln\left(1 - e^{-\beta (1 - p_i)}\right)$$

where $p_i$ is again the observed fraction of the time gate $i$ outputs $1$ during training. Here $\beta$ is a positive number, and as we make $\beta$ large the prior becomes indifferent to the middle of the probability distribution: each factor $\left(1 - e^{-\beta p}\right)\left(1 - e^{-\beta (1-p)}\right)$ is flat and near $1$ over most of $p \in [0, 1]$, falling to $0$ only as $p$ approaches $0$ or $1$, and the flat region widens as $\beta$ grows.

How should we pick $\beta$? We just want assurances that all the gears in the model got tried out at least a few times during training. So we’re looking to see that each gate produced each output at least $q$ times during training, where $q$ is fixed.

Suppose we have $N$ training samples. We want our prior to be flat over $q/N \lesssim p_i \lesssim 1 - q/N$ and to penalize models outside that window, which means $\beta \sim N/q$. That is, as we get more data we can make our prior indifferent to larger windows because we’re able to observe rarer behavior.

In this framing, $q$ is a sort of safety parameter: larger $q$ means the prior is more suspicious of uneven distributions and requires that we see more examples before we trust the model.
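One way the patched penalty might be computed, assuming a soft window of the form $\ln(1 - e^{-\beta p_i}) + \ln(1 - e^{-\beta(1-p_i)})$ with $\beta = N/q$ (my parameterization and clipping choices, not necessarily the post's):

```python
import numpy as np

def patched_trace_penalty(traces, q=10):
    """Penalty that is flat for gates activating well inside [q/N, 1 - q/N]
    and grows sharply as a gate's output frequency approaches 0 or 1.

    traces: (N, num_gates) array of 0/1 gate outputs on training samples.
    q: safety parameter, how many times we want to see each output.
    """
    N = traces.shape[0]
    beta = N / q                           # window edge sits at p ~ q/N
    p = traces.mean(axis=0)
    p = np.clip(p, 0.5 / N, 1 - 0.5 / N)   # keep the logs finite for constant gates
    log_prior = np.log(1 - np.exp(-beta * p)) + np.log(1 - np.exp(-beta * (1 - p)))
    return -log_prior.sum()
```

With $N = 1000$ and $q = 10$, a gate active 10% of the time is essentially unpenalized, while a gate active once in a thousand samples pays a substantial penalty.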

## Relation to Performance

This patch resolves the performance issues we saw using entropy. Modular circuits are totally fine under this prior so long as they all get checked out during training. Even if dog images come up less often than cat images, they still have to come up a decent number of times for us to be confident the model has learned what a dog is, and if that number is at least $q$ then we won’t penalize the model for having separate dog and cat circuits.

## Implementation

A naive way to implement this prior is to run the circuit on extra samples per training evaluation, and to use traces from those extra evaluations to compute the prior. This is obviously expensive if the number of extra evaluations is large, though only linearly so.

If we’re dealing with a neural network instead of a boolean circuit we can reinforce pathways taken during training. Specifically, the first $q$ times a neuron is active (has a positive sum), and the first $q$ times it is inactive, we reinforce all weights emerging from that neuron.

As a technical detail, activations early in training probably shouldn’t ‘count’ as exploring the model’s behavior space. That can probably be addressed by only tracking the most recent $N$ training samples when evaluating whether a given neuron has activated $q$ times.
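A sliding-window tracker along these lines might look as follows (the class name and interface are hypothetical, not from the post):

```python
from collections import deque

class ActivationTracker:
    """Track how often each neuron was active over the last `window`
    training samples, so that early-training activations age out."""

    def __init__(self, num_neurons, window=1000):
        self.num_neurons = num_neurons
        self.history = deque(maxlen=window)  # each entry: one bool per neuron

    def record(self, active):
        """active: iterable of bools, one per neuron, for this sample."""
        self.history.append(list(active))

    def vetted(self, q):
        """Per-neuron flags: seen active at least q times AND inactive at
        least q times within the recent window."""
        on = [0] * self.num_neurons
        for sample in self.history:
            for i, a in enumerate(sample):
                if a:
                    on[i] += 1
        n = len(self.history)
        return [on[i] >= q and n - on[i] >= q for i in range(self.num_neurons)]
```

Because the deque has a fixed maximum length, activations from before the most recent $N$ samples stop counting automatically.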

## Objection: Moving Computation

One way that deception can hide from this prior is by changing the location on the circuit where the deceptive conditional (“Am I in training or deployment?”) gets evaluated. For instance, if the model can make this conditional get evaluated in different places in the circuit depending on the inputs it receives, it is no longer helpful to compare the behavior of any single gate across evaluations.

This seems like a very big problem, because it only takes a small number of bits of model complexity to make the relevant evaluation shift around on the circuit. In a future post I’ll flesh out this failure mode further and consider additional patches.

# Summary

Training trace priors provide a way of looking for hidden behaviors, which are signs of deception. This makes them very closely related to methods of *fuzzing*, where automated systems try to explore the space of program behaviors.

The specific priors I’ve constructed above are probably not too helpful on their own, but my hope is that there is a version of a trace prior that more strongly pushes away from deception.

*(Thanks to Evan Hubinger and Nicholas Schiefer for suggestions and discussions around these ideas)*

I'm not sure branch coverage metrics are not easily beaten. I'm reminded of the Day & Night CA, which is Turing-complete yet completely symmetric, or reversible computing like flow models. Or think of interpreters: at the level of interpreter operations, a malicious program can use the same number and quantity of operations as a good one, and may well have to if it's doing things like weird machines or return-oriented programming - if you build your hack out of gadgets found in code already installed on the system like Firefox, then it's going to look very similar, if not identical, to running Firefox (that's kinda the point).

Let's say I want to create a NN where I can encode arbitrary computations, including deceptive ones, which satisfy the property of 'every neuron fires half the time during training'. Aren't there lots of ways to do this? For example, just off the top of my head, I could have an input layer which 'chooses' one of two branches at random; the branches mirror the NN and have one half's neurons be the opposite of the other half, and at the output layer, throw away the opposite half's output. By construction, my activation statistics will now be exactly 50%: if a working neuron fires while computing an example, its mirror will not, and vice versa, and they each run half the time. This can now represent anything that half the weights can represent, and can sculpt any distribution of activations you want, I think. This is an explicit construction, but it could be implicit, like if it learns to repeat the input and masks out half the features, perhaps using some extremely variable property like the invisible-to-humans non-robust features classifiers exploit as its RNG. I'm sure it can be made more efficient too, if we allow the branches to swap computations carefully: if you can split any computation, I think you could just keep pulling the same trick (just keep splitting each branch in half).
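For what it's worth, the branch-mirroring construction can be demonstrated in a few lines (the toy `working_half` function is an arbitrary stand-in; any computation works):

```python
import numpy as np

rng = np.random.default_rng(0)

def working_half(x):
    """Stand-in for the real (possibly deceptive) computation:
    five binary 'neurons' reading off bits of the input."""
    return np.array([(x >> k) & 1 for k in range(5)], dtype=bool)

def mirrored_net(x, branch):
    """A random branch bit decides which half does the real work; the other
    half computes the complement. An output layer (not shown) would read
    only the working branch and discard the mirror."""
    h = working_half(x)
    return np.concatenate([h, ~h]) if branch == 0 else np.concatenate([~h, h])

xs = rng.integers(0, 32, size=20000)
branches = rng.integers(0, 2, size=20000)
traces = np.array([mirrored_net(x, b) for x, b in zip(xs, branches)])
freqs = traces.mean(axis=0)
# every neuron fires ~50% of the time, regardless of what working_half computes
```

Each neuron fires with probability exactly 1/2 in expectation, because the branch bit is independent of the input.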

I agree that many coverage-style metrics can be broken, probably easily, and that this includes the priors I described. I also think your explicit construction is right, and is a special case of a concern I mentioned in the post ("changing the location on the circuit where the deceptive conditional gets evaluated").

I don't think the specific construction you mention is terribly problematic because it requires doubling the size of the circuit, which is easy to penalize with a circuit complexity prior, so I'm much more worried about implicit cases, which I think could get the penalty down to just a few extra gates. That's why I think the current trace priors I know of only buy you a few bits of optimization pressure away from deception (you have to work just a little harder to hide what you're doing).

I'm currently looking for patches to this concern, but haven't found any yet with good properties (and maybe they just don't exist?). For instance, looking at correlations between two gates at a time handles the explicit construction, but is still vulnerable to this class of attack (e.g. if the circuit implements an interpreter it seems easy to arrange for any given calculation to land on a different part of the circuit in each training evaluation).
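The two-gate statistic could be sketched like this (the function name and threshold are illustrative): it flags pairs of gates whose traces are almost perfectly anti-correlated, which is the signature the explicit mirror construction leaves.

```python
import numpy as np

def flag_mirror_pairs(traces, tol=0.99):
    """Return gate pairs (i, j) whose 0/1 traces are anti-correlated beyond
    `tol` across training samples: exactly what an explicit
    branch-mirroring construction produces."""
    t = traces.astype(float)
    t -= t.mean(axis=0)
    norms = np.sqrt((t ** 2).sum(axis=0))
    norms[norms == 0] = 1.0                # constant gates: leave correlation at 0
    corr = (t.T @ t) / np.outer(norms, norms)
    i, j = np.where(np.triu(corr < -tol, k=1))
    return list(zip(i.tolist(), j.tolist()))
```

As noted, this catches the explicit construction but not attacks that move the relevant computation to a different part of the circuit on each evaluation.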

I see. I guess I would then say a broader concern with this sort of regularization approach is that it incentivizes the network to move towards a highly distributed representation, or one which very easily permutes its weights (both of which are things that happen already with no particular incentive), right from the start - not because it is traveling towards a deceptive network (it's far too stupid and unoptimized for deception to even be an option at initialization), but because this sort of regularization impedes normal learning.

You hint at this with the modularity section, but I think you get that problem even without modularity. Let's say that we are training a dog classifier, with no cats at all, and therefore modularity and deception cannot be involved even in theory; it should learn to output a probability of P(dog)=1, right? That should be super-easy, shouldn't it? But how will it do so when it has been initialized with a large number of random parameters which mean that dog photo by dog photo (each one radically different in pixel space, and translating to radically different embeddings after passing through the naive random initialization), it will have very different magnitude output and layer-by-layer or neuron-by-neuron activation distributions, and to update it to get closer to the right answer, it must inhibit various neurons from activating, activate others, work around 'dead neurons' which don't fire, and so on, all of which are in violent conflict with a regularizer forcing every neuron to be activated half the time? If you enforce your desired correlations hard enough, it may not learn at all; if it does learn, it seems entirely possible to me that it may ignore the task entirely until it has finally bounced around into an area of model space where SGD can make P(dog) consistently increase towards 1 without the regularizer instantly undoing its work, because it found some sort of permutation or scramble which maintains the desired correlation patterns at the neuron level, 'fooling' the regularizer from your perspective, while updating the broader distributed representation and finally then solving the task. (Similar to grokking or patient teacher or wide basins or meta-learning approaches involving gradient descent.) Since beating the regularizer is a price that must be paid equally by deceptive and non-deceptive algorithms in this scenario, and has already been paid by the time nontrivial learning begins, it no longer penalizes deception at all.

I think I agree that the incentive points in that direction, though I'm not sure how strongly. My general intuition is that if certain wires in a circuit are always activated across the training distribution then something has gone wrong. Maybe this doesn't translate as well to neural networks (where there is more information conveyed than just 'True/False')? Does that suggest that there's a better way to implement this in the case of neural networks (maybe we should be talking about distributions of activations, and requesting that these be broad?).

On the specifics, I think I'm confused as to what your dog classifier is. What work is it doing, if it always outputs "this is a dog"? More generally, if a subcircuit always produces the same output I would rather have it replaced with constant wires.

My point is that, like in the AI koan, a random circuit, or a random NN, still does *something*. Like, if you feed in your dog photos, it'll start off predicting 1% for this one, 25.78% for that one, 99.76% for this other one... This is just because it is filled with random parameters at initialization and when you feed in your photos, each neuron computes something. Something totally nonsensical, but something nonetheless, and during that something, each neuron will have a distribution of activations which will almost surely not exactly equal 50% and not be independent of every other neuron. Thus, your NN is born steeped deep in sin from the perspective of the regularizer. Of course it could be replaced by a single wire, but 'replace all the parameters of a large complex NN with a single constant wire in a single step' is not an operation that SGD can do, so it won't. (What will it compute after it finally beats the regularizer and finds a set of parameters which will let SGD reduce its loss while still satisfying the regularization constraints? I'm not sure, but I bet it'd look like a nasty hash-like mess, which simply happens to be independent of its input on average.)

Ok, I see. Thanks for explaining!

One thing to note, which might be a technical quibble, is that I don't endorse the entropy version of this prior (which is the one that wants 50/50 activations). I started off with it because it's simpler, but I think it breaks for exactly the reasons you say, which is why I prefer the version that wants to see "Over the last N evaluations, each gate evaluated to T at least q times and to F at least q times, where q << N." This is very specifically so that there isn't a drive to unnaturally force the percentages towards 50% when the true input distribution is different from that.

Setting that aside: I think what this highlights is that the translation from "a prior over circuits" to "a regularizer for NN's" is pretty nontrivial, and things that are reasonably behaved in one space can be very bad in the other. If I'm sampling boolean circuits from a one-gate trace prior I just immediately find the solution of 'they're all dogs, so put a constant wire in'. Whereas with neural networks we can't jump straight to that solution and may end up doing more contrived things along the way.

Yeah, I skipped over that because I don't see how one would implement that. That doesn't sound very differentiable? Were you thinking of perhaps some sort of evolutionary approach with that as part of a fitness function? Even if you have some differentiable trick for that, it's easier to explain my objections concretely with 50%. But I don't have anything further to say about that at the moment.

Absolutely. You are messing around with weird machines and layers of interpreters, and simple security properties or simple translations go right out the window as soon as you have anything adversarial or optimization-related involved.

That would work, yeah. I was thinking of an approach based on making ad-hoc updates to the weights (beyond SGD), but an evolutionary approach would be much cleaner!

Not necessarily - depends on how abstractly we're considering behaviours. (It also depends on how likely we are to detect the bad behaviours during training.)

Consider an AI trained on addition problems that is only exposed to a few problems that look like 1+3=4, 3+7=10, 2+5=7, 2+6=8 during training, where there are two summands which are each a single digit and they appear in ascending order. Now at inference time the model is exposed to 10+2= and outputs 12.

Have we triggered a hidden behaviour that was never encountered in training? Certainly these inputs were never encountered, and there's maybe a meaningful difference in the new input, since it involves multiple digits and out-of-order summands. But it seems possible that exactly the same learned algorithm is being applied now as was being applied during the late stages of training, and so there won't be some new parts of the model being activated for the first time.

Deceptive behaviour might be a natural consequence of the successful learned algorithms when they are exposed to appropriate inputs, rather than different machinery that was never triggered during training.

Right. Maybe a better way to say it is:

The two together give a bit of a lever that I think we can use to bias away from deception if we can find the right operational notion of hidden behaviors.

This probably doesn't work, but have you thought about just using weight decay as a (partial) solution to this? In any sort of architecture with residual connections you should expect circuits to manifest as weights with nontrivial magnitude. If some set of weights isn't contributing to the loss then the gradients won't prevent them from being pushed toward zero by weight decay. Sort of a "use it or lose it" type thing. This seems a lot simpler and potentially more robust than other approaches.
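Here's a toy illustration of the "use it or lose it" dynamic, assuming plain SGD with weight decay on two scalar weights (the numbers are arbitrary): a weight the loss never touches decays geometrically, while a used weight settles near the value its gradients support.

```python
# Toy "use it or lose it": SGD with weight decay on two scalar weights.
lr, wd, steps = 0.1, 0.01, 10_000
w_used, w_unused = 1.0, 1.0
for _ in range(steps):
    grad_used = w_used - 1.0            # toy loss (w - 1)^2 / 2 keeps this weight in use
    w_used -= lr * (grad_used + wd * w_used)
    w_unused -= lr * (wd * w_unused)    # no loss gradient: decay is unopposed
# w_unused has decayed toward 0; w_used sits at the fixed point 1 / (1 + wd)
```

The used weight balances its gradient against the decay term and settles at $1/(1+\mathrm{wd}) \approx 0.99$, while the unused weight shrinks by a factor $(1 - \mathrm{lr}\cdot\mathrm{wd})$ every step.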