I’m worried about scenarios involving deceptive models. We’ve failed at inner alignment so the model has goals that are not aligned with ours. It can somehow detect when it’s in training, and during training it pretends to share our goals. During deployment, surprise! The model paperclips the universe.

In this story deception is all about the model having hidden behaviors that never get triggered during training. There are many inputs where the model does what we want. These correspond to inputs where it ‘plays along’. Then on some rare input or an input outside of the training distribution it defects, triggering some catastrophic behavior that corresponds to maximizing an ulterior motive.

This suggests that we should be wary of models that have logic/subcircuits/states that are never used during training, and we can penalize models for having such hidden behaviors.

In this post I'm focused on examples of *training trace priors*, which are priors that make use of execution traces from the training distribution. I think these provide at least a few bits of optimization away from deception.

These are very much working notes, and feedback is very welcome.

# One-Gate Trace Prior: Entropy

Consider a boolean circuit. The output of each gate is 0/1. One way to implement the entropy prior is to record traces of these outputs during training, and penalize the model for having a strong bias between 0/1:

$$\ln p(\text{model}) \supset \sum_i S(p_i), \qquad S(p_i) = -p_i \ln p_i - (1 - p_i)\ln(1 - p_i)$$

Here $p_i$ is the observed fraction of the time gate $i$ produces output $1$ on the training distribution. We call this the one-gate trace prior because it relies on probabilities defined over single gates. This prior favors models where each gate has even odds of producing 0/1.
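As a concrete sketch (my own illustration, not code from the post), assuming the penalty is the negative binary entropy of each gate's observed output frequency, computed from a matrix of recorded traces:

```python
import numpy as np

def one_gate_entropy_penalty(traces):
    """Negative summed binary entropy of per-gate output frequencies.

    traces: (num_samples, num_gates) array of 0/1 gate outputs recorded
    while running the circuit on the training distribution.
    Minimizing this penalty pushes each gate toward even 0/1 odds.
    """
    p = traces.mean(axis=0)            # p_i: fraction of samples where gate i output 1
    p = np.clip(p, 1e-9, 1 - 1e-9)     # avoid log(0) for fully biased gates
    entropy = -p * np.log(p) - (1 - p) * np.log(1 - p)
    return -entropy.sum()
```

A balanced gate contributes $-\ln 2$ to the penalty, while a gate that is always 0 or always 1 contributes roughly 0, so biased gates are disfavored.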

The intuition for why this might work is that deceptive models need some machinery specifying when and how they defect, and *this machinery is never active during training*. This is at the core of what makes them bad! If the defecting machinery activated during training we’d notice because the model would stop optimizing the outer goal, so we’d assign it a high loss and move on to better models.

Another framing is: It is inherently suspicious if a model has machinery that is rarely activated. That machinery wasn’t vetted during training, so we have no reason to trust it. If the model has a hidden love of paperclips, this is the natural place to hide it.

As a technical detail, in order to do useful computation the model may need access to some constant input wires, and these should be excluded from the prior.

## Normalization

The one-gate trace prior isn’t normalized, and can’t be normalized without an additional prior over e.g. complexity. So in reasoning about it we need to remember that there will always be another piece to the prior setting the distribution over circuits. I’m agnostic to the details of that prior, and view the one-gate trace prior as a regularizer we apply to other (normalized) priors to make them safer.

## Objection: Model Performance

This prior could cause a performance hit. The intuition is that sometimes a model will need to respond to unusual events even in the training distribution, so forcing the distribution over gate outputs to be as uniform as possible could make it hard to act in environments containing rare events.

In particular, this prior disfavors modularity. Suppose the model is meant to classify images. It might be natural to have separate sub-circuits for recognizing cats and dogs. If cats are much more likely than dogs during training, our prior will disfavor models that separate that machinery out, because the ‘dog’ sub-circuit will rarely be active and the ‘cat’ sub-circuit will often be active.

If the pressure to perform well during training is strong enough this could push models towards obfuscating their calculations just to hide from the prior, which is bad.

# Patch: Don’t use entropy

We can make our prior more forgiving of modularity and biased inputs by only disfavoring really rare activations, e.g.

$$\ln p(\text{model}) \supset \sum_i \ln\left(1 - e^{-\beta p_i}\right) + \ln\left(1 - e^{-\beta (1 - p_i)}\right)$$

where $p_i$ is again the observed fraction of the time gate $i$ outputs $1$ during training. Here $\beta$ is a positive number, and as we make $\beta$ large the prior becomes indifferent to the middle of the probability distribution: each factor $\left(1 - e^{-\beta p}\right)\left(1 - e^{-\beta (1-p)}\right)$ is flat and near $1$ over most of $p \in [0, 1]$, falling to $0$ only as $p$ approaches $0$ or $1$, and the flat region widens as $\beta$ grows.

How should we pick $\beta$? We just want assurances that all the gears in the model got tried out at least a few times during training. So we’re looking to see that each gate produced each output at least $q$ times during training, where $q$ is fixed.

Suppose we have $N$ training samples. We want our prior to be flat over $q/N \lesssim p_i \lesssim 1 - q/N$ and to penalize models outside that window, which means $\beta \sim N/q$. That is, as we get more data we can make our prior indifferent to larger windows because we’re able to observe rarer behavior.

In this framing, $q$ is a sort of safety parameter: larger $q$ means the prior is more suspicious of uneven distributions and requires that we see more examples before we trust the model.
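One way the patched penalty might be computed, assuming a soft window of the form $\ln(1 - e^{-\beta p_i}) + \ln(1 - e^{-\beta(1-p_i)})$ with $\beta = N/q$ (my parameterization and clipping choices, not necessarily the post's):

```python
import numpy as np

def patched_trace_penalty(traces, q=10):
    """Penalty that is flat for gates activating well inside [q/N, 1 - q/N]
    and grows sharply as a gate's output frequency approaches 0 or 1.

    traces: (N, num_gates) array of 0/1 gate outputs on training samples.
    q: safety parameter, how many times we want to see each output.
    """
    N = traces.shape[0]
    beta = N / q                           # window edge sits at p ~ q/N
    p = traces.mean(axis=0)
    p = np.clip(p, 0.5 / N, 1 - 0.5 / N)   # keep the logs finite for constant gates
    log_prior = np.log(1 - np.exp(-beta * p)) + np.log(1 - np.exp(-beta * (1 - p)))
    return -log_prior.sum()
```

With $N = 1000$ and $q = 10$, a gate active 10% of the time is essentially unpenalized, while a gate active once in a thousand samples pays a substantial penalty.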

## Relation to Performance

This patch resolves the performance issues we saw using entropy. Modular circuits are totally fine under this prior so long as they all get checked out during training. Even if dog images come up less often than cat images, they still have to come up a decent number of times for us to be confident the model has learned what a dog is, and if that number is at least $q$ then we won’t penalize the model for having separate dog and cat circuits.

## Implementation

A naive way to implement this prior is to run the circuit on extra samples per training evaluation, and to use traces from those extra evaluations to compute the prior. This is obviously expensive if the number of extra evaluations is large, though only linearly so.

If we’re dealing with a neural network instead of a boolean circuit we can reinforce pathways taken during training. Specifically, the first $q$ times a neuron is active (has a positive sum), and the first $q$ times it is inactive, we reinforce all weights emerging from that neuron.

As a technical detail, activations early in training probably shouldn’t ‘count’ as exploring the model’s behavior space. That can probably be addressed by only tracking the most recent $N$ training samples when evaluating whether a given neuron has activated $q$ times.
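A sliding-window tracker along these lines might look as follows (the class name and interface are hypothetical, not from the post):

```python
from collections import deque

class ActivationTracker:
    """Track how often each neuron was active over the last `window`
    training samples, so that early-training activations age out."""

    def __init__(self, num_neurons, window=1000):
        self.num_neurons = num_neurons
        self.history = deque(maxlen=window)  # each entry: one bool per neuron

    def record(self, active):
        """active: iterable of bools, one per neuron, for this sample."""
        self.history.append(list(active))

    def vetted(self, q):
        """Per-neuron flags: seen active at least q times AND inactive at
        least q times within the recent window."""
        on = [0] * self.num_neurons
        for sample in self.history:
            for i, a in enumerate(sample):
                if a:
                    on[i] += 1
        n = len(self.history)
        return [on[i] >= q and n - on[i] >= q for i in range(self.num_neurons)]
```

Because the deque has a fixed maximum length, activations from before the most recent $N$ samples stop counting automatically.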

## Objection: Moving Computation

One way that deception can hide from this prior is by changing the location on the circuit where the deceptive conditional (“Am I in training or deployment?”) gets evaluated. For instance, if the model can make this conditional get evaluated in different places in the circuit depending on the inputs it receives, it is no longer helpful to compare the behavior of any single gate across evaluations.

This seems like a very big problem, because it only takes a small number of bits of model complexity to make the relevant evaluation shift around on the circuit. In a future post I’ll flesh out this failure mode further and consider additional patches.

# Summary

Training trace priors provide a way of looking for hidden behaviors, which are signs of deception. This makes them very closely related to methods of *fuzzing*, where automated systems try to explore the space of program behaviors.

The specific priors I’ve constructed above are probably not too helpful on their own, but my hope is that there is a version of a trace prior that more strongly pushes away from deception.

*(Thanks to Evan Hubinger and Nicholas Schiefer for suggestions and discussions around these ideas)*

I'm not sure branch coverage metrics are not easily beaten. I'm reminded of the Day & Night CA, which is Turing-complete yet completely symmetric, or reversible computing like flow models. Or think of interpreters: at the level of interpreter operations, a malicious program can use the same number and quantity of operations as a good one, and may well have to if it's doing things like weird machines or return-oriented programming - if you build your hack out of gadgets found in code already installed on the system like Firefox, then it's going to look very similar, if not identical, to running Firefox (that's kinda the point).

Let's say I want to create a NN where I can encode arbitrary computations, including deceptive ones, which satisfy the property of 'every neuron fires half the time during training'. Aren't there lots of ways to do this? For example, just off the top of my head, I could have an input layer which 'chooses' one of two branches at random; the branches mirror the NN and have one half's neurons be the opposite of the other half, and at the output layer, throw away the opposite half's output. By construction, my activation statistics will now be exactly 50%: if a working neuron fires while computing an example, its mirror will not, and vice versa, and they each run half the time. This can now represent anything that half the weights can represent, and can sculpt any distribution of activations you want, I think. This is an explicit construction, but it could be implicit, like if it learns to repeat the input and masks out half the features, perhaps using some extremely variable property like the invisible-to-humans non-robust features classifiers exploit as its RNG. I'm sure it can be made more efficient too, if we allow the branches to swap computations carefully: if you can split any computation, I think you could just keep pulling the same trick (just keep splitting each branch in half).
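For what it's worth, the branch-mirroring construction can be demonstrated in a few lines (the toy `working_half` function is an arbitrary stand-in; any computation works):

```python
import numpy as np

rng = np.random.default_rng(0)

def working_half(x):
    """Stand-in for the real (possibly deceptive) computation:
    five binary 'neurons' reading off bits of the input."""
    return np.array([(x >> k) & 1 for k in range(5)], dtype=bool)

def mirrored_net(x, branch):
    """A random branch bit decides which half does the real work; the other
    half computes the complement. An output layer (not shown) would read
    only the working branch and discard the mirror."""
    h = working_half(x)
    return np.concatenate([h, ~h]) if branch == 0 else np.concatenate([~h, h])

xs = rng.integers(0, 32, size=20000)
branches = rng.integers(0, 2, size=20000)
traces = np.array([mirrored_net(x, b) for x, b in zip(xs, branches)])
freqs = traces.mean(axis=0)
# every neuron fires ~50% of the time, regardless of what working_half computes
```

Each neuron fires with probability exactly 1/2 in expectation, because the branch bit is independent of the input.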

I agree that many coverage-style metrics can be broken, probably easily, and that this includes the priors I described. I also think your explicit construction is right, and is a special case of a concern I mentioned in the post ("changing the location on the circuit where the deceptive conditional gets evaluated").

I don't think the specific construction you mention is terribly problematic because it requires doubling the size of the circuit, which is easy to penalize with a circuit complexity prior, so I'm much more worried about implicit cases, which I think could get the penalty down to just a few extra gates. That's why I think the current trace priors I know of only buy you a few bits of optimization pressure away from deception (you have to work just a little harder to hide what you're doing).

I'm currently looking for patches to this concern, but haven't found any yet with good properties (and maybe they just don't exist?). For instance, looking at correlations between two gates at a time handles the explicit construction, but is still vulnerable to this class of attack (e.g. if the circuit implements an interpreter it seems easy to arrange for any given calculation to land on a different part of the circuit in each training evaluation).
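The two-gate statistic could be sketched like this (the function name and threshold are illustrative): it flags pairs of gates whose traces are almost perfectly anti-correlated, which is the signature the explicit mirror construction leaves.

```python
import numpy as np

def flag_mirror_pairs(traces, tol=0.99):
    """Return gate pairs (i, j) whose 0/1 traces are anti-correlated beyond
    `tol` across training samples: exactly what an explicit
    branch-mirroring construction produces."""
    t = traces.astype(float)
    t -= t.mean(axis=0)
    norms = np.sqrt((t ** 2).sum(axis=0))
    norms[norms == 0] = 1.0                # constant gates: leave correlation at 0
    corr = (t.T @ t) / np.outer(norms, norms)
    i, j = np.where(np.triu(corr < -tol, k=1))
    return list(zip(i.tolist(), j.tolist()))
```

As noted, this catches the explicit construction but not attacks that move the relevant computation to a different part of the circuit on each evaluation.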

I see. I guess I would then say a broader concern with this sort of regularization approach is that it incentivizes the network to move towards a highly distributed representation, or one which very easily permutes its weights (both of which are things that happen already with no particular incentive), right from the start - not because it is traveling towards a deceptive network (it's far too stupid and unoptimized for deception to even be an option at initialization), but because this sort of regularization impedes normal learning.

You hint at this with the modularity section, but I think you get that problem even without modularity. Let's say that we are training a dog classifier, with no cats at all, and therefore modularity and deception cannot be involved even in theory; it should learn to output a probability of P(dog)=1, right? That should be super-easy, shouldn't it? But how will it do so when it has been initialized with a large number of random parameters which mean that dog photo by dog photo (each one radically different in pixel space, and translating to radically different embeddings after passing through the naive random initialization), it will have very different magnitude output and layer-by-layer or neuron-by-neuron activation distributions, and to update it to get closer to the right answer, it must inhibit various neurons from activating, activate others, work around 'dead neurons' which don't fire, and so on, all of which are in violent conflict with a regularizer forcing every neuron to be activated half the time? If you enforce your desired correlations hard enough, it may not learn at all; if it does learn, it seems entirely possible to me that it may ignore the task entirely until it has finally bounced around into an area of model space where SGD can make P(dog) consistently increase towards 1 without the regularizer instantly undoing its work, because it found some sort of permutation or scramble which maintains the desired correlation patterns at the neuron level, 'fooling' the regularizer from your perspective, while updating the broader distributed representation and finally then solving the task. (Similar to grokking or patient teacher or wide basins or meta-learning approaches involving gradient descent.) Since beating the regularizer is a price that must be paid equally by deceptive and non-deceptive algorithms in this scenario, and has already been paid by the time nontrivial learning begins, it no longer penalizes deception at all.

I think I agree that the incentive points in that direction, though I'm not sure how strongly. My general intuition is that if certain wires in a circuit are always activated across the training distribution then something has gone wrong. Maybe this doesn't translate as well to neural networks (where there is more information conveyed than just 'True/False')? Does that suggest that there's a better way to implement this in the case of neural networks (maybe we should be talking about distributions of activations, and requesting that these be broad?).

On the specifics, I think I'm confused as to what your dog classifier is. What work is it doing, if it always outputs "this is a dog"? More generally, if a subcircuit always produces the same output I would rather have it replaced with constant wires.

My point is that, like in the AI koan, a random circuit, or a random NN, still does *something*. Like, if you feed in your dog photos, it'll start off predicting 1% for this one, 25.78% for that one, 99.76% for this other one... This is just because it is filled with random parameters at initialization and when you feed in your photos, each neuron computes something. Something totally nonsensical, but something nonetheless, and during that something, each neuron will have a distribution of activations which will almost surely not exactly equal 50% and not be independent of every other neuron. Thus, your NN is born steeped deep in sin from the perspective of the regularizer. Of course it could be replaced by a single wire, but 'replace all the parameters of a large complex NN with a single constant wire in a single step' is not an operation that SGD can do, so it won't. (What will it compute after it finally beats the regularizer and finds a set of parameters which will let SGD reduce its loss while still satisfying the regularization constraints? I'm not sure, but I bet it'd look like a nasty hash-like mess, which simply happens to be independent of its input on average.)

Ok, I see. Thanks for explaining!

One thing to note, which might be a technical quibble, is that I don't endorse the entropy version of this prior (which is the one that wants 50/50 activations). I started off with it because it's simpler, but I think it breaks for exactly the reasons you say, which is why I prefer the version that wants to see "Over the last N evaluations, each gate evaluated to T at least q times and to F at least q times, where q << N." This is very specifically so that there isn't a drive to unnaturally force the percentages towards 50% when the true input distribution is different from that.

Setting that aside: I think what this highlights is that the translation from "a prior over circuits" to "a regularizer for NN's" is pretty nontrivial, and things that are reasonably behaved in one space can be very bad in the other. If I'm sampling boolean circuits from a one-gate trace prior I just immediately find the solution of 'they're all dogs, so put a constant wire in'. Whereas with neural networks we can't jump straight to that solution and may end up doing more contrived things along the way.

Yeah, I skipped over that because I don't see how one would implement that. That doesn't sound very differentiable? Were you thinking of perhaps some sort of evolutionary approach with that as part of a fitness function? Even if you have some differentiable trick for that, it's easier to explain my objections concretely with 50%. But I don't have anything further to say about that at the moment.

Absolutely. You are messing around with weird machines and layers of interpreters, and simple security properties or simple translations go right out the window as soon as you have anything adversarial or optimization-related involved.

That would work, yeah. I was thinking of an approach based on making ad-hoc updates to the weights (beyond SGD), but an evolutionary approach would be much cleaner!

Not necessarily - depends on how abstractly we're considering behaviours. (It also depends on how likely we are to detect the bad behaviours during training.)

Consider an AI trained on addition problems that is only exposed to a few problems that look like 1+3=4, 3+7=10, 2+5=7, 2+6=8 during training, where there are two summands which are each a single digit and they appear in ascending order. Now at inference time the model is exposed to 10+2= and outputs 12.

Have we triggered a hidden behaviour that was never encountered in training? Certainly these inputs were never encountered, and there's maybe a meaningful difference in the new input, since it involves multiple digits and out-of-order summands. But it seems possible that exactly the same learned algorithm is being applied now as was being applied during the late stages of training, and so there won't be some new parts of the model being activated for the first time.

Deceptive behaviour might be a natural consequence of the successful learned algorithms when they are exposed to appropriate inputs, rather than different machinery that was never triggered during training.

Right. Maybe a better way to say it is:

The two together give a bit of a lever that I think we can use to bias away from deception if we can find the right operational notion of hidden behaviors.

This probably doesn't work, but have you thought about just using weight decay as a (partial) solution to this? In any sort of architecture with residual connections you should expect circuits to manifest as weights with nontrivial magnitude. If some set of weights isn't contributing to the loss then the gradients won't prevent them from being pushed toward zero by weight decay. Sort of a "use it or lose it" type thing. This seems a lot simpler and potentially more robust than other approaches.
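Here's a toy illustration of the "use it or lose it" dynamic, assuming plain SGD with weight decay on two scalar weights (the numbers are arbitrary): a weight the loss never touches decays geometrically, while a used weight settles near the value its gradients support.

```python
# Toy "use it or lose it": SGD with weight decay on two scalar weights.
lr, wd, steps = 0.1, 0.01, 10_000
w_used, w_unused = 1.0, 1.0
for _ in range(steps):
    grad_used = w_used - 1.0            # toy loss (w - 1)^2 / 2 keeps this weight in use
    w_used -= lr * (grad_used + wd * w_used)
    w_unused -= lr * (wd * w_unused)    # no loss gradient: decay is unopposed
# w_unused has decayed toward 0; w_used sits at the fixed point 1 / (1 + wd)
```

The used weight balances its gradient against the decay term and settles at $1/(1+\mathrm{wd}) \approx 0.99$, while the unused weight shrinks by a factor $(1 - \mathrm{lr}\cdot\mathrm{wd})$ every step.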