Sorted by New

# Wiki Contributions

Here's an argument that I think works in the limit where . For very large , each subset  of size  will occur many times as an . Moreover, for each set of coodinate values , there will be many  such that  and . Therefore, using the law of large numbers for any subset  of size  and any set of coordinate values , Bob can infer , i.e. the number of inputs  that agree with  at the coordinates indexed by  for which . Additionally, the random variable  contains no other information about  that is not contained in this set of statistics.

Given a function , let  denote the set of all functions  such that for each subset  of size  and each set of coordinate values ,  . The argument of the previous paragraph then implies that upon observing  Bob can deduce the equivalence class , but no other information about  from .

If we have a random variable  and we define another random variable  as a deterministic function , then it is easy to show that the set of distributions of  which maximize the mutual information between  and  are precisely those for which  has a uniform distribution. Applying this to the above setup, we see that the distributions over  which maximize the channel capacity are those for which  is uniformly distributed over all the values that it can take.

If we suppose in addition that  has the maximum entropy distribution subject to this constraint, we find that:

.

Intuitively, this says that the probability of a given function  is inversely proportional to the number of other functions  that have the same number of ones on every hyperplane defined by fixing a set of some  coordinates . This seems to correspond roughly to sensitivity: we intuitively expect there to be a lot of such functions  when the number of ones that  outputs on most hyperplanes is approximately half of the number of total points on that hyperplane, and saying that a function's output is approximately half ones and half zeros on a given hyperplane is roughly saying that  is sensitive to the remaining unspecified coordinates.

It's not obvious to me that the above expression for  is feasible to compute in practice, but I think it is fairly naturally interpretable.

The key reason the problem is tractable in the case  is that the law of large numbers means the probabilities wash out and we get an essentially deterministic problem. In the case where  is finite, this won't happen and in particular I expect you'll run into complications that arise from interaction terms where the method for combining the information from two observations  and  with  is not clean.

I expect you're more likely to end up with a tractable solution if you rephrase the problem statement slightly so that the subsets of inputs over which you're agregating outputs when observed (in the case above, these subsets are the hyperplanes defined by the coordinates ) are disjoint or overlapping. (For example, taking all the  to be random but equal would ensure this.) It strikes me that there might be a formulation like this that still captures the key features of the problem you're interested in, but I've not thought about this in detail.

It seems to me like there are a couple of different notions of “being able to distinguish between mechanisms” we might want to use:

1. There exists an efficient algorithm which when run on a model input  will output which mechanism model  uses when run on .
2. There exists an efficient algorithm which when run on a model will output another programme  such that when we run  on a model input  it will output (in reasonable time) which mechanism  uses when run on .

In general, being able to do (2) implies that we are able to do (1). It seems that in practice we’d like to be able to do (2), since then we can apply this to our predictive model and get an algorithm for anomaly detection in any particular case. (In contrast, the first statement gives us no guide on how to construct the relevant distinguishing algorithm.)

In your “prime detection” example, we can do (1) - using standard primality tests. However, we don’t know of a method for (2) that could be used to generate this particular (or any) solution to (1).

It’s not clear to me which notion you want to use at various points in your argument. In several places you talk about there not existing an efficient discriminator (i.e. (1)) - for example, as a requirement for interpretability - but I think in this case we’d really need (2) in order for these methods to be useful in general.

Thinking about what we expect to be true in the real world, I share your intuition that (1) is probably false in the fully general setting (but could possibly be true). That means we probably shouldn’t hope for a general solution to (2).

But, I also think that for us to have any chance of aligning a possibly-sensor-tampering AGI, we require (1) to be true in the sensor-tampering case. This is because if it were false, that would mean there’s no algorithm at all that can distinguish between actually-good and sensor-tampering outcomes, which would suggest that whether an AGI is aligned is undecidable in some sense. (This is similar to the first point Charlie makes.)

Since my intuition for why (2) is false in general mostly runs through my intuition that (1) is false in general, but I think (1) is true in the sensor-tampering case (or at least am inclined to focus on such worlds), I’m optimistic that there might be key differences between the sensor-tampering case and the general setting which can be exploited to provide a solution to (2) in the cases we care about. I’m less sure about what those differences should be.