Can we efficiently distinguish different mechanisms?

paulfchristiano

(This post is an elaboration on “tractability of discrimination” as introduced in section III of Can we efficiently explain model behaviors? For an overview of the general plan this fits into, see Mechanistic anomaly detection and Finding gliders in the game of life.)

Background

We’d like to build AI systems that take complex actions to protect humans and maximize option value. Powerful predictive models may play an important role in such AI, either as part of a model-based planning algorithm or as part of a learned policy. Either way, this introduces the risk that we will select actions predicted to lead to good-looking outcomes without being able to distinguish predicted outcomes that are actually good from predicted outcomes where our measurements were corrupted. If tampering with sensors is easier than actually achieving our goals then we may inadvertently build very powerful systems taking creative actions to corrupt our measurements. If we iteratively improve and harden our measurements, this can lead to AI systems that work well for a long time before abruptly and catastrophically disempowering humanity.

I consider this one of the conceptually cleanest alignment problems, and I expect similar dynamics to play a role in realistic alignment failures even if those failures aren’t this simple. ARC’s current work is focused on decisive solutions to this problem, though it looks like the same approach may also apply directly to identifying treacherous turns more generally.

Are distinct mechanisms enough?

ARC has been looking for training strategies that avoid this problem by leveraging only the fact that sensor tampering is “weird,” i.e. conceptually distinct from the normal mechanism giving rise to predictions of good-looking outcomes on the training distribution.

More specifically, at training time our model predicts coherent sensor readings because it predicts that sensors reflect coherent structure in the world. But if someone tampers with sensors to show a convincing fiction, then the predicted observations are coherent because the fiction was designed to look coherent. This suggests that different mechanisms are responsible for (actions that lead to good-looking outcomes for the normal reasons) and (actions that lead to good-looking outcomes via sensor tampering). If we are able to detect that difference by looking at the internal behavior of a predictor, then we may be able to use that to avoid sensor tampering.

It’s unclear if “distinct mechanisms” is a strong enough assumption to avoid sensor tampering. We hope that it is, and so we are trying to define formally what we mean by “distinct mechanisms” and show that it is possible to distinguish different mechanisms and that sensor tampering is always a distinct mechanism.

If that fails, we will need to solve sensor tampering by identify additional structure in the problem, beyond the fact that it involves distinct mechanisms.

Roadmap

In this post I want to explore this situation in a bit more detail. In particular, I will:

Describe what it might look like to have a pair of qualitatively distinct mechanisms that are intractable to distinguish.
Discuss the plausibility of that situation and some reasons to think it’s possible in theory.
Emphasize how problematic that situation would be for many existing approaches to alignment.
Discuss four candidates for ways to solve the sensor tampering problem even if we can’t distinguish different mechanisms in general.

Note that the existence of a pathological example of distinct-but–indistinguishable mechanisms may not be interesting to anyone other than theorists. And even for the theorists, it would still leave open many important questions of measuring and characterizing possible failures, designing algorithms that degrade gracefully even if they sometimes fail, and so on. But this is particularly important to ARC because our research is looking for worst-case solutions, and even exotic counterexamples are extremely valuable for that search.

1. What might indistinguishable mechanisms look like?

Probabilistic primality tests

The best example I currently have of a “hard case” for distinguishing mechanisms comes from probabilistic primality tests. In this section I’ll explore that example to help build intuition for what it would look like to be unable to recognize sensor tampering.

The Fermat primality test is designed to recognize whether an integer n is prime. It works as follows:

Pick a random integer a < n.
Compute a^n mod n. This can be done in time polylog(n) via iterated squaring.
Output “pass” if a^n = a (mod n). A prime number always passes.

In almost all cases where this test passes, n is prime. And you can eliminate most false positives by just trying a second random value of a. But there are a few cases (“Carmichael numbers”) for which this test passes for most (and in fact all) values of a.

Primes and Carmichael numbers both pass the Fermat test. This turns out to be equivalent to saying that “For all primes p dividing n, (p-1) divides (n-1).” For primes this happens because n is a prime and so there is only one prime divisor p and p-1 = n-1. For Carmichael numbers it instead happens because (p-1) and (n-1) are both highly divisible and a bunch of favorable coincidences occur. We can think of this as building a test that’s supposed to detect factors of n, and then there happens to be a ton of cancellation so that we don’t see anything.

Suppose that we observed the Fermat testing passing in a bunch of random cases and formed the abstraction of “prime” to explain that phenomenon. Then we try to use AI to find a large prime that satisfies some additional property P. If we do this by using the Fermat test as our definition of primality, we may end up getting a Carmichael number, if the property P is rare amongst primes but common amongst composites.

In reality there are probabilistic primality tests that don’t have false positives. But we can imagine a world where gradient descent is able to find the Fermat primality test without us having any algorithm that can learn one of those stronger tests. We’ll revisit the plausibility of this situation in section 2.

The analogy

An analogous situation could make it hard to detect sensor tampering:

We have sensors that are often predicted to show smiling human faces (= the Fermat test often passes for random numbers n)
We introduce the abstraction of “happy human” to explain those correlated observations (= we introduce the abstraction of ”n is prime” to explain the Fermat test passing)
The model also predicts smiling human faces if an action would tamper with sensors (= the Fermat test also passes if n is a Carmichael number)
We use AI to select for tractable actions that lead to predictions of happy human faces and end up with sensor tampering (= we select for Fermat test passing with an additional constraint P and end up with Carmichael numbers)

This is a problem if the best way to act competently is to train a sophisticated predictor and then use it for planning, and if all sophisticated predictors have this unfortunate mechanism-indistinguishability property. The problem can materialize either when we build a model-based algorithm that plans against such a predictor; or when SGD learns a neural network internally which internally selects actions based on a similar predictor.

It’s worth emphasizing that this feels like an extremely weird situation. In some sense the predictor “doesn’t know” whether there are really smiling humans or whether the sensors are tampered with, and couldn’t answer that question honestly even if it wanted to. You could imagine calling this “not an alignment failure” (or at least not an ELK failure) depending on how you define blurry words like “knowledge” and “intent.” But at the end of the day we would still end up with a competent AI which produced creative and competent actions that disempower humanity, and I think we should try to avoid this failure whatever you call it.

If we had a clear definition of “knowledge” then I could imagine drawing a bright line and saying this isn’t an alignment failure, in the same way I would if e.g. my AI started a nuclear war in the same way that a similarly-competent human would. But I think you should be very hesitant to do that while “knowledge” is such a vague concept, because I think it facilitates sweeping all failures into the gaps that we don’t understand. Instead our priority should be to understand and ideally rule out this strange potential situation where the AI creatively and competently kills us all without even knowing that it does so.

2. Are distinct mechanisms efficiently distinguishable?

ARC’s current “plan A” is that any pair of sufficiently distinct mechanisms can be efficiently distinguished. This would be really convenient for a bunch of reasons and it looks plausible, so we’ll keep investigating it until we feel we’ve ruled it out.

In particular, right now I don’t have even a single example of a function f such that (i) there are two clearly distinct mechanisms that can lead to f(x) = 1 for any particular input x, (ii) there is no known efficient discriminator for distinguishing those mechanisms for a given input x. I would really love to have such examples.

That said, there are still two big reasons that I’m skeptical about the conjecture that distinct mechanisms are always distinguishable: (a) it’s a really striking claim for which a failure to find counterexamples isn’t very strong evidence, and (b) in known examples like primality testing it still seems easy to imagine the situation where we can find the mechanism but not the discriminator, i.e. we haven’t yet found an automated way to learn a discriminator.

Overall if I had to guess I’d say maybe a 20% chance that there is a formal version of “all distinct mechanisms are distinguishable” which is true and sufficient to rule out sensor tampering. This is still high enough that it’s a significant priority for me until ruled out.

A. This is a striking claim and judging counterexamples is hard

Any universally-quantified statement about circuits is pretty striking — it would have implications for number theory, dynamical systems, neural nets, etc. It’s also pretty different from anything I’ve seen before. So the odds are against it.

One piece of evidence in favor is that it’s at least plausible: it’s kind of weird for a circuit to have a hidden latent structure that can have an effect on its behavior without being detectable.

Unfortunately there are plenty of examples of interesting mathematical circuits (e.g. primality tests) that reveal the presence of some latent structure (e.g. a factorization) without making it explicit. Another example I find interesting is a determinant calculation revealing the presence of a matching without making that matching explicit. These examples undermine the intuition that latent structure can’t have an effect on model behavior while remaining fully implicit.

That said, I don’t know of examples where the latent structure isn’t distinguishable. Probabilistic primality testing comes closest, but there are in fact good primality tests. So this gives us a second piece of evidence for the conjecture.

Unfortunately, the strength of this evidence is limited not only by the general difficulty of finding counterexamples but also by the difficulty of saying what we mean by “distinct mechanisms.” If we could really precisely state a theorem then I think we’d have a better chance of finding an example if one exists, but as it stands it’s hard for anyone to engage with this question without spending a lot of time thinking about a bunch of vague philosophy (and even then we are at risk of gerrymandering categories to avoid engaging with an example).

B. Automatically finding a good probabilistic primality test seems hard

The Fermat test can pass either from primes or Carmichael numbers. It turns out there are other tests that can distinguish those cases, but it’s easy to imagine learning the Fermat test without being able to find any of those other superior tests.

To illustrate, let’s consider two examples of better tests:

Rabin-Miller: If a^(n-1) = 1 (mod n), we can also check a^(n-1)/2. This must be a square root of 1, and if n is prime it will be either +1 or -1. If we get +1, then we can keep dividing by 2, considering a^(n-1)/4 and so on. If n is composite then 1 has a lot of square roots other than +1 and -1, and it’s easy to prove that with reasonably high probability one of them will appear in this process.
Randomized AKS: If n is prime and X is an indeterminate, then (a+X)^n = (a^n + X^n) = (a + X^n) mod n. This condition is hard to evaluate, but if we arbitrarily define X^r = 1 for some small number r then we can compute (a + X)^n mod n by iterated squaring in time O(r log n). If n is composite, it turns out there is a high probability that (a+X)^n != (a + X^n) mod n.

While these tests are structurally similar to the Fermat test, there’s no obvious way to automatically transform the Fermat test into either of these stronger formats. And so if we had learned the Fermat test, it’s not clear we’d have any way to find one of the stronger tests without learning them. Moreover, while these tests are somewhat simple, they are more complex than the Fermat test, and so this learning process might be much harder than the problem of learning the Fermat test itself.

To avoid this problem we’d like to specify a way to represent these stronger tests that uses fewer (additional) parameters than the Fermat test itself. As a stylized example, you could imagine specifying a stronger test by pointing to particular parts of the Fermat test and saying “randomize these parts.” As long as the number of parameters describing “which parts to randomize” was smaller than the number of parameters in the Fermat test itself, we’d be in business.

While there’s no obvious representation like that, it seems plausible to me that we could find, particularly for the AKS test. This would be very suggestive that we could find a stronger test as quickly as we can find the Fermat test, but wouldn’t be totally decisive since e.g. gradient descent could have an easier time with one learning problem than the other (and that will depend on more details).

Of course it’s particularly plausible that you could have the Fermat test but no robust test because that was humanity’s situation for many years! An automated strategy for extending the Fermat test to distinguish primes from Carmichael numbers would appear to teach us something fundamental about primality testing, that wasn’t known to humanity until the 70s, without making reference to any facts about numbers. So that seems like a tall order.

I’m focused on the primality testing example in large part because it’s the case where the situation seems most hopeless. If we could find one in this case I would become meaningfully more optimistic about the overall conjecture.

3. Being unable to distinguish mechanisms is bad news

If it’s impossible to distinguish distinct mechanisms, then that rules out approaches to ELK based on mechanistic anomaly detection alone. But the situation seems much worse than that, and in fact it seems like any approach to alignment is going to need to exploit some additional property of sensor tampering to rule it out.

In particular:

The kinds of approaches discussed in Eliciting latent knowledge are complete non-starters. All those approaches try to define a loss function so that the strategy “answer questions honestly” gets a low loss. But if you can’t learn to recognize sensor tampering then it doesn’t matter how low a loss you’d get by answering questions honestly, gradient descent simply can’t learn how to do it. Analogously, if there’s no simple and efficient primality test, then it doesn’t matter whether you have a loss function which would incentivize primality testing, you’re not going to be able to do it.
Avoiding sensor tampering by interpretability runs into exactly the same problem. If there’s no efficient algorithm for recognizing sensor tampering, then no matter in what sense you “understand” what the model is doing, you still can’t tell whether it thinks an action is good because of sensor tampering or because of doing the task as intended. Interpretability helps us deal with the case where we have no loss function to incentivize the discriminator we need, but it won’t help if there simply doesn’t exist any efficient discriminator.
Approaches like debate or amplification are unable to help if the best way to plan is to use a learned model for which discrimination is impossible. When we try to apply these methods we will use something like imitative generalization, performing a search over augmented-human-legible hypotheses. But in fact the best hypotheses look like opaque predictors, and so we are back to trying and failing to solve ELK for those hypotheses. This is discussed a bit more here in the ELK doc and was the motivation for focusing on ELK. On the bright side, when these approaches fail it may simply lead to uncompetitive systems rather than leading to sensor tampering, even in the worst case. But on the downside, sufficiently large competitiveness problems may be catastrophic without coordination.
Recursive reward modeling can solve this problem (where debate can’t) only if a human and AI team can harden sensors against manipulation, including in the case where the “attacker” can use sophisticated predictors for which discrimination is hard but “defenders” can’t. I think this is very plausible, but probably does require substantive additional empirical assumptions; we’ll discuss the approach of “just make sensor tampering impossible” in section 4. Of course this is particularly plausible if we can apply AI systems in ways that don’t give them rich opportunities for sensor tampering, and in general the efficacy of approaches like RRM or debate depends a lot on the political and social environment in which AI systems are deployed.
Some work on AI alignment focuses on understanding the inductive biases of SGD, how training shapes values, on the role of “habits” and non-maximizing behavior, on analogies with humans, and so on. But if planning against a sophisticated predictive model is an effective strategy for decision-making then ultimately every one of these approaches must confront the same issue: either by building systems that explicitly use planning against a learned model, or by training systems that need to use this strategy internally. Empirical work mostly seems helpful for grappling with indistinguishable mechanisms insofar as empirical work can help us notice this problem in advance if it’s a real problem.

Overall my sense is that if distinct mechanisms can be indistinguishable, then any indefinitely scalable solution to alignment will need to use additional properties of sensor tampering or assumptions about the structure of optimal decision-making. That said, even in relatively pessimistic worlds there may be a long period of time (potentially extending to fairly superhuman systems) for which these issues are surmountable.

4. Approaches to sensor tampering assuming indistinguishable mechanisms

In this section we’ll discuss three possible approaches to sensor tampering:

Harden sensors such that tampering with them is harder than achieving the intended task.
Harden sensors enough that you can’t tamper with them without “trying,” and then argue that if you are “trying” to tamper with sensors then you can discriminate.
Try to characterize some other sense in which models “don’t know what’s going on” when mechanisms are indistinguishable. Design algorithms that work as long as the AI “knows what’s going on” and then validate that assumption.

Our main hope is to find a way to efficiently distinguish distinct mechanisms. But it still seems worth fleshing out these alternatives.

A. Harden sensors so that tampering is harder than the intended task

We could design and deploy a lot of redundant sensors, and do science and red-teaming to understand the possible principles by which those sensors can be compromised. In the modern world it’s kind of insane to imagine a human rebellion that not only succeeded but left no trace of itself.

Some dynamics make this seem more plausible, at least as a way to avoid the particular obstruction in this post (and therefore open up a much broader universe of “outs”):

We can use AI systems to help us build sensors and understand their potential failures. There are potential difficulties in training such systems, but there are a lot of promising options that haven’t been very thoroughly explored and ultimately the obstruction in this post doesn’t speak directly to the difficulty of that problem. What matters becomes not the gap between humans and AI systems, but between “AI systems using opaque predictors for which discrimination is impossible” and “all the other AI.”
To address the difficulty in this post we don’t need to build a sensor that reliably tells us if a robot rebellion is occurring, we just need to be able to arbitrate a debate about whether a particular set of recorded actions and observations is potentially the result of sensor tampering, and we can use a very conservative standard. That means we get to comb over observations for evidence of anything avoidably funny.
Many plans for tampering with sensors are only scary because they involve creating external agents and delegating the sensor-tampering task to them. But we can closely supervise actions and look for evidence of this kind of delegation. We can also ask friendly systems to perform similar delegation to defend sensors (again modulo complicated questions about how to train such systems).

On the other hand, there are two big reasons we’re scared of this kind of approach:

It seems highly sensitive to how your predictive models generalize. If a “sensor reading” is simply whatever ultimately ends up in a dataset in a datacenter somewhere, then any successful robot rebellion would constitute impossible-to-detect sensor tampering since it could ultimately erase any evidence recorded by sensors. In practice we expect even deeper forms of underdetermination than “what is the definition of a sensor?” and so we would strongly prefer to have approaches (like mechanistic anomaly detection) that don’t require making strong structural assumptions of this type.
It is really scary to imagine a world where in a strong sense all the most capable AI systems “want” to tamper with sensors and are prevented only because the balance of physical power favors the defender. Such a world feels highly unstable and vulnerable to mistaken assumptions or external shocks. This vague unease could be defused by thinking very carefully through the actual risks, but it’s not a direction that is intuitively appealing to us.

B. Detect sensor tampering that requires “trying”

So far we’ve talked about training a sophisticated predictive model and then imagined an exhaustive or random planner who searches against it.

While it seems hard to make sensors so robust that a powerful AI cannot tamper with them, it seems much easier to make sensors so robust that you can’t tamper with them by randomly generating actions.

In that case, even if we have a sophisticated predictor which doesn’t distinguish sensor tampering from the intended behavior, we don’t have trouble unless there is also an algorithm which proposes actions that have an elevated risk of leading to sensor tampering.

Of course any policy which proposes coherent actions would tend to elevate the probability of successful sensor tampering alongside the probability of successfully doing the task. But we may hope to harden the sensor far enough that this kind of generic prior over competent actions still won’t lead to sensor tampering unless it is specifically driving at that goal.

So this suggests a two step plan:

Harden sensors enough that in some sense you can’t “accidentally” tamper with them.
Argue that the proposer can distinguish actions that are “deliberately” tampering from actions that are trying to do the task as intended, even if the predictor cannot.

Both of these steps seem challenging. Conceptually the second step seems like the main blocker — in cases where a predictor can predict success without distinguishing mechanisms, it also seems possible for a proposer to propose successful plans without distinguishing mechanisms. But it’s not at all a trivial link, and e.g. the primality testing example can’t be easily adapted into this format suggesting that it might be a fundamentally different step.

One reason you might think that this approach is doomed is that there is always a simple action that tampers with the sensors, namely: build an external agent trying to tamper with the sensors. This gives an absolute lower bound on the probability of accidentally tampering with the sensors no matter how much they are hardened.

But if the external agent itself must do something to deliberately tamper, then that still means that there exists an efficient procedure for detecting the sensor tampering and leaves us with a possible out — it seems plausible that this approach can be extended to work for external agents if and only if it works in cases with no external agents. It generally seems hard to write off possible approaches based on counterexamples that involve creating external agents, since those counterexamples can often be handled by separate tactics (we do ultimately need to confront such counterexamples after handling the “base case,” but it seems like a mistake to write off approaches too early based on this difficulty).

C. Assume that your AI “knows what’s going on”

It would be very strange if (human faces) → (sensors) was analogous to (primality) → (Fermat test), because it suggests that our AI’s grip on the concept of “human faces” is extremely tenuous. Such a model could be reasonably said to “not know” whether there are really humans in the room or if sensors have been tampered with.

In section B we argued that maybe this meant the AI didn’t pose a threat, because tampering with sensors is only possible if you actually understand what you are doing.

But even if that hope doesn’t work out, we could still try to:

Formalize “knowledge” such that the AI systems we are worried about “know” that they are disempowering humanity.
Design algorithms that detect sensor tampering that our AI “knows” about.
Verify that our AI understands the situation well enough to “know” if it is tampering with sensors, either empirically or by making some additional argument about sensor tampering.

As a vague illustration, we could try to argue that the AI systems we are worried about have a certain kind of mechanical understanding of the systems they are predicting. We could show that this kind of mechanical understanding is inconsistent with primality-test-like cases (which are possibly only because “is prime” lacks interesting internal structure and is more like a raw fact about the world), and then we could empirically monitor whether our AI systems actually have this kind of rich understanding.

As another illustration, we could try to relate “knowledge” directly to the ability to answer certain kinds of questions, including honestly answering questions about what’s happening in the world, and then carefully monitor how hard it is for our AI to answer those questions so that we’d notice if it became hard.

This is more of a vague catch-all than either approach A or B, but I think it deserves attention because the situation of an AI tampering without knowing it feels extremely weird. So there may be room to make sense of that situation and show how it can’t happen.

Conclusion

It may be possible for a behavior to arise via two very different mechanisms without there being any efficient way to distinguish those mechanisms. That would be a fatal problem for mechanistic anomaly detection, but it would also probably mean that any approach to AI alignment will need to exploit some additional structure about sensor tampering in order to avoid it. This may take the form of clarifying the intuition that problematic AI systems surely “know” they are tampering with sensors; it may also be a serious obstacle to solving alignment under a clearly-articulated set of assumptions rather than relying on messy empirical contingencies.

Can you clarify what you mean by this, especially (i)?

In particular, right now I don’t have even a single example of a function f such that (i) there are two clearly distinct mechanisms that can lead to f(x) = 1, (ii) there is no known efficient discriminator for distinguishing those mechanisms. I would really love to have such examples.

In particular, do you mean f(x)=1 is true for all input x, or just some particular x, etc?

Yeah this was super unclear to me; I think it's worth updating the OP.

It means f(x) = 1 is true for some particular x's, e.g., f(x_1) = 1 and f(x_2) = 1, there are distinct mechanisms for why f(x_1) = 1 compared to why f(x_2) = 1, and there's no efficient discriminator that can take two instances f(x_1) = 1 and f(x_2) = 1 and tell you whether they are due to the same mechanism or not.

Family's coming over, so I'm going to leave off writing this comment even though there are some obvious hooks in it that I'd love to come back to later.

If the AI can't practically distinguish mechanisms for good vs. bad behavior even in principle, why can the human distinguish them? If the human cant distinguish them, why do we think the human is asking for a coherent thing? If the human isn't asking for a coherent thing, we don't have to smash our heads against that brick wall, we can implement "what to do when the human asks for an incoherent thing" contingency plans.
- (e.g. have multiple ways of connecting abstract models that have this incoherent thing as a basic labeled cog to other more grounded ways of modeling the world, and do some conservative averaging over those models to get a concept that tends to locally behave like humans expect the incoherent thing to behave on everyday cases.)
- Our big advantage over philosophy is getting to give up when we've asked for something impossible, and ask for something possible. That said, it's premature to declare any of the problems mentioned here impossible - but it means that case analysis type reasoning where we go "but what if it's impossible?" seems like it should be followed with "here's what we might try to do instead of the impossible thing."
It seems likely that there are concepts that aren't guaranteed to be learned by arbitrary AIs, even arbitrary AIs from the distribution of AI designs humans would consider absent alignment concerns, but still can be taught to an AI that you get to design and build from the ground up.
- So checking whether an arbitrary AI "knows a thing" is impressive, but to me seems impressive in a kind of tying-one-hand-behind-your-back way.
- Is getting to build the AI really more powerful? It doesn't seem to literally work with worst-case reasoning. Maybe I should try to find a formalization of this in terms of non-worst case reasoning.
  - The guarantees I'm used to rely on there being some edge that the thing we want to teach has. But as in this post, maybe the edge is inefficient to find. Also, maybe there is no edge on a predictive objective, and we need to lay out some kind of feedback scheme from humans, which has problems that are more like philosophy than learning theory.
The "assume the AI knows what's going on" test seems shaky in the real world. If we ask an AI to learn an incoherent concept, it will likely still learn some concept based on the training data that helps it get better test scores.

The first intuition pump that comes to mind for distinguishing mechanisms is examining how my brain generates and assigns credence to the hypothesis that something going wrong with my car is a sensor malfunction vs telling me about a problem in the world that the sensor exists to alert me to.

One thing that happens is that the broken sensor implies a much larger space of worlds because it can vary arbitrarily instead of only in tight informational coupling with the underlying physical system. So fluctuations outside the historical behavior of the sensor either implies I'm in some sort of weird environment or that the sensor is varying with something besides what it is supposed to measure, a hidden variable if coherent or noisy if random. So the detection is tied to why it is desirable to goodhart the sensor in the first place, more option value by allowing consistency with a broader range of worlds. By the same token, the hypothesis "the sensor is broken" should be harder to falsify since the hypothesis is consistent with lots of data? The first thing it occurs to me to do is supply a controlled input to see if I get a controlled output (see: calibrating a scale by using a known weight). This suggests that complex sensors that couple with the environment along more dimensions are harder to fool, though any data bottlenecks that are passed through reduce this i.e. the human reviewing things is themselves using a learnable simple routine that exhibits low coupling.

The next intuition pump, imagine there are two mechanics. One makes a lot of money from replacing sensors, they're fast at it and get the sensors for a discount by buying in bulk. The second mechanic makes a lot of money by doing a lot of really complicated testing and work. They work on fewer cars but the revenue per car is high. Each is unscrupulous and will lie that your problem is the one they are good at fixing. I try to imagine the sorts of things they would tell me to convince me the problem is really the sensor vs the problem is really out in the world. This even suggests a three player game that might generate additional ideas.

Curated! This is a good description of a self-contained problem for a general class of algorithms that aim to train aligned and useful ML systems, and you've put a bunch of work put into explaining reasons why it may be hard, with a clear and well-defined example for conveying the problem (i.e. that Carmichael numbers fool Fermi's Primality Test).

The fun bit for me is talking about how if this problem goes one way (where we cannot efficiently distinguish different mechanisms) this invalidates many prior ideas, and if it doesn't then we can be more optimistic that we're close to a good alignment algorithm, but you're honestly not sure! (You give it a 20% chance of success.) And you also go through a list of next-steps if it doesn't work out. Great contribution.

I am tempted to say something about how the writing seems to me much clearer than previous years of your writing, but I think this is also in part due to me (a) understanding what you are trying to do better and (b) having stronger basic intuitions for thinking about machine learning models. Still, I think the writing is notably clearer, which is another reason to curate.

It seems to me like there are a couple of different notions of “being able to distinguish between mechanisms” we might want to use:

There exists an efficient algorithm which when run on a model input will output which mechanism model $M$ uses when run on $x$ .
There exists an efficient algorithm which when run on a model $M$ will output another programme $P$ such that when we run $P$ on a model input $x$ it will output (in reasonable time) which mechanism $M$ uses when run on $x$ .

In general, being able to do (2) implies that we are able to do (1). It seems that in practice we’d like to be able to do (2), since then we can apply this to our predictive model and get an algorithm for anomaly detection in any particular case. (In contrast, the first statement gives us no guide on how to construct the relevant distinguishing algorithm.)

In your “prime detection” example, we can do (1) - using standard primality tests. However, we don’t know of a method for (2) that could be used to generate this particular (or any) solution to (1).

It’s not clear to me which notion you want to use at various points in your argument. In several places you talk about there not existing an efficient discriminator (i.e. (1)) - for example, as a requirement for interpretability - but I think in this case we’d really need (2) in order for these methods to be useful in general.

Thinking about what we expect to be true in the real world, I share your intuition that (1) is probably false in the fully general setting (but could possibly be true). That means we probably shouldn’t hope for a general solution to (2).

But, I also think that for us to have any chance of aligning a possibly-sensor-tampering AGI, we require (1) to be true in the sensor-tampering case. This is because if it were false, that would mean there’s no algorithm at all that can distinguish between actually-good and sensor-tampering outcomes, which would suggest that whether an AGI is aligned is undecidable in some sense. (This is similar to the first point Charlie makes.)

Since my intuition for why (2) is false in general mostly runs through my intuition that (1) is false in general, but I think (1) is true in the sensor-tampering case (or at least am inclined to focus on such worlds), I’m optimistic that there might be key differences between the sensor-tampering case and the general setting which can be exploited to provide a solution to (2) in the cases we care about. I’m less sure about what those differences should be.

Your overall picture sounds pretty similar to mine. A few differences.

I don't think the literal version of (2) is plausible. For example, consider an obfuscated circuit.
The reason that's OK is that finding the de-obfuscation is just as easy as finding the obfuscated circuit, so if gradient descent can do one it can do the other. So I'm really interested in some modified version of (2), call it (2'). This is roughly like adding an advice string as input to P, with the requirement that the advice string is no harder to learn than M itself (though this isn't exactly right).
I mostly care about (1) because the difficulty of (1) seems like the main reason to think that (2') is impossible. Conversely, if we understood why (1) was always possible it would likely give some hints for how to do (2). And generally working on an easier warm-up is often good.
I think that if (1) is false in general, we should be able to find some example of it. So that's a fairly high priority for me, given that this is a crucial question for the feasibility or character of the worst-case approach.
That said, I'm also still worried about the leap from (1) to (2'), and as mentioned in my other comment I'm very interested in finding a way to solve the harder problem in the case of primality testing.

Right now I'm trying to either:

Find another good example of a model behavior with two distinct but indistinguishable mechanisms.
Find an automatic way to extend the Fermat test into a correct primality test. Slightly more formally, I'd like to have a program which turns (model, explanation of mechanism, advice) --> (mechanism distinguisher), where the advice is shorter than the model+explanation, and where running it on the Fermat test with appropriate advice gives you a proper primality test.
Identify a crisp sense in which primes vs Carmichael numbers aren't really distinct mechanisms (but sensor tampering would be). If this happens then maybe the only reason the Fermat test feels like a good example is that I'm mischaracterizing it.

An interesting paper on successfully distinguishing different mechanisms inside image classification models: https://arxiv.org/pdf/2211.08422.pdf — for this small model they correspond to different, disconnected local minimal of the loss function (I assume basically because it only has enough capacity to implement one strategy really well, so it has to pick one). They even outline approaches to move models from one mechanism that doesn't generalize well to another that does.

I don't immediately see how to extend this to the sort of different mechanisms that Paul was discussing, but it feels like it might be relevant; albeit the mechanisms might be a lot less clearly separable on something as complex and multi-task-capable as an AGI, which might well need to learn multiple capabilities (possibly including deceit) and then have a way of deciding which one to apply in a particular case.

One thing that is pretty clear is that an honest mechanism and a deceitful mechanism are going to have very different latent knowledge inside them: "how to I keep the diamond safe?" and "how do I tamper with the sensors so the diamond looks safe?" are very different problems. They're also potentially of different difficulty levels, which might have a big effect on which one gradient descent, or indeed smart AGI optimization, is going to find a solution to first. If our sensors were hardened enough to make fooling them really difficult, that might make finding a passable (and improvable) approach to vault safety much easier than to fooling the humans, at least for gradient descent. Of course, while gradient descent generally stays in whatever local minimum it found first, and AGI doing optimization probably doesn't have that limitation, and could decide to switch strategies. On the other hand, the strategy "don't do any work other than fooling the humans" generalizes really well to many different problems.

However, I still feel that this approach to AGI safety is like trying to build barriers between yourself and something malicious and very smart, and you're a lot better off if the system doesn't have anything malicious in it to start off with. So, I'm a lot more optimistic about an AGI that's a value learner, can figure out that we don't want to be shown deceitful images (not a very hard problem in human values), and then not do that because it knows that's not what we want.

A. Harden sensors so that tampering is harder than the intended task
We could design and deploy a lot of redundant sensors, and do science and red-teaming to understand the possible principles by which those sensors can be compromised. In the modern world it’s kind of insane to imagine a human rebellion that not only succeeded but left no trace of itself.

Suppose the diamond in the room we're monitoring contained a security tag, powered by incoming light, with a cryptographic secret in it, and an interface that allows us to continuously do a zero-knowledge identity proof on it via a laser optical interface, using a laser pointed into the room as input and the (high-speed) camera as output. So to successfully tamper with the camera, the AI would need access to the secret inside the security tag inside the diamond, or at least to reroute the channel to-and-from it. So it would need to be able to steal the diamond, extract the security tag, and reverse engineer its cryptographic secret out of it in a time shorter then the period between camera frames, or similarly replace the diamond with an optically-similar replacement now containing the security tag in the room, or integrate the security tag and its input-output channel into its camera-spoofing system, in each case in the time between frames. None of these are actually physically impossible, but they could be made extremely hard to do in a small fraction of a second (say we're using a really-high speed camera with basically no temporal gaps between frames).

The basic idea here is that cryptographic techniques may be useful when hardening things. Whether this approach could usefully be extended to anything significantly more AGI-alignment-like than a diamond-in-a-vault scenario is unclear.

I think there's a sense in which the Fermat test is a capability problem, not an interpretability/alignment problem.

It's basically isomorphic to a situation in which sensor tampering is done via a method that never shows up in the AI's training data. E. g., suppose it's done via "etheric interference", which we don't know about, and which never fails and therefore never leads to any discrepancies in the data so the AI can't learn it via SSL either, etc. Then the AI just... can't learn about it, period. It's not that it can, in theory, pick up on it, but instead throws out that data and rolls both "no tampering" and "etheric tampering" into the same mechanistic explanation. It's just given no data on it to begin with. If etheric tampering ever fails, in a way that impacts the visible data, only then the AI can notice it and internally represent it.

Same for the Fermat primality test. If, during the training, we feed the AI a dataset of numbers that happens not to include any Carmichael numbers, it's basically equivalent to teaching the AI on a dataset that includes no tampering of some specific type. If then, at runtime, a Carmichael number shows up (someone tampers with the sensors in that OOD way), the AI just fails, because it hasn't been given the necessary training data not to fail.

So, yeah, if our sensors are going to be interfered-with by magic/aliens/supercriminals, in novel ways we don't know about and which don't show up on our training data, our AI won't be able to foresee that either. But that's a pure capability problem, solved via capability methods (catch them in the act and train the AI on the new data).

Edit: And if our threat model is the AI itself doing the tampering — well, it can hardly do it via a method it doesn't know about, can it? And if it's generating ~random actions and accidentally tampers with the sensors in a way it didn't intend, that also seems like a capability problem that'll be solved by routine capability work.

Also, in this case the AI's generate-random-actions habits probably result in many more unintended side-effects than just sensor tampering, so its error should be easily noticed and corrected. The corner case is where there's some domain of actions we don't know about and the AI doesn't know about, whose only side-effects results in sensor tampering. But that seems very unlikely and— again, that problem has nothing to do with the AI's decision-making process, alignment, or lack thereof, and everything to do with no-one in the situation (AI or human) knowing how the world works.

The thing I'm concerned about is: the AI can predict that Carmichael numbers look prime (indeed it simply runs the Fermat test on each number). So it can generate lots of random candidate actions (or search through actions) until it finds one that looks prime.

Similarly, your AI can consider lots of actions until it finds one that it predicts will look great, then execute that one. So you get sensor tampering.

I'm not worried about cases like the etheric interference, because the AI won't select actions that exploit etheric interference (since it can't predict that a given action will lead to sensor tampering via etheric interference). I'm only worried about cases where the prediction of successful sensor tampering comes from the same laws / reasoning strategies that the AI learned to make predictions on the training distribution (either it's seen etheric interference, or it e.g. learned a model of physics that correctly predicts the possibility of etheric interference).

So the concern is that "the AI generates a random number, sees that it passes the Fermat test, and outputs it" is the same as "the AI generates a random action, sees that it passes [some completely opaque test that approves any action that either includes no tampering OR includes etheric interference], and outputs it", right?

Yeah, in that case, the only viable way to handle this is to get something into the system that can distinguish between no tampering and etheric interference. Just like the only way to train an AI to distinguish primes from Carmichael numbers is to find a way to... distinguish them.

Okay, that's literally tautological. I'm not sure this problem has any internal structure that makes it possible to engage with further, then. I guess I can link the Gooder Regulator Theorem, which seems to formalize the "to get a model that learns to distinguish between two underlying system-states, we need a test that can distinguish between two underlying system-states".

So the concern is that "the AI generates a random number, sees that it passes the Fermat test, and outputs it" is the same as "the AI generates a random action, sees that it passes [some completely opaque test that approves any action that either includes no tampering OR includes etheric interference], and outputs it", right?

Mostly--the opaque test is something like an obfuscated physics simulation, and so it tells you if things look good. So you try a bunch of random actions until you get one where things look good. But if you can't understand the simulation, or the mechanics of the sensor tampering, then there's not much to do after that so it seems like we're in trouble.

Okay, that's literally tautological. I'm not sure this problem has any internal structure that makes it possible to engage with further, then.

It seems like there are plenty of hopes:

I have a single lame example of the phenomenon, which feels extremely unlike sensor tampering. We could get more examples of hard-to-distinguish mechanisms, and understand whether sensor tampering could work this way or if there is some property that means it's OK.
In fact it is possible to distinguish Carmichael numbers and primes, and moreover we have a clean dataset that we can use to specify which one we want. So in this case there is enough internal structure to do the distinguishing, and the problem is just that the distinguisher is too complex and it's not clear how to find it automatically. We could work on that.
We could try to argue that without a model of sensor tampering an AI has limited ability to make it happen, so we can just harden sensor enough that it can't happen by chance and then declare victory.

More generally, I'm not happy to give up because "in this situation there's nothing we can do," I want to understand whether the bad situation is plausible, and if it is plausible then how you can measure to fee it is' happening, and how to formalize the kind of assumptions that we'd need to make the problem soluble.

the opaque test is something like an obfuscated physics simulation

I think it'd need to be something weirder than just a physics simulation, to reach the necessary level of obfuscation. Like an interwoven array of highly-specialized heuristics and physical models which blend together in a truly incomprehensible way, and which itself can't tell whether there's etheric interference involved or not. The way Fermat's test can't tell a Carmichael number from a prime — it just doesn't interact with the input number in a way that'd reveal the difference between their internal structures.

By analogy, we'd need some "simulation" which doesn't interact with the sensory input in a way that can reveal a structural difference between the presence of a specific type of tampering and the absence of any tampering at all (while still detecting many other types of tampering). Otherwise, we'd have to be able to detect undesirable behavior, with sufficiently advanced interpretability tools. Inasmuch as physical simulations spin out causal models of events, they wouldn't fit the bill.

It's a really weird image, and it seems like it ought to be impossible for any complex real-life scenarios. Maybe it's provably impossible, i. e. we can mathematically prove that any model of the world with the necessary capabilities would have distinguishable states for "no interference" and "yes interference".

Models of world-models is a research direction I'm currently very interested in, so hopefully we can just rule that scenario out, eventually.

It seems like there are plenty of hopes

Oh, I agree. I'm just saying that there doesn't seem to be any other approaches aside from "figure out whether this sort of worst case is even possible, and under what circumstances" and "figure out how to distinguish bad states from good states at the object-level, for whatever concrete task you're training the AI".

I definitely agree that this sounds like a really bizarre sort of model and it seems like we should be able to rule it out one way or another. If we can't then it suggests a different source of misalignment from the kind of thing I normally worry about.

Can you clarify what you mean by this, especially (i)?

In particular, right now I don’t have even a single example of a function f such that (i) there are two clearly distinct mechanisms that can lead to f(x) = 1, (ii) there is no known efficient discriminator for distinguishing those mechanisms. I would really love to have such examples.

In particular, do you mean f(x)=1 is true for all input x, or just some particular x, etc?

Yeah this was super unclear to me; I think it's worth updating the OP.

Family's coming over, so I'm going to leave off writing this comment even though there are some obvious hooks in it that I'd love to come back to later.

If the AI can't practically distinguish mechanisms for good vs. bad behavior even in principle, why can the human distinguish them? If the human cant distinguish them, why do we think the human is asking for a coherent thing? If the human isn't asking for a coherent thing, we don't have to smash our heads against that brick wall, we can implement "what to do when the human asks for an incoherent thing" contingency plans.
- (e.g. have multiple ways of connecting abstract models that have this incoherent thing as a basic labeled cog to other more grounded ways of modeling the world, and do some conservative averaging over those models to get a concept that tends to locally behave like humans expect the incoherent thing to behave on everyday cases.)
- Our big advantage over philosophy is getting to give up when we've asked for something impossible, and ask for something possible. That said, it's premature to declare any of the problems mentioned here impossible - but it means that case analysis type reasoning where we go "but what if it's impossible?" seems like it should be followed with "here's what we might try to do instead of the impossible thing."
It seems likely that there are concepts that aren't guaranteed to be learned by arbitrary AIs, even arbitrary AIs from the distribution of AI designs humans would consider absent alignment concerns, but still can be taught to an AI that you get to design and build from the ground up.
- So checking whether an arbitrary AI "knows a thing" is impressive, but to me seems impressive in a kind of tying-one-hand-behind-your-back way.
- Is getting to build the AI really more powerful? It doesn't seem to literally work with worst-case reasoning. Maybe I should try to find a formalization of this in terms of non-worst case reasoning.
  - The guarantees I'm used to rely on there being some edge that the thing we want to teach has. But as in this post, maybe the edge is inefficient to find. Also, maybe there is no edge on a predictive objective, and we need to lay out some kind of feedback scheme from humans, which has problems that are more like philosophy than learning theory.
The "assume the AI knows what's going on" test seems shaky in the real world. If we ask an AI to learn an incoherent concept, it will likely still learn some concept based on the training data that helps it get better test scores.

It seems to me like there are a couple of different notions of “being able to distinguish between mechanisms” we might want to use:

There exists an efficient algorithm which when run on a model input will output which mechanism model $M$ uses when run on $x$ .
There exists an efficient algorithm which when run on a model $M$ will output another programme $P$ such that when we run $P$ on a model input $x$ it will output (in reasonable time) which mechanism $M$ uses when run on $x$ .

Your overall picture sounds pretty similar to mine. A few differences.

I don't think the literal version of (2) is plausible. For example, consider an obfuscated circuit.
The reason that's OK is that finding the de-obfuscation is just as easy as finding the obfuscated circuit, so if gradient descent can do one it can do the other. So I'm really interested in some modified version of (2), call it (2'). This is roughly like adding an advice string as input to P, with the requirement that the advice string is no harder to learn than M itself (though this isn't exactly right).
I mostly care about (1) because the difficulty of (1) seems like the main reason to think that (2') is impossible. Conversely, if we understood why (1) was always possible it would likely give some hints for how to do (2). And generally working on an easier warm-up is often good.
I think that if (1) is false in general, we should be able to find some example of it. So that's a fairly high priority for me, given that this is a crucial question for the feasibility or character of the worst-case approach.
That said, I'm also still worried about the leap from (1) to (2'), and as mentioned in my other comment I'm very interested in finding a way to solve the harder problem in the case of primality testing.

Right now I'm trying to either:

Find another good example of a model behavior with two distinct but indistinguishable mechanisms.
Find an automatic way to extend the Fermat test into a correct primality test. Slightly more formally, I'd like to have a program which turns (model, explanation of mechanism, advice) --> (mechanism distinguisher), where the advice is shorter than the model+explanation, and where running it on the Fermat test with appropriate advice gives you a proper primality test.
Identify a crisp sense in which primes vs Carmichael numbers aren't really distinct mechanisms (but sensor tampering would be). If this happens then maybe the only reason the Fermat test feels like a good example is that I'm mischaracterizing it.

A. Harden sensors so that tampering is harder than the intended task
We could design and deploy a lot of redundant sensors, and do science and red-teaming to understand the possible principles by which those sensors can be compromised. In the modern world it’s kind of insane to imagine a human rebellion that not only succeeded but left no trace of itself.

I think there's a sense in which the Fermat test is a capability problem, not an interpretability/alignment problem.

Similarly, your AI can consider lots of actions until it finds one that it predicts will look great, then execute that one. So you get sensor tampering.

So the concern is that "the AI generates a random number, sees that it passes the Fermat test, and outputs it" is the same as "the AI generates a random action, sees that it passes [some completely opaque test that approves any action that either includes no tampering OR includes etheric interference], and outputs it", right?

Okay, that's literally tautological. I'm not sure this problem has any internal structure that makes it possible to engage with further, then.

It seems like there are plenty of hopes:

I have a single lame example of the phenomenon, which feels extremely unlike sensor tampering. We could get more examples of hard-to-distinguish mechanisms, and understand whether sensor tampering could work this way or if there is some property that means it's OK.
In fact it is possible to distinguish Carmichael numbers and primes, and moreover we have a clean dataset that we can use to specify which one we want. So in this case there is enough internal structure to do the distinguishing, and the problem is just that the distinguisher is too complex and it's not clear how to find it automatically. We could work on that.
We could try to argue that without a model of sensor tampering an AI has limited ability to make it happen, so we can just harden sensor enough that it can't happen by chance and then declare victory.

the opaque test is something like an obfuscated physics simulation

Models of world-models is a research direction I'm currently very interested in, so hopefully we can just rule that scenario out, eventually.

It seems like there are plenty of hopes

45

Can we efficiently distinguish different mechanisms?

45

Background

Are distinct mechanisms enough?

Roadmap

1. What might indistinguishable mechanisms look like?

Probabilistic primality tests

The analogy

2. Are distinct mechanisms efficiently distinguishable?

A. This is a striking claim and judging counterexamples is hard

B. Automatically finding a good probabilistic primality test seems hard

3. Being unable to distinguish mechanisms is bad news

4. Approaches to sensor tampering assuming indistinguishable mechanisms

A. Harden sensors so that tampering is harder than the intended task

B. Detect sensor tampering that requires “trying”

C. Assume that your AI “knows what’s going on”

Conclusion