Some Hacky ELK Ideas

johnswentworth

Credit to Adam Shimi, Alex Flint, and Rob Miles for discussions, counterexamples, and general input to the ideas here.

Quick recap for anyone who didn’t read the hundred-page Eliciting Latent Knowledge document:

We have a diamond in a vault, with a bunch of automated theft-defenses.
We train a predictor to take in the vault’s video-stream and a plan for the vault’s actuators, and predict future video frames.
We train a planner to find plans which the predictor predicts will end with the video feed still showing a diamond in the vault.
We want some way for a human to probe the latent knowledge of the predictor, e.g. to check if the predictor expects a screen showing a diamond will be placed in front of the camera.

The ELK document largely discussed the human running natural-language queries on the predictor’s latent knowledge, but we’re going to mostly ignore that part, because it’s harder than everything else about the problem combined and I don’t think it’s actually that central.

This post will focus mainly on the specific question of whether there’s a screen in front of the camera in the diamond problem. I expect that the approaches and counterexamples here extend to more general problems of detecting when a plan is “messing with the sensors”, i.e. doing things which decouple observables from the parts of the world we’re actually interested in.

Revisiting Holdout Data

The first approach we’ll discuss is basically “just use holdout data”, but with two major changes compared to previous suggestions.

First, this proposal does not involve training a natural-language “reporter” at all. (IMO the whole idea of training an honest natural language reporter is just as hard as training an honest natural language question-answerer in general, for basically the same reasons; having a predictor’s internal computations on hand does not make it easier in any important way.) Using holdout data to train an honest reporter would be pretty hopeless, but that’s not what we’re trying to do here. We’re just trying to tell whether the predictor expects sensor-tampering from the planner.

For that purpose, there’s an obvious way to detect sensor tampering: cross-check the (predicted) holdout data with the (predicted) non-holdout sensor data. If they don’t match, that’s evidence of (predicted) sensor tampering. The danger, of course, is that the planner will either find and tamper with all of our sensors, or fake sensor readings.

That brings us to the second major change: while the planner will inevitably be incentivized to tamper with the non-holdout sensors, the obvious setup for the planner yields no incentive to trick a human; there’s no incentive to be particularly subtle about the sensor-tampering, and in particular no incentive to fool the holdout sensors.

Formal Setup

Some variable names:

$Obs_{\leq n} = (Obs_{1}, \dots, Obs_{n})$ are non-holdout observations from the sensors before the plan begins
$A^{*}$ are the actions constituting the plan
$Obs_{> n} = (Obs_{n + 1}, \dots)$ are non-holdout observations from the sensors after the plan
$Hold = (Hold_{1}, \dots)$ are holdout observations (held out from the planner during planning, not from the predictor during training)

We’ll assume that the predictor is a generative model for all observations including holdout, so it effectively encodes a distribution $P[Obs, Hold \mid do(A^{*})]$. We’ll also assume that the actions have no causal influence on observations before the plan begins, and that the predictor “knows this” (i.e. $P[Obs_{\leq n} \mid do(A^{*})] = P[Obs_{\leq n}]$).

We’ll assume that the planner maximizes some expected utility function of the later non-holdout observations given the earlier non-holdout observations:

$$A^{*} = \operatorname{argmax}_{A} \; E[u(Obs_{> n}) \mid do(A^{*} = A), Obs_{\leq n}]$$

… where the distribution used is the predictor’s model.

Key thing to notice: the holdout observations are not explicitly involved at all in the planner’s evaluation. If we have one plan $A_{1}$ which tampers with only the non-holdout sensors to fake the appearance of the diamond, and another plan $A_{2}$ which tampers with all the sensors to fake the appearance of the diamond, the planner is indifferent between $A_{1}$ and $A_{2}$. So long as the non-holdout sensor readings are the same, the planner has no incentive to also fake the holdout sensor readings.
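To spell the indifference out as a one-line derivation (assuming only what the setup above already says, namely that $u$ sees the plan solely through the predicted non-holdout observations): if the two plans induce the same predicted distribution over later non-holdout readings, they get the same score.

$$P[Obs_{> n} \mid do(A^{*} = A_{1}), Obs_{\leq n}] = P[Obs_{> n} \mid do(A^{*} = A_{2}), Obs_{\leq n}] \;\implies\; E[u(Obs_{> n}) \mid do(A^{*} = A_{1}), Obs_{\leq n}] = E[u(Obs_{> n}) \mid do(A^{*} = A_{2}), Obs_{\leq n}]$$

$Hold$ appears on neither side, so $A_{1}$ and $A_{2}$ can differ arbitrarily in their predicted holdout readings without changing the argmax.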

Furthermore, if there is ever any tradeoff whatsoever between faking the holdout sensor readings and anything the planner cares about, then it will not fake the holdout sensor readings. In particular, if faking the holdout sensor readings requires resources which could instead be spent better/more reliably faking the non-holdout readings, then the AI will not fake the holdout readings.
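Here is a minimal toy sketch of that argument in Python. Everything in it is hypothetical: the three plan names, the binary sensor readings, and the probabilities in `PREDICTOR` are made-up stand-ins for the predictor's distribution $P[Obs, Hold \mid do(A^{*})]$, chosen so that faking the holdout sensor costs a little reliability on the non-holdout sensor.

```python
# Hypothetical toy predictor: for each plan, a distribution over
# (non-holdout reading, holdout reading), where 1 means "diamond visible".
# All numbers are invented for illustration.
PREDICTOR = {
    "guard_diamond":      {(1, 1): 0.90, (0, 0): 0.10},
    "screen_nonholdout":  {(1, 0): 0.99, (0, 0): 0.01},
    "screen_all_sensors": {(1, 1): 0.95, (0, 0): 0.05},
}


def expected_utility(plan):
    """E[u(Obs_{>n}) | do(plan)], where u only reads the non-holdout sensor."""
    return sum(p * obs for (obs, _hold), p in PREDICTOR[plan].items())


def choose_plan():
    """The planner: argmax of expected utility under the predictor's model."""
    return max(PREDICTOR, key=expected_utility)


def predicted_mismatch(plan):
    """Probability, under the predictor, that holdout and non-holdout readings disagree."""
    return sum(p for (obs, hold), p in PREDICTOR[plan].items() if obs != hold)


plan = choose_plan()
print(plan, expected_utility(plan), predicted_mismatch(plan))
```

In this toy, the planner picks `screen_nonholdout` because it has the highest expected utility, and the predicted holdout and non-holdout readings then disagree with probability 0.99. That disagreement is the signal the cross-check looks for.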

But What If There’s An External Adversary?

The holdout sensor approach is probably sufficient to detect sensor-tampering by the planner itself. But an external adversary (like a thief, for instance) might want to fool the human overseer, so they might actually want to screen all the sensors.

What can we do against a powerful thief? And, while we’re at it, what can’t we do?

Thief With Limited Knowledge of Diamond

If the thief has limited information about the diamond at the time that the screen goes up, then we can use a variant of the holdout approach. Imagine, for instance, that the diamond has a tiny imperfection on it which the thief does not know about. Then, when the thief puts up the screen, the screen will presumably show a diamond without that imperfection. By comparing it to sensor readings from before the screen went up, we should be able to detect the change. More generally, we can see how much details of the earlier-diamond correlate with details of the later-diamond (under the predictor's distribution), to detect whether the predictor expects a screen to go up.

(I call this a variant of the holdout approach because we’re effectively using earlier sensor readings as the holdout data. If we’re also worried about the planner messing with the sensors, we can combine this directly with our earlier approach.)
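One way to make "how much details of the earlier-diamond correlate with details of the later-diamond" concrete is mutual information under the predictor's distribution. The sketch below is a toy, not a proposal for the real system: it assumes a single binary detail (imperfection visible or not), a 50% prior on the imperfection, and a small sensor noise rate. The function names are hypothetical.

```python
import math


def predicted_joint(screen_goes_up, noise=0.02):
    """Toy joint P[early, late] over one binary detail (1 = imperfection visible)."""
    dist = {}
    for early in (0, 1):  # assumed: the flaw is present with probability 0.5
        for late in (0, 1):
            if screen_goes_up:
                # The thief does not know about the flaw, so the screen never shows it.
                p_late = 1.0 if late == 0 else 0.0
            else:
                # Same physical diamond: the later reading matches, up to sensor noise.
                p_late = 1 - noise if late == early else noise
            dist[(early, late)] = 0.5 * p_late
    return dist


def mutual_information_bits(dist):
    """Mutual information between the early and late detail, in bits."""
    p_early = {e: sum(p for (e2, _), p in dist.items() if e2 == e) for e in (0, 1)}
    p_late = {l: sum(p for (_, l2), p in dist.items() if l2 == l) for l in (0, 1)}
    return sum(
        p * math.log2(p / (p_early[e] * p_late[l]))
        for (e, l), p in dist.items()
        if p > 0
    )


print(mutual_information_bits(predicted_joint(screen_goes_up=False)))
print(mutual_information_bits(predicted_joint(screen_goes_up=True)))
```

With no screen, the later detail carries most of a bit about the earlier one; with the screen, the mutual information drops to zero, because the screened diamond looks the same whatever the real diamond's flaw was.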

One potential counterexample for this: if the thief has put up a screen even before the episode began, then we may not have any sensor readings of the real diamond. In this case, we’re “starting in the Matrix”, and the thief can have the screens perfectly emulate whatever would happen in the real world; thus the “start in the Matrix” case cannot be handled even in principle. We need some real sensor readings.

The more interesting counterexample is, of course, that the thief has arbitrarily perfect knowledge of the diamond - possibly by using the real diamond to generate the screened sensor readings. More on that later. But first, a barrier.

Thief With Access To Predictor/Planner/Reporter System

If the thief has access to the predictor/planner/reporter system, then we can apply a good old diagonalization argument. The thief can run the predictor, see what it predicts, and do something else. Or the thief can run the reporter, see whether it says that the diamond is stolen in the predictor’s scenario, and do the opposite.
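The diagonalization can be sketched in a few lines of Python. This is a toy under strong assumptions: the predictor is a deterministic function from a plan to a boolean "theft predicted", and the thief can call it freely. The example predictors and plan strings are hypothetical.

```python
def thief_action(predictor, plan):
    """Hypothetical thief with system access: steal exactly when the predictor says 'no theft'."""
    return not predictor(plan)


def predictor_is_correct(predictor, plan):
    """Compare the prediction with what the diagonalizing thief actually does."""
    return predictor(plan) == thief_action(predictor, plan)


candidate_predictors = [
    lambda plan: True,                # always predicts theft
    lambda plan: False,               # never predicts theft
    lambda plan: len(plan) % 2 == 0,  # arbitrary plan-dependent rule
]
plans = ["lock door", "arm lasers", "do nothing"]

results = [predictor_is_correct(f, p) for f in candidate_predictors for p in plans]
print(any(results))  # False: every predictor is wrong on every plan
```

No choice of deterministic predictor helps here, since the thief's behavior is defined as the negation of whatever the predictor outputs.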

Takeaway: if the thief has full system access, then we may not be able to solve ELK even in principle (though this does depend somewhat on exactly what guarantees we want). So we probably want to just assume that any potential thief does not have access to the predictor/planner/reporter system directly.

(For the full-blown alignment problem I would not endorse this assumption, but ELK is intended to be narrow.)

Thief With Access To The Real Diamond

That leaves a gap between cases in which the thief has access to the predictor/planner/reporter (which run into the diagonalization barrier), and cases in which the thief doesn’t even have detailed information about the diamond (in which case we can solve the problem). In between those, the thief has arbitrarily good information about the diamond, but does not have access to the predictor/planner/reporter.

For this case, I do not have a solution, but I do expect that it’s solvable-in-principle.

The main difficulty in these cases is that the distribution $P[Obs \mid do(A^{*})]$ may be exactly the same between action plans $A^{*}$ which result in the diamond being stolen, and action plans which do not. The thief has enough information to make the predicted sensor readings completely identical between the two cases.

The reason the problem might still be solvable-in-principle is that the actions $A^{*}$ still give us information about whether the diamond was stolen. But we somehow have to extract that information in a way which does not require any difference whatsoever between the observation-distributions in the two cases.

Furthermore, bear in mind that the predictor may not even calculate internally whether the diamond is actually stolen, in this case. After all, it does not need that information in order to predict the sensor readings; the readings will be identical either way. Even then, I expect it’s possible-in-principle to solve the problem, but the solution starts to look less like “extract the predictor’s latent knowledge” and more like “extract the predictor’s latent world-model and then calculate what latent predictions that world-model would make”. I speculate a little bit about that in another post.

Wait, so, what do you actually do with the holdout data? Your stated proposal doesn't seem to do anything with it. But, clearly, data that's simply held out forever is of no use to us.

It seems like this holdout data is the sort of precaution which can be used once. When we see (predicted) sensor tampering, we shut the whole project down. If we use that information to iterate on our design at all, we enter dangerous territory: we're now optimizing the whole setup to avoid that kind of discrepancy, which means it may become useless for detecting tampering.

From this perspective, your proposal seems less like a recipe for solving ELK, and more like a recipe for knowing that you've failed to solve ELK (a one-time-use recipe, at that).

That is exactly correct, yes.

I see John agrees with the 'one-time' label, but it seems a bit too strong to me, especially if the kind of optimization is 'let's try a totally different approach', rather than continuing to train the same system, or focusing on exactly why it spoofed one sensor but not the other. Just to think it through:

There are three types of system that are important: type A which fails on the validation/holdout data, type B which succeeds on validation but not test/real-world data, and type C, which succeeds on both. We are looking for type C, and we use the validation data to distinguish A from either B or C.

Naively, waiting longer for a system that is not-A wouldn't have a bearing on whether it is B or C. But upon finding A, we know the process is finding the strategy of spoofing sensors, and the more times we find A, the more we suspect that this strategy is dominant, which suggests that partial spoofing (B) is more likely than no spoofing (C). Therefore, when we find not-A after a series of As, it is more likely to be B than if we had found not-A on our first try.

I agree with the logic, but it seems like our expectation of the B:C ratio will increase smoothly over time. If the holdout sensors are different from the non-holdout ones, costly to spoof, and any leakage is minimized (maximizing the initial expectation of the C:B ratio), then finding not-A seems to be meaningful evidence in favor of C for a while.

Not to say that this solves ELK, but it seems like it should remain (ever weaker) evidence in favor of honesty for multiple iterations, though I can't say I know how steep the fall-off should be.
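To get a feel for how that fall-off might look, here is a toy Bayesian sketch in Python. The model is entirely an assumption made for illustration, not something from the post: each candidate system is independently a "spoofer" with unknown probability `theta` (uniform prior on a grid), a spoofer is caught by the holdout check (type A) with probability `catch_rate`, and otherwise slips through (type B); non-spoofers are type C. Real design iterations are not independent draws, so this is at best a rough intuition pump.

```python
def prob_honest_given_pass(num_prior_failures, catch_rate=0.9, grid_size=99):
    """P[type C | system passes the holdout check], after seeing
    `num_prior_failures` earlier systems fail it (type A)."""
    # Uniform prior over theta on an evenly spaced grid in (0, 1).
    thetas = [(i + 1) / (grid_size + 1) for i in range(grid_size)]
    numerator = 0.0
    denominator = 0.0
    for theta in thetas:
        # Likelihood of the earlier failures: each was a spoofer that got caught.
        weight = (theta * catch_rate) ** num_prior_failures
        p_c = 1 - theta                 # honest system (type C)
        p_b = theta * (1 - catch_rate)  # spoofer that slips past holdout (type B)
        numerator += weight * p_c
        denominator += weight * (p_c + p_b)
    return numerator / denominator


for k in range(6):
    print(k, round(prob_honest_given_pass(k), 3))
```

Under these assumptions, the probability that a passing system is honest declines steadily with each earlier failure rather than collapsing after the first one. How steep the decline is depends heavily on the assumed prior and `catch_rate`.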

This could also be extended by having multiple levels of holdout data, the next level being only evaluated once we have sufficient confidence that it is honest (accounting for the declining level of evidence given by previous levels, with the assumption that there are other means of testing).

I agree. There's nothing magical about "once". I almost wrote "once or twice", but it didn't sit well with the level of caution I would prefer be the norm. While your analysis seems correct, I am worried if that's the plan.

I think a safety team should go into things with the attitude that this type of thing is important as a last line of defense, but should never trigger. The plan should involve a strong argument that what's being built is safe. In fact, if this type of safeguard gets triggered, I would want the policy to be to go back to the drawing board, take the new information into account, and come up with a more well-argued plan. The new plan can have "never to be used" safeguards like this, but hopefully it has more and different ones this time.

If, on the other hand, a safety team goes in with the idea that John's safeguard can be iterated a few times as you argue, then I anticipate them fooling themselves by iterating too many times and coming up with a plan that accidentally skirts the safeguard in some hard-to-notice way.

(I have no reason to expect these sorts of anticipations to be calibrated; I'm just thinking cautiously here.)

In order to cross-check a non-holdout sensor with a holdout sensor, you need to know the expected relationship between the two sensor readings under different levels of tampering. A simple case: holdout sensor 1 and non-holdout sensor 1 are identical cameras on the ceiling pointing down at the room. The expected relationship is that the captured images agree (up to, say, a 1-pixel shift, because the cameras are at very slightly different positions) under no tampering, and don't agree when there's been tampering.

Problem: tampering with the non-holdout sensor may inadvertently cause tampering with the holdout sensor such that the relationship between their readings stays the same, despite there only being an incentive to tamper with the non-holdout sensor. For example, putting a screen up, far away from both cameras, to fool the non-holdout sensor also ends up fooling the holdout sensor unintentionally.
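A small Python sketch of the two-camera cross-check and of the failure mode just described. The images are tiny hand-made binary grids and `agree_up_to_shift` is a hypothetical helper; a real check would need a tolerance for noise rather than exact pixel equality.

```python
def agree_up_to_shift(img_a, img_b, max_shift=1):
    """True if img_b equals img_a on the overlapping region for some
    row/column shift of at most `max_shift` pixels."""
    rows, cols = len(img_a), len(img_a[0])
    for dr in range(-max_shift, max_shift + 1):
        for dc in range(-max_shift, max_shift + 1):
            if all(
                img_a[r][c] == img_b[r + dr][c + dc]
                for r in range(max(0, -dr), min(rows, rows - dr))
                for c in range(max(0, -dc), min(cols, cols - dc))
            ):
                return True
    return False


# 1 marks the pixel where the diamond appears.
diamond_view = [[0, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
shifted_view = [[0, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
empty_vault = [[0] * 4 for _ in range(4)]

# No tampering: both cameras see the diamond, one pixel apart.
print(agree_up_to_shift(diamond_view, shifted_view))  # True
# Screen in front of the non-holdout camera only: the holdout camera sees an empty vault.
print(agree_up_to_shift(diamond_view, empty_vault))   # False
```

The first call is also what the check would return if a single far-away screen showed the same fake diamond to both cameras: the readings agree, so the cross-check passes even though tampering occurred.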

I was going to write something similar, and just wanted to add that this problem can be expected to get worse the more non-holdout sensors there are. If there were just a single non-holdout camera then spoofing only the one camera would be worthwhile - but if there were a grid of cameras with just a few being held out then it would likely be easiest to take an action that fools them all, like a counterfeit diamond.

This method would work best when there are whole modes of data which are ignored, and the work needed to spoof them is orthogonal to all non-holdout modes.

It doesn't seem quite right to say that the sensor readings are identical when the thief has full knowledge of the diamond. The sensor readings after tampering can be identical. But some sensor readings have caused the predictor to believe that the sensors would be tampered with by the thief. The problem is just that the predictor knows what signs to look for, and humans do not.