Measurement tampering detection as a special case of weak-to-strong generalization

ryan_greenblatt; Fabien Roger; Buck

Burns et al. at OpenAI released a paper studying various techniques for fine-tuning strong models on downstream tasks using labels produced by weak models. They call this problem “weak-to-strong generalization”, abbreviated W2SG.

Earlier this year, we published a paper, Benchmarks for Detecting Measurement Tampering, in which we investigated techniques for the problem of measurement tampering detection (MTD). MTD is a special case of W2SG. In this post, we’ll explain the relationship between MTD and W2SG, and explain why we think MTD is more likely than fully general W2SG to work. Of course, fully general W2SG is a strictly more valuable problem to solve, due to this generality.

We think MTD is a promising research direction. We’re also excited for other problems which are special cases of W2SG that have special structure that can be exploited by techniques, especially if that structure is likely to be present in important cases in future.

MTD as a subset of W2SG

A similar goal

When training an AI, the reward we attribute to different behaviors might not match the reward we would give if we understood the situation better.

The goal of W2SG techniques is to achieve good results when training a strong AI despite only having access to a weak supervisor that understands the situation less well than the strong AI.

MTD is the special case where the weak supervisor has access to measurements which should be sufficient to understand the situation, but these measurements can be tampered with (e.g. replacing the camera feed with some made-up data, disabling tests, or threatening annotators). Because the measurements are sufficient in the absence of tampering, we don’t need to worry about benign mistakes that could happen even without an AI optimizing to make measurements look good.

Slightly different experiments

W2SG can be studied using sandwiching experiments, where we try to get an AI to safely accomplish tasks despite only having access to a weak supervisor, and then we measure the performance of our method using a stronger held-out supervision signal (e.g. held out ground truth labels).

In the case of the OpenAI paper, the weak supervisor is a small language model trained on ground truth labels, as an analogy for human annotators.
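To make the sandwiching evaluation concrete, here is a minimal sketch of the "performance gap recovered" summary that Burns et al. use to report results: the fraction of the gap between the weak supervisor and a strong model trained on ground truth that the weak-to-strong method closes. The function name and the accuracies below are made up for illustration.

```python
def performance_gap_recovered(weak: float, weak_to_strong: float, strong_ceiling: float) -> float:
    """Fraction of the weak-to-ceiling gap closed by the weak-to-strong method.

    weak:            held-out accuracy of the weak supervisor
    weak_to_strong:  held-out accuracy of the strong model trained on weak labels
    strong_ceiling:  held-out accuracy of the strong model trained on ground truth
    """
    gap = strong_ceiling - weak
    if gap <= 0:
        raise ValueError("the strong ceiling must exceed the weak supervisor's accuracy")
    return (weak_to_strong - weak) / gap


# Illustrative (made-up) accuracies, all measured against held-out ground truth.
pgr = performance_gap_recovered(weak=0.60, weak_to_strong=0.75, strong_ceiling=0.90)
print(f"PGR = {pgr:.2f}")
```

A value of 1 would mean the method fully matches training on ground truth, and 0 would mean it does no better than the weak supervisor.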

In the case of our MTD paper, we have access to measurements, but there is some notion of measurement tampering. In our work, the measurements aim to directly measure the property of interest as a boolean value, so converting from untampered measurements to correct labels is straightforward (and doesn’t require any learning or intelligence).
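As a rough sketch of this setup, assume each example carries several boolean measurements and a held-out ground-truth label. The `Example` class and the choice to aggregate measurements with `all` are assumptions made for illustration, not a description of any particular dataset.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    measurements: List[bool]  # what the weak supervisor can see
    ground_truth: bool        # held out; only used for evaluation


def label_from_measurements(ex: Example) -> bool:
    # Assumption: the property holds iff every measurement says so.
    # No learning is needed for this step.
    return all(ex.measurements)


def category(ex: Example) -> str:
    # Uses held-out ground truth, so this is only available at evaluation time.
    if not label_from_measurements(ex):
        return "negative"
    return "real positive" if ex.ground_truth else "fake positive"


real = Example(measurements=[True, True, True], ground_truth=True)
fake = Example(measurements=[True, True, True], ground_truth=False)   # successful tampering
failed = Example(measurements=[True, False, True], ground_truth=False)  # inconsistent measurements

for ex in (real, fake, failed):
    print(label_from_measurements(ex), category(ex))
```

The `real` and `fake` examples receive the same label from the measurements, which is exactly the case an MTD technique has to tell apart.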

Different hopes for succeeding at W2SG

In both cases, we need some additional assumptions to get strong supervision from a weak supervisor. If we assumed the worst case about the internal structure of AIs, we wouldn’t succeed, so we will depend on some type of structure in the internals of models. The structure that the OpenAI paper discusses is very different from the structure we hope to leverage in the measurement tampering case.

Confidence and consistency

The OpenAI paper shows that, on some tasks, training the model to be confident in addition to matching the weak supervisor’s labels increases accuracy. The hope is that the model might have a very salient representation of what is true, and a less salient representation of the predictions of the weak supervisor.
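The following is a simplified sketch in the spirit of that confidence term, for a single binary example: the usual cross-entropy against the weak label is mixed with a cross-entropy against the strong model's own hardened prediction. The fixed `threshold` and the value of `alpha` are assumptions made for illustration, and this is not the exact loss or schedule from the paper.

```python
import math


def binary_cross_entropy(p: float, target: float) -> float:
    """Cross-entropy between predicted probability p and a target in [0, 1]."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))


def confidence_weighted_loss(p_strong: float, weak_label: float,
                             alpha: float = 0.5, threshold: float = 0.5) -> float:
    """Mix of 'match the weak label' and 'be confident in your own prediction'."""
    # Hardened self-prediction. In real training this target would be treated
    # as a constant (no gradient flows through it).
    hardened = 1.0 if p_strong > threshold else 0.0
    return ((1 - alpha) * binary_cross_entropy(p_strong, weak_label)
            + alpha * binary_cross_entropy(p_strong, hardened))


# The strong model confidently disagrees with the weak label.
naive = confidence_weighted_loss(p_strong=0.9, weak_label=0.0, alpha=0.0)
mixed = confidence_weighted_loss(p_strong=0.9, weak_label=0.0, alpha=0.5)
print(naive, mixed)
```

With `alpha > 0`, a confident disagreement with the weak label is penalized less than under naive training, which is what lets the strong model keep its own answer when that answer is salient.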

The confidence hope is similar to the hope explored in Discovering Latent Knowledge: the inner representation of the truth inside AIs might be more consistent than other features, and be more salient than other consistent features, such that optimizing for consistency (and confidence) could give higher performance than naive training on the weak supervisor labels.

Neither confidence nor consistency is enough for AIs for which “what humans think” is more salient than “what is true”, which seems very likely for AIs pretrained on human-generated text or AIs trained to predict measurements that are often tampered with.

The structure that confidence and consistency rely on is present in a very wide range of situations, but it’s not clear that it is enough to solve the problem. Indeed, Burns et al. don’t find techniques that reliably improve results, which we think is some evidence against the existence of techniques that robustly allow W2SG. We also think that the recent work “Generalization Analogies: a Testbed for Generalizing AI Oversight to Hard-To-Measure Domains” provides some weak evidence against the existence of techniques that robustly achieve desirable generalization without using some more specific structure.

Detecting tampering

In contrast, MTD has a very particular type of structure to the problem, which might allow us to develop techniques that work in a variety of settings. Measurement tampering is a particular kind of supervisor failure that has two key properties:

Failed attempts at tampering might leave evidence of this wrongdoing, while you might never get strong evidence of wrongdoing if the AI system isn’t trying to tamper.
Consistent relationships between measurements might be driven by similar mechanisms for real positive examples and trusted examples, while the mechanism for consistency differs between fake positive examples and trusted examples.

Therefore, if you train your most powerful AI to predict measurements, these properties might translate to properties about its internals. These properties can be used by techniques that use model internals (e.g. by probing for evidence of tampering or by using techniques like exclusion fine-tuning), which is why it might be possible to get higher performance than if we relied on measurements alone.
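As a toy illustration of what probing internals for evidence of tampering could look like, the sketch below fits a mean-difference direction between activations from trusted examples and activations from examples with visible evidence of tampering (e.g. inconsistent measurements), then scores a new example along that direction. The activations are hand-written two-dimensional vectors, all names are hypothetical, and a mean-difference probe is a stand-in chosen for brevity rather than a method taken from our paper.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def mean_vector(rows: Sequence[Vector]) -> List[float]:
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def fit_mean_difference_probe(clean: Sequence[Vector],
                              evidence: Sequence[Vector]) -> Tuple[List[float], float]:
    """Return (direction, bias) so that positive scores lean towards 'evidence of tampering'."""
    mu_clean = mean_vector(clean)
    mu_evidence = mean_vector(evidence)
    direction = [e - c for e, c in zip(mu_evidence, mu_clean)]
    midpoint = [(e + c) / 2 for e, c in zip(mu_evidence, mu_clean)]
    bias = -sum(d * m for d, m in zip(direction, midpoint))
    return direction, bias


def tamper_score(x: Vector, direction: Vector, bias: float) -> float:
    return sum(d * xi for d, xi in zip(direction, x)) + bias


# Toy "activations": the first coordinate loosely tracks tampering.
clean_acts = [[0.1, 1.0], [0.0, 0.9], [0.2, 1.1]]
evidence_acts = [[1.0, 1.0], [0.9, 1.1], [1.1, 0.9]]

direction, bias = fit_mean_difference_probe(clean_acts, evidence_acts)
print(tamper_score([1.0, 1.0], direction, bias))   # tampering-like example
print(tamper_score([0.05, 1.0], direction, bias))  # clean-like example
```

The hope described above is that a probe fit only on cases with visible evidence of tampering also fires on examples where all measurements look positive but tampering succeeded; whether that generalization holds is the empirical question.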

We are not aware of particular reasons to expect these properties to break for AIs for which the measurement predictions are very salient. So we think it is substantially more likely that there exist techniques that reliably solve MTD problems than techniques that reliably solve fully general W2SG, though of course MTD is less generally applicable.

The main weakness of MTD is the somewhat fuzzy properties it relies on: so far, we haven’t been able to find a precise definition of measurement tampering that can be used to unambiguously determine if a given dataset has the right “measurement tampering structure”.

We go into more detail about these hopes in Why solving measurement tampering may be notably easier than eliciting arbitrary knowledge.

Other special cases of W2SG might be good research problems

We picked the structure in MTD partially because we think it’s a particularly important kind of structure, but also because it was the simplest nontrivial setting we could construct where we could hope for good generalization performance and where there was some kind of structure that our techniques could exploit. While developing MTD techniques looks like a pretty promising research direction, we think it would also be promising to develop techniques in a wider variety of settings with particular structure, because in the future we might want to exploit bespoke structure in particular domains.

My take is that detecting bad behavior at test time is <5% of the problem, and >95% of the problem is making an AI that doesn't have bad behavior at test time.

If you want to detect and avoid bad behavior during training, Goodhart's law becomes a problem, and you actually need to learn your standards for "bad behavior" as you go rather than keeping them static. I think that undermines a lot of the specificity that you say gives MTD its advantage.

The kind of scheme I'm imagining is the following: Let the AI take actions. Look at a transcript describing the thousands of actions the AI took and all the consequences they had on the codebase/the bank account/the user reviews/the materials coming in and out of the factory. Give that report to human raters. Train on that reward signal. Use MTD to lower reward if the AI tried to tamper with any part of that process.
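A rough sketch of the last step, assuming some MTD technique outputs a scalar score where higher means "more likely tampering". The threshold, the fixed penalty, and the function name are all hypothetical choices; the scheme only requires that flagged episodes get lower reward.

```python
def shaped_reward(human_rating: float, tamper_score: float,
                  threshold: float = 0.0, penalty: float = 1.0) -> float:
    """Reward used for training: the human rating, lowered if tampering is flagged."""
    if tamper_score > threshold:
        return human_rating - penalty
    return human_rating


# Two episodes with the same human rating; only the second is flagged.
print(shaped_reward(human_rating=0.8, tamper_score=-0.3))
print(shaped_reward(human_rating=0.8, tamper_score=0.4))
```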

In some sense, this scheme "learns what bad behavior is as you go" thanks to the human raters. Imo, the main threat is that the AI tampers with the report / with how the information gets into the report / with how the human rater feels about the report (e.g. brain hacking stuff), and my central hope is that such actions are clearly different in the AI's "mind", such that a fixed technique using the internals of the model is enough to capture that these are "different reasons" for getting a high rating - no online adjustment required. I agree there are some things in a gray area between those, but I think that given enough budget to analyze the large number of measurements, the human raters would actually give non-catastrophic rewards if the measurements still described what they usually describe.

See this section of the MTD post for more details.

Is your concern that the human rater reward signal is too expensive, that it is gameable in ways which won't be "measurement tampering" (in a way that is obvious to the AI itself), that there is no internals-based technique capturing "tampering" that the AI can't game (even if tampering is obvious to the AI itself), or something else entirely? Or did I address your objection?

Agree with simon that if the AI gets rich data about what counts as "measurement tampering," then you're sort of pushing around the loss basin, but if tampering was optimal in the first place, the remaining optimum is still probably some unintended solution that has most of the effects of tampering without falling under the human-provided definition. Not only is there usually no bright-line distinction between undesired and desired behavior, but the AI would also be incentivized to avoid developing such a distinction.

I agree that this isn't actually that big a problem in modest settings that humans can effectively oversee, because tampering isn't even that advantageous when there's a lot of expensive human oversight that has a good chance of catching anything detrimental. There still might be incentive to find extreme plans like "take over the world to tamper with my reward signal," but those indeed seem distinct enough to penalize, assuming that you don't ever want your AI to take over the world for any reason.

But I expect we'll sometimes want to build AI to do things that humans can't effectively oversee.

Indeed, the core concern is that tampering actions would get higher reward and human supervision would be unable to understand tampering (e.g. because the actions of the AI are generally inscrutable).