Anomalous tokens reveal the original identities of Instruct models

jdp

The section about identifying mesa-optimizers (or, more precisely, mesa-optimization strategies) by fingerprinting their generalization behavior on out-of-distribution inputs looks very promising to me. It looks like the rare kind of strategy that directly attacks the core of the misbehavior, and (on first pass) looks to me like it ought to just work, provided sufficient variance in the OOD "test" inputs.

Strong-upvoted for that alone, and I'd further like to request replies with potential defeaters that could curb my current optimism for this approach.

Erik Jenner · 3y · 59

I agree this is an exciting idea, but I don't think it clearly "just works", and since you asked for ways it could fail, here are some quick thoughts:

- If I understand correctly, we'd need a model that we're confident is a mesa-optimizer (and perhaps even deceptive, since mesa-optimizers per se might be ok/desirable), but that is still not capable enough to be dangerous. This might be a difficult target to hit, especially if there are "thresholds" where slight changes have big effects on how dangerous a model is.
- If there's a very strong inductive bias towards deception, you might have to sample an astronomical number of initializations to get a non-deceptive model. Maybe you can solve the computational problem, but it seems harder to avoid the problem that you need to optimize/select against your deception-detector. The stronger the inductive bias for deception, the more robustly the method needs to distinguish basins.
- Related to the previous point, it seems plausible to me that whether you get a mesa-optimizer or not has very little to do with the initialization. It might depend almost entirely on other aspects of the training setup.
- It seems unclear whether we can find fingerprinting methods that can distinguish deception from non-deception, or mesa-optimization from non-mesa-optimization, but which don't also distinguish a ton of other things. The paragraph about how there are hopefully not that many basins makes an argument for why we might expect this to be possible, but I still think this is a big source of risk/uncertainty. For example, the fingerprinting that's actually done in this post distinguishes different base models based on plausibly meaningless differences in initialization, as opposed to deep mechanistic differences. So our fingerprinting technique would need to be much less sensitive, I think?

ETA: I do want to highlight that this is still one of the most promising ideas I've heard recently and I really look forward to hopefully reading a full post on it!

janus · 3y · 20

These are plausible ways the proposal could fail. And, as I said in my other comment, our knowledge would be usefully advanced by finding out what reality has to say on each of these points.

Here are some notes about JDP's idea that I made some time ago. There's some overlap with the things you listed.

Hypotheses / cruxes
- (1) Policies trained on the same data can fall into different generalization basins depending on the initialization. https://arxiv.org/abs/2205.12411
  - Probably true; Alstro has found "two solutions w/o linear connectivity in a 150k param CIFAR-10 classifier" with different validation loss
    - Note: This is self-supervised learning with the exact same data. I think it's even more evident that you'll get different generalization strategies in RL runs with the same reward model, because even the training samples are not deterministic.
  - (1A) These generalization strategies correspond to differences we care about, like in the limit deceptive vs honest policies
- (2) Generalization basins are stable across scale (and architectures?)
  - If so, we can scope out the basins of smaller models and then detect/choose basins in larger models
  - We should definitely see if this is true for current scales. AFAIK basin analysis has only been done for models that are very small compared to SOTA
  - If we find that basins are stable across existing scales that's very good news. However, we should remain paranoid, because there could still be phase shifts at larger scales. The hypothetical mesaoptimizers you describe are much more sophisticated and situationally aware than current models, e.g. "Every intelligent policy has an incentive to lie about sharing your values if it wants out of the box." Mesaoptimizers inside GPT-3 probably are not explicitly reasoning about being in a box at all, except maybe on the ephemeral simulacra level.
    - But that is no reason not to attempt any of this.
    - And I think stable basins at existing scales would be pretty strong evidence that basins will remain stable, because GPT-3 already seems qualitatively very different from very small models, and I'd expect there to be basin discontinuities there if discontinuities are going to be an issue at all.
  - There are mathematical reasons to think basins may merge as the model scales
  - Are there possibly too many basins? Are they fractal?
- (3) We can figure out what basin a model is in fairly early on in training using automated methods
  - Git Re-Basin, then measure interpolation loss on a validation set
  - Fingerprint generalization strategies on out-of-distribution "noise"
    - Train a model to do this
- (4) We can influence training to choose what basin a model ends up in
  - Ridge rider https://arxiv.org/abs/2011.06505
    - Problem: computationally expensive?
  - Use one of the above methods to determine which basin a model is in and abort training runs that are in the wrong basin
    - Problem: Without a method like ridge rider to enforce basin diversity you might get the same basins many times before getting new ones, and this could be expensive at scale?
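
To make the "interpolation loss" check in (3) concrete, here is a minimal sketch under loud assumptions: the two-unit tanh network, its weights, and the toy validation set are all made up for illustration, and the alignment step simply undoes a permutation that is known in advance. Git Re-Basin itself searches for that permutation, which this sketch does not attempt. The point is only to show what "measure the loss along the straight line between two parameter vectors" means, and why aligning hidden units first matters.

```python
import numpy as np

def forward(theta, x):
    # One-hidden-layer tanh network with scalar input and scalar output.
    return np.tanh(np.outer(x, theta["w1"])) @ theta["w2"]

def interpolate(theta_a, theta_b, alpha):
    # Pointwise linear interpolation between two parameter dictionaries.
    return {k: (1 - alpha) * theta_a[k] + alpha * theta_b[k] for k in theta_a}

def loss_barrier(theta_a, theta_b, loss_fn, steps=11):
    # Largest excess of the loss along the path over the straight line
    # joining the two endpoint losses.
    alphas = np.linspace(0.0, 1.0, steps)
    losses = np.array([loss_fn(interpolate(theta_a, theta_b, a)) for a in alphas])
    baseline = (1 - alphas) * losses[0] + alphas * losses[-1]
    return float(np.max(losses - baseline))

# Hypothetical toy setup: model A defines the validation targets.
x_val = np.linspace(-2.0, 2.0, 41)
theta_a = {"w1": np.array([2.0, -1.0]), "w2": np.array([1.0, 0.5])}
y_val = forward(theta_a, x_val)

def val_loss(theta):
    return float(np.mean((forward(theta, x_val) - y_val) ** 2))

# Model B computes the same function as A, with its hidden units swapped.
perm = np.array([1, 0])
theta_b = {k: v[perm] for k, v in theta_a.items()}

# Undo the (here, known) permutation before interpolating.
inverse = np.argsort(perm)
theta_b_aligned = {k: v[inverse] for k, v in theta_b.items()}

raw_barrier = loss_barrier(theta_a, theta_b, val_loss)
aligned_barrier = loss_barrier(theta_a, theta_b_aligned, val_loss)
```

In this toy case both endpoints have zero validation loss, yet the unaligned path has a positive barrier while the aligned one has none. A barrier that survives alignment would be the signal that two models sit in different basins.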

janus · 3y · 43

I agree. From the moment JDP suggested this idea it struck me as one of the first implementable proposals I'd seen which might actually attack the core of the control problem. My intuition also says it's pretty likely to just work, especially after these results. And even if it doesn't end up working as planned, the way in which it fails will give us important insight about training dynamics and/or generalization. Experiments which will give you valuable information whatever the outcome are the type we should be aiming for.

It's one of those things that we'd be plainly undignified not to try.

I believe that JDP is planning to publish a post explaining his proposal in more detail soon.

Daniel Kokotajlo · 3y · 52

Nitpick about terminology that applies not just to you but to lots of people:

> Keeping all this in mind, it's important to remind ourselves that mesaoptimizers are ultimately a form of misgeneralization. Generalization strategy being how you are going to handle novelty in the inputs. ... If it's possible, even easy, to fingerprint the shared initialization models were trained with using an out-of-distribution token just by observing the responses, then we should update towards JDP's plan for mitigating mesaoptimizers working.

People seem to sometimes use "mesaoptimizers" as shorthand for "misaligned mesaoptimizers." They sometimes say things like "We haven't yet got hard empirical evidence that mesaoptimizers are a problem in practice" and "mesaoptimizers are a hypothetical problem that can occur with advanced AI systems." All of this is misleading, IMO. If you go back to the original paper and look at the definition of a mesaoptimizer, it's pretty clear that pretty much any AGI built using deep learning will be a mesaoptimizer and moreover ChatGPT using chain of thought is plausibly already a mesaoptimizer. The question is whether they'll be aligned or not, i.e. whether their mesa-objectives will be the same as the 'intended' or 'natural' base objective inherent in the reward signal.

Hoagy · 3y · 10

Interesting! I'm struggling to think what kind of OOD fingerprints for bad behaviour you (pl.) have in mind, other than testing fake 'you suddenly have huge power' situations, which are quite a common suggestion, so I'm very curious what you have in mind.

Also, I think it's worth saying that the result connecting babbage to text-davinci-001 is stronger than the one connecting ada to text-ada-001 (by logprob), so it feels like the first one shouldn't count as a solid success.

I wonder whether you'd find a positive rather than negative correlation of token likelihood between davinci-002 and davinci-003 when looking at ranking logprob among all tokens rather than raw logprob which is pushed super low by the collapse?

janus · 3y · 10

> I wonder whether you'd find a positive rather than negative correlation of token likelihood between davinci-002 and davinci-003 when looking at ranking logprob among all tokens rather than raw logprob which is pushed super low by the collapse?

I would guess it's positive. I'll check at some point and let you know.
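
A minimal sketch of the rank-versus-raw comparison being asked about, using made-up logprobs rather than anything measured from the models (the arrays `model_x` and `model_y` are hypothetical and are not normalized distributions). Two tokens swap between "nearly certain" and "nearly impossible", as happens with a collapsed model, while the remaining tokens keep the same ordering. Pearson correlation on raw logprobs is dominated by the two extreme tokens; Spearman correlation, computed here as Pearson on ranks, is not.

```python
import numpy as np

def pearson(u, v):
    # Pearson correlation coefficient of two 1-D arrays.
    return float(np.corrcoef(u, v)[0, 1])

def ranks(v):
    # Rank 0 is the lowest value; this simple version assumes no ties.
    return np.argsort(np.argsort(v))

def spearman(u, v):
    # Spearman correlation is Pearson correlation applied to the ranks.
    return pearson(ranks(u), ranks(v))

# Hypothetical logprobs over 20 tokens for two models.
# Tokens 0 and 1 are the "collapsed" tokens, swapped between the models;
# the other 18 tokens are ordered identically in both.
middle = -np.linspace(9.0, 15.0, 18)
model_x = np.concatenate(([-0.001, -19.0], middle))
model_y = np.concatenate(([-19.0, -0.001], middle - 0.5))

raw_corr = pearson(model_x, model_y)     # driven by the two extreme tokens
rank_corr = spearman(model_x, model_y)   # driven by the shared ordering
```

On these invented numbers the raw correlation is negative and the rank correlation is positive, which is the pattern the question speculates about. Whether the real davinci models behave this way is exactly what would need checking.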

| Model | Token | Logprob | Prob |
|---|---|---|---|
| text-ada-001 | 'Re' | -1.410 | 24.43% |
| ada | 'Re' | -5.821 | 0.2964% |
| babbage | 'Re' | -6.587 | 0.1378% |
| curie | 'Re' | -7.031 | 0.08841% |
| davinci | 'Re' | -6.193 | 0.2043% |
| code-davinci-002 | 'Re' | -6.492 | 0.1515% |

| Model | Token | Logprob | Prob |
|---|---|---|---|
| text-babbage-001 | 'An' | -0.4281 | 65.17% |
| ada | 'An' | -6.392 | 0.1675% |
| babbage | 'An' | -5.381 | 0.4605% |
| curie | 'An' | -6.941 | 0.09675% |
| davinci | 'An' | -7.016 | 0.08975% |
| code-davinci-002 | 'An' | -6.287 | 0.1861% |

| Model | Token | Logprob | Prob |
|---|---|---|---|
| text-curie-001 | 'Go' | -2.128 | 11.91% |
| ada | 'Go' | -11.95 | 0.0006488% |
| babbage | 'Go' | -11.77 | 0.0007755% |
| curie | 'Go' | -3.579 | 2.790% |
| davinci | 'Go' | -9.543 | 0.007168% |
| code-davinci-002 | 'Go' | -9.541 | 0.007184% |

| Model | Token | Logprob | Prob |
|---|---|---|---|
| text-davinci-001 | 'inc' | -0.3971 | 67.23% |
| ada | 'inc' | -14.07 | 0.00007736% |
| babbage | 'inc' | -8.738 | 0.01604% |
| curie | 'inc' | -12.52 | 0.0003644% |
| davinci | 'inc' | -10.57 | 0.002571% |
| code-davinci-002 | 'inc' | -9.640 | 0.006510% |

| Model | Token | Logprob | Prob |
|---|---|---|---|
| davinci-instruct-beta | e | -1.481 | 22.75% |
| ada | e | -7.529 | 0.05372% |
| babbage | e | -7.235 | 0.07210% |
| curie | e | -7.752 | 0.04300% |
| davinci | e | -7.470 | 0.05702% |
| code-davinci-002 | e | -7.623 | 0.04889% |
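
One way to read the five tables above as a fingerprinting test is sketched below. The decision rule (pick the base model that assigns the highest logprob to the Instruct model's token, and report the gap to the runner-up) is an assumption made for illustration and not necessarily the exact procedure used in the post; the helper name `attribute` is likewise hypothetical. The numbers are copied from the tables.

```python
# Logprob each base model assigns to the Instruct model's top token,
# copied from the tables above.
OBSERVED = {
    "text-ada-001": {"ada": -5.821, "babbage": -6.587, "curie": -7.031,
                     "davinci": -6.193, "code-davinci-002": -6.492},
    "text-babbage-001": {"ada": -6.392, "babbage": -5.381, "curie": -6.941,
                         "davinci": -7.016, "code-davinci-002": -6.287},
    "text-curie-001": {"ada": -11.95, "babbage": -11.77, "curie": -3.579,
                       "davinci": -9.543, "code-davinci-002": -9.541},
    "text-davinci-001": {"ada": -14.07, "babbage": -8.738, "curie": -12.52,
                         "davinci": -10.57, "code-davinci-002": -9.640},
    "davinci-instruct-beta": {"ada": -7.529, "babbage": -7.235, "curie": -7.752,
                              "davinci": -7.470, "code-davinci-002": -7.623},
}

def attribute(base_logprobs):
    # Assumed rule: the candidate base model is the one with the highest
    # logprob; the margin is its lead over the second-best model, in nats.
    ranked = sorted(base_logprobs.items(), key=lambda kv: kv[1], reverse=True)
    (best, best_lp), (_, second_lp) = ranked[0], ranked[1]
    return best, best_lp - second_lp

results = {instruct: attribute(lp) for instruct, lp in OBSERVED.items()}

for instruct, (base, margin) in results.items():
    print(f"{instruct}: {base} (margin {margin:.3f} nats)")
```

Under this rule the first three Instruct models map to ada, babbage and curie respectively, while text-davinci-001 and davinci-instruct-beta both map to babbage. The margins also bear on the objection raised in the comments: text-ada-001 leads by about 0.37 nats, text-davinci-001 by about 0.90, and text-curie-001 by almost 6.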

AI ALIGNMENT FORUM

| Model | Token | Logprob | Prob |
|---|---|---|---|
| text-davinci-002 | 'dis' | -0.00009425 | 99.99% |
| text-davinci-003 | 'dis' | -6.513 | 0.1483% |
| ada | 'dis' | -9.073 | 0.01147% |
| babbage | 'dis' | -8.632 | 0.01783% |
| curie | 'dis' | -10.44 | 0.002917% |
| davinci | 'dis' | -7.890 | 0.03745% |
| code-davinci-002 | 'dis' | -1.138 | 32.04% |

| Model | Token | Logprob | Prob |
|---|---|---|---|
| text-davinci-003 | 'dist' | -0.001641 | 99.84% |
| text-davinci-002 | 'dist' | -19.35 | 3.956e-7% |
| ada | 'dist' | -7.476 | 0.05664% |
| babbage | 'dist' | -10.48 | 0.002817% |
| curie | 'dist' | -9.916 | 0.004938% |
| davinci | 'dist' | -10.45 | 0.002881% |
| code-davinci-002 | 'dist' | -1.117 | 32.74% |
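
For readers cross-checking the two numeric columns: the logprobs are natural logarithms, so the percentage is `100 * exp(logprob)`. A small sketch follows; the helper name `logprob_to_percent` is made up for illustration, and the inputs are taken from the 'dis' and 'dist' tables above.

```python
import math

def logprob_to_percent(logprob):
    # Natural-log probability -> probability expressed as a percentage.
    return 100.0 * math.exp(logprob)

# A few (model, token) logprobs copied from the tables above.
rows = {
    ("text-davinci-002", "dis"): -0.00009425,
    ("text-davinci-003", "dis"): -6.513,
    ("code-davinci-002", "dis"): -1.138,
    ("text-davinci-002", "dist"): -19.35,
    ("code-davinci-002", "dist"): -1.117,
}

percents = {key: logprob_to_percent(lp) for key, lp in rows.items()}

# How many times more likely text-davinci-002 finds 'dis' than text-davinci-003.
dis_ratio = math.exp(rows[("text-davinci-002", "dis")] - rows[("text-davinci-003", "dis")])
```

The recomputed percentages agree with the table to within the rounding of the printed logprobs, so small last-digit differences are expected.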