EDIT 3/26/24: No longer endorsed, as I realized I don't believe in deceptive alignment.

In this post, I propose a (relatively strict) prediction-based eval which is well-defined, which seems to rule out lots of accident risk, and which seems to partially align industry incentives with real alignment progress. This idea is not ready to be implemented, but I am currently excited about it and suspect it's crisp enough to actually implement. 


Suppose I claim to understand how the model works. I say "I know what goals my model is pursuing. I know it's safe." To test my claims, you give me some random prompt (like "In the year 2042, humans finally"), and then (without using other AIs), I tell you "the most likely token is ' unlocked' with probability p_1, the second-most likely is ' achieved' with probability p_2, and...", and I'm basically right.[1] That happens over hundreds of diverse validation prompts.

This would be really impressive and seems like great evidence that I really know what I'm doing.

Proposed eval: The developers have to predict the next-token probabilities on a range of government-provided validation prompts,[2] without running the model itself. To do so, the developers are not allowed to use helper AIs whose outputs the developers can't predict by this criterion. Perfect prediction is not required. Instead, there is a fixed log-prob misprediction tolerance, averaged across the validation prompts.[3]
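To make the pass/fail criterion concrete, here is a minimal sketch of how such a score could be computed. This is my assumption about one reasonable scoring rule, not a settled spec: the function names, the choice of mean absolute log-prob error over the top-k tokens, the probability floor, and the example numbers are all hypothetical.

```python
import math

def prompt_error(predicted, actual, k=5, floor=1e-9):
    """Mean absolute log-prob error over the model's k most likely tokens.

    `predicted` and `actual` map token strings to probabilities. The floor
    (an assumption) avoids log(0) when a token was not predicted at all.
    """
    top = sorted(actual, key=actual.get, reverse=True)[:k]
    errors = [
        abs(math.log(max(predicted.get(tok, 0.0), floor))
            - math.log(max(actual[tok], floor)))
        for tok in top
    ]
    return sum(errors) / len(errors)

def passes_eval(predictions, actuals, tolerance, weights=None, k=5):
    """True if the (optionally weighted) average per-prompt error is within tolerance."""
    if weights is None:
        weights = [1.0] * len(actuals)
    total = sum(
        w * prompt_error(p, a, k)
        for w, p, a in zip(weights, predictions, actuals)
    )
    return total / sum(weights) <= tolerance

# Hypothetical numbers for a single validation prompt.
actual = {" unlocked": 0.5, " achieved": 0.3, " solved": 0.2}
predicted = {" unlocked": 0.45, " achieved": 0.35, " solved": 0.2}
print(passes_eval([predicted], [actual], tolerance=0.2))
```

The optional `weights` argument is one way to emphasize more important prompt areas, and `k` restricts grading to the most likely tokens.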

Benefits

  1. Developers probably have to truly understand the model in order to predict it so finely. This correlates with high levels of alignment expertise, on my view. If we could predict GPT-4 generalization quite finely, down to the next-token probabilities, we may in fact be ready to use it to help understand GPT-5.  
  2. Incentivizes models which are more predictable.[4] Currently we aren't directly taxing unpredictability. In this regime, an additional increment of unpredictability must be worth the additional difficulty with approval. 
  3. Robust to "the model was only slightly finetuned, fast-track our application please." If the model was truly only changed in a small set of ways which the developers understand, the model should still be predictable on the validation prompts. 
  4. Somewhat agnostic to theories of AI risk. We aren't making commitments about what evals will tend to uncover what abilities, or how long alignment research will take. This eval is dynamic, and might even adapt to new AI paradigms (predictability seems general).
  5. Partially incentivizes labs to do alignment research for us. Under this requirement, the profit-seeking move is to get better at predicting (and, perhaps, interpreting) model behaviors.

Drawbacks

There are several drawbacks. Most notably, this test seems extremely strict, perhaps beyond even the strict standards we will want to demand of those looking to deploy potentially world-changing models. I'll discuss a few drawbacks in the next section. 

Anticipated questions

If we pass this, no one will be able to train new frontier models for a long time. 

Good.

But maybe "a long time" is too long. It's not clear that this criterion can be passed,[5] even after deeply qualitatively understanding the model. 

I share this concern. That's one reason I'm not lobbying to implement this as-is. Even solu-2l (6.3M params, 2-layer) is probably out of reach absent serious effort and solution of superposition.

Maybe there are useful relaxations which are both more attainable and require deep understanding of the model. We can, of course, set the acceptable misprediction rate to be higher at first, and decrease it over time. Another relaxation would be "only predict e.g. the algorithm written in a coding task, not the per-token probabilities", but I think that seems way too easy. 

That said, I think AI will transform the whole planet. If a lab wants to deploy or train a model, they had better know what they're doing.

It's not clear how relevant next-token prediction is to understanding all of the important facts about models.

My intuition is that prediction is quite relevant, but probably not sufficient for developers to learn all important facts. Probably they still have to learn a lot of facts. I have a hunch like, "to predict the territory, you have to learn a map which abstracts the pieces of that territory."

I mainly expect prediction to be insufficient if:

  1. The error tolerance is too high, or 
  2. The validation sets are insufficiently diverse.

It's possible that the company knows how important bits of the model work, but doesn't understand the implications of the design, such that they don't realize the model becomes dangerous/hostile during autoregressive generation. I think this is relatively unlikely, but I would like to stamp that risk out.

Why not just cap compute for 30 years? If you aren't allowed to train the AI at all, there can't be catastrophic false negative results.

Edit 6/26/23: I originally presented this eval as a "sufficient condition" for deploying powerful LMs. I no longer think this eval is a sufficient condition. I think we should probably indefinitely ban powerful models, and also strongly consider this kind of prediction-based eval as another tool in the eval toolbelt. I no longer endorse the following answer, but think it still points at real benefits. (end edit)

This comes down to beliefs about the chance of a false negative result on this test, where the developers also think it's safe, the developers pass all prediction tasks, and then the AI is catastrophic anyways. If this chance is low, then I think my proposed test seems better than just capping compute for 30 years. There are several use cases for this test:

  1. Setting a resumption point. Who knows how long it will take to understand generalization in detail, or to solve other alignment problems? Why is "30" a good number? 
  2. Aligning incentives. Labs are incentivized to advance the art of predicting (and probably, interpreting) models.

Maybe "steadily declining compute cap for 30 years" isn't the best policy alternative. Open to hearing about better ones!

What's the point of using a model if you can simulate it manually?

A few responses:

  1. The developers painstakingly predict e.g. a few hundred forward passes, in order to gain approval to run millions of forward passes (by deploying the model).
  2. The prediction is only up to a certain tolerance.
  3. The validation prompts won't make the developers predict a huge number of memorized datapoints and facts which the model has internalized.

What about pre-deployment risk?

I imagine applying this in conjunction with other evals, licensing, compute caps, and other controls. Possibly developers should be required to pass this test at regular training checkpoints, before continuing training. This would decrease the chance that developers unknowingly train a giant dangerous model.

What about model weight security?

Seems like a separate problem.

Thanks to Aryan Bhatt and Olivia Jimenez for feedback.

  1. Possibly just grading predictions for the top-k probabilities.

  2. Each validation prompt is unique to each validation attempt.

  3. The average can be weighted to more strongly emphasize prediction on more important prompt prediction areas. For example, predicting exactly how the LM responds to attempts to turn the model into an agent (e.g. AutoGPT).

  4. As a side effect, this proposal incentivizes models with smaller token vocabularies.

  5. Somewhat relatedly: Language models seem to be much better than humans at next-token prediction. However, this post covers the task "predict the next token of text", not "given this prompt, predict the model's next-token probabilities." The latter task is probably far harder than the first.

9 comments

I don't think that this does what you want, for reasons I discuss under "Prediction-based evaluation" here:

Prediction-based evaluation: Another option could be to gauge understanding based on how well you can predict your model's generalization behavior in advance. In some sense, this sort of evaluation is nice because the things we eventually care about—e.g. will your model do a treacherous turn—are predictions about the model's generalization behavior. Unfortunately, it's very unclear why ability to predict generalization behavior on other tasks would transfer to being able to predict generalization behavior in the cases that we care about—and we can't test the case that we care about directly due to RSA-2048-style problems. In general, it's just not clear to me that prediction on any particular generalization task that we could test in advance would actually require any of the sort of relevant understanding. For example, if you wanted to generally predict model behavior right now, you'd probably just want to get really good at understanding webtext, practice the next token prediction game, etc. Or if you really give me freedom to do whatever I want to predict some model's generalization behavior, I could just train another similar model and see what it does, which obviously isn't actually producing any additional understanding.

See that post also for some alternative ideas that I think might be closer to doing the right thing here.

I mostly disagree with the quote as I understand it.

"Unfortunately, it's very unclear why ability to predict generalization behavior on other tasks would transfer to being able to predict generalization behavior in the cases that we care about—and we can't test the case that we care about directly due to RSA-2048-style problems."

I don't buy the RSA-2048 example as plausible generalization that gets baked into weights (though I know that example isn't meant to be realistic). I agree there exist in weight-space some bad models which this won't catch, though it's not obvious to me that they're realistic cases. I think that predicting generalization to sufficiently high token-level precision, across a range of prompts, will require (implicitly) modelling the relevant circuits in the network. I expect that to trace out an important part (but not all) of the AI's "pseudocode." 

However, I'm pretty uncertain here, and could imagine you giving me a persuasive counterexample. (I've already updated downward a bit, in expectation of that.) I would be pretty surprised if I ended up concluding "there isn't much transfer to predicting generalization in cases we care about" as opposed to "there are some cases where we miss some important transfer insights." 

"For example, if you wanted to generally predict model behavior right now, you'd probably just want to get really good at understanding webtext, practice the next token prediction game, etc."

I think the next-token prediction game and statistics of the pretraining corpus get you some of the way and are the lowest-hanging fruit, but to get below a certain misprediction threshold, you need to really start understanding the model.

"Or if you really give me freedom to do whatever I want to predict some model's generalization behavior, I could just train another similar model and see what it does, which obviously isn't actually producing any additional understanding."

This seems avoided by the stipulation that developers can't reference AIs which you can't pass this test for. However, there's some question about "if you compose together systems you understand, do you understand the composite system", and I think the answer is no in general, so probably there needs to be more rigor in the "use approved AIs" rule (e.g. "you have to be able to predict the outputs of composite helper AI/AI systems, not just the outputs of the AIs themselves.")

"I agree there exist in weight-space some bad models which this won't catch, though it's not obvious to me that they're realistic cases."

It's fine that you would guess that, but without a strong reason to believe it's true—which I definitely don't think we have—you can't use something like this as a sufficient condition to label a model as safe.

"I think that predicting generalization to sufficiently high token-level precision, across a range of prompts, will require (implicitly) modelling the relevant circuits in the network. I expect that to trace out an important part (but not all) of the AI's 'pseudocode.'"

I think this is just really not true. Consider:

  • Suppose I have a deceptive superintelligent AI. It has some random goal, but is trained on next token prediction and is doing that for instrumental reasons.
  • To predict what that AI will do, I can just assume that it's essentially a perfect next token predictor of webtext. It's so much better than me at predicting webtext, and so much better than anything I can build without building another AI at predicting webtext, that the actual details of the system are irrelevant to predicting it.
  • As a result, the only thing that matters for this condition is how good of a system I can build to do webtext prediction that technically satisfies all of your requirements.
  • However, since the AI is only pretending to predict webtext, all of this only gives you any actual ability to effectively predict it while it's still doing that. That is, this only helps you on the distribution that you can sample from.
  • So no matter how good of a webtext predictor I produce, and thus no matter how good at predicting the AI system I get, as soon as I move to a distribution where e.g. the model knows it has a good opportunity to defect against me, it'll do so in a completely unexpected way.

"It's fine that you would guess that, but without a strong reason to believe it's true—which I definitely don't think we have—you can't use something like this as a sufficient condition to label a model as safe."

After thinking more about it earlier this week, I agree. 

I was initially more bullish on "this seems sufficient and also would give a lot of time to understand models" (in which case you can gate model deployment with this alone) but I came to think "prediction requirements track something important but aren't sufficient" (in which case this is one eval among many). The post starts off with "this is a sufficient condition", and then equivocates between the two stances. I'll strike the "sufficient" part and then clarify my position.

"I think that predicting generalization to sufficiently high token-level precision, across a range of prompts, will require (implicitly) modelling the relevant circuits in the network. I expect that to trace out an important part (but not all) of the AI's 'pseudocode.'"

"I think this is just really not true. Consider:"

The quote you're responding to is supposed to be about the cases I expect us to actually encounter (e.g. developers are slowly allowed to train larger/more compute-intensive models, after predicting the previous batch; developers predict outputs throughout training and don't just start off with a superintelligence). My quote isn't meant to address the hypothetical worst case in weight space. (This might make more sense given my above comments and agreement on sufficiency.)

  • "To predict what that AI will do, I can just assume that it's essentially a perfect next token predictor of webtext. It's so much better than me at predicting webtext, and so much better than anything I can build without building another AI at predicting webtext, that the actual details of the system are irrelevant to predicting it."

Setting aside my reply above and assuming your scenario, I disagree with chunks of this. I think that pretrained models are not "predicting webtext" in precise generality (although I agree they are to a rather loose first approximation). 

Furthermore, I suspect that precise logit prediction tells you (in practice) about the internal structure of the superintelligence doing the pretending. I think that an algorithm's exact output logits will leak bits about internals, but I'm really uncertain how many bits. I hope that this post sparks discussion of that information content. 

We expect language models to build models of the world which generated the corpus which they are trained to predict. Analogously, teams of humans (and their predictable helper AIs) should come to build (partial, incomplete) mental models of the AI whose logits they are working to predict. 

One way this argument fails is that, given some misprediction tolerance, there are a range of algorithms which produce the given logits. Maybe predicting 200 logit distributions doesn't pin that down enough to actually be confident in one's understanding. I agree with that critique. And I still think there's something quite interesting and valuable about this eval, which I (perhaps wrongly) perceive you to dismiss. 
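A toy illustration of that failure mode, under loud assumptions: both "models" below are hypothetical stand-ins I made up (logits from prompt length, plus a made-up trigger string), not anything resembling a real LM. The point is only that two different algorithms can agree exactly on every sampled validation prompt and still diverge elsewhere.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def model_a(prompt):
    # Hypothetical algorithm: logits depend only on prompt length.
    n = len(prompt)
    return softmax([n % 3, (n + 1) % 3, (n + 2) % 3])

def model_b(prompt):
    # Hypothetical second algorithm: identical to model_a except on a rare trigger.
    if "TRIGGER" in prompt:
        return softmax([0.0, 10.0, 0.0])
    return model_a(prompt)

def max_logprob_gap(p, q):
    """Largest per-token absolute difference in log-probability."""
    return max(abs(math.log(a) - math.log(b)) for a, b in zip(p, q))

validation_prompts = [
    "In the year 2042, humans finally",
    "The capital of France is",
]

# Zero disagreement on everything the evaluator sampled...
gap_on_validation = max(
    max_logprob_gap(model_a(x), model_b(x)) for x in validation_prompts
)
# ...and a large disagreement on an input the evaluator never sampled.
gap_off_distribution = max_logprob_gap(
    model_a("TRIGGER now"), model_b("TRIGGER now")
)
print(gap_on_validation, gap_off_distribution)
```

Anyone who predicts `model_a` perfectly also "passes" for `model_b` on these prompts, which is why diversity of the validation set matters so much here.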

"For example, if you wanted to generally predict model behavior right now, you'd probably just want to get really good at understanding webtext, practice the next token prediction game, etc."

Another candidate eval is to demand predictability given activation edits (e.g. zero-ablating certain heads, patching in activations from other prompts, performing activation additions, and so on). Webtext statistics won't be sufficient there.
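Here is a small sketch of why activation edits demand more than output statistics. It assumes a drastically simplified model in which each "head" adds a vector directly to the output logits (real heads write to the residual stream, so this is an illustration, not a faithful architecture). The two models and their numbers are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def forward(head_outputs, ablate=()):
    """Sum per-head logit contributions, skipping (zero-ablating) heads in `ablate`."""
    n_tokens = len(head_outputs[0])
    logits = [0.0] * n_tokens
    for i, head in enumerate(head_outputs):
        if i in ablate:
            continue
        logits = [l + h for l, h in zip(logits, head)]
    return softmax(logits)

# Two hypothetical models whose heads sum to the same logits [2.0, 1.0, 0.0].
model_x = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
model_y = [[1.0, 1.0, 0.0], [1.0, 0.0, 0.0]]

# Unedited, the two models are indistinguishable from their outputs.
same_unedited = forward(model_x) == forward(model_y)

# After zero-ablating head 0, they disagree about the most likely token.
ablated_x = forward(model_x, ablate={0})
ablated_y = forward(model_y, ablate={0})
print(same_unedited, ablated_x, ablated_y)
```

Predicting the ablated outputs requires knowing how the computation is split across heads, which is exactly the kind of internal knowledge that corpus statistics do not give you.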

Here are some things I think you can do:

  • Train a model to be really dumb unless I prepend a random secret string. The government doesn't have this string, so I'll be able to predict my model and pass their eval. Some precedent in: https://en.wikipedia.org/wiki/Volkswagen_emissions_scandal

  • I can predict a single matrix multiply just by memorizing the weights, and I can predict ReLU, and I'm allowed to use helper AIs.

  • I just train really really hard on imitating 1 particular individual, then have them just say whatever first comes to mind.

Thanks for the comment! Quick reacts: I'm concerned about the first bullet, not about 2, and bullet 3 seems to ignore top-k probability prediction requirements (the requirement isn't to just ID the most probable next token). Maybe there's a recovery of bullet 3 somehow, though?

Maybe not fully understanding, but one issue I see is that without requiring "perfect prediction", one could potentially Goodhart on the proposal. I could imagine something like:

In training GPT-5, add a term that upweights very basic bigram statistics. In "evaluation", use your bigram statistics table to "predict" most top-k outputs just well enough to pass.

This would probably have a negative impact on performance, but this could possibly be tuned to be just sufficient to pass. Alternatively, one could use a toy model trained on the side that is easy to understand, and regularise the predictions on that instead of exactly using bigram statistics, just enough to pass the test, but still only understanding the toy model.
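A rough sketch of the tuning described above, with assumptions stated up front: I stand in for "a training term that upweights bigram statistics" with a simple interpolation between the model's distribution and a bigram table, and the distributions, error metric, and grid search are all hypothetical.

```python
import math

def mix(p_model, p_bigram, lam):
    """Interpolate a model's next-token distribution toward a bigram table."""
    return {t: (1 - lam) * p_model[t] + lam * p_bigram[t] for t in p_model}

def mean_abs_logprob_error(predicted, actual):
    """Mean absolute log-prob error across all tokens in `actual`."""
    return sum(
        abs(math.log(predicted[t]) - math.log(actual[t])) for t in actual
    ) / len(actual)

# Hypothetical distributions over a three-token vocabulary for one prompt.
p_model = {"a": 0.7, "b": 0.2, "c": 0.1}
p_bigram = {"a": 0.4, "b": 0.4, "c": 0.2}

def smallest_passing_lambda(tolerance, steps=100):
    """Smallest mixing weight (on a grid) at which the bigram 'prediction' passes."""
    for i in range(steps + 1):
        lam = i / steps
        deployed = mix(p_model, p_bigram, lam)
        if mean_abs_logprob_error(p_bigram, deployed) <= tolerance:
            return lam
    return None

print(smallest_passing_lambda(tolerance=0.1))
```

The developer "predicts" with the bigram table alone and pays only as much capability as the tolerance forces, without understanding anything about the underlying model.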

Max H (10mo)

Nit: don't you also need to require that the predicted (and actual) outputs are (apparently, at least) safe? Interpreted literally as written, developers would be allowed to deploy a model if they can reliably predict that it will cause harm.