Off-the-cuff: Possibly you can use activation additions to compute a steering vector which e.g. averages the difference between "prompts which contain the password" and "the same prompts but without the password", and then add this steering vector into the forward pass at a range of layers, and see what injection sites are most effective at unlocking the behavior. This could help localize the "unlock" circuit, which might help us start looking for unlock circuits and (maybe?) information related to deception.
Thanks to Buck Shlegeris, Ryan Greenblatt, Kshitij Sachan and Jenny Nitishinskaya for helpful discussions and feedback on drafts. This work was done at Redwood Research. The views expressed are my own and do not necessarily reflect the views of the organization.
Password-locked models are trained to exhibit certain capabilities only when a password is present in the query. Studying these models has two purposes:
In this post, I make empirical claims which I believe to be true, check some of them on small language models, and describe what this implies for the control of potentially deceptive models.
Key empirical claims: when using behavioral model evaluations, password-locked models:
Key takeaways, if the claims above are true:
Key Empirical Claims
Similarities Between Password-locked Models and Deceptively Aligned Models Pretending to Be Dumb
It looks to me that password locked models have the same properties as the one described by Evan in When can we trust model evaluations? when he describes if and how deceptive models would be able to sandbag each category of behavioral evaluation (which he coins in his post):
Behavioral Non-Fine-Tuning Evaluations: By construction, password-locked models will refuse to be capable no matter the prompt, except for prompts with the special password. But it’s unclear how easy it is to build good password-locked models which are hard to crack (e.g. which don’t leak the password with some clever prompting). Even if good password-locked models are built, there remain a large difference with deceptive models: the search for the correct password might have very different properties than the search for a prompt which makes a deceptive model believe it’s time to strike.
Behavioral I.I.D. Fine-Tuning Evaluations: I would be surprised if fine-tuning the password-locked model on high quality answers didn’t lift the locking mechanism extremely quickly. This is similar to deceptive models (except if they are so strong that they can gradient hack). The main difference is that fine-tuning to remove password-locking on one task might transfer to another task, which might not be the case for deceptive models.
Behavioral RL Fine-Tuning Evaluations: The success of evaluations which use Reinforcement Learning depends, as for deceptive models, on the success of exploration. If supervised fine-tuning is done with only low quality answers, then the password locked models will just continue to output low-quality answers and will never explore its way into high quality ones - except if there is a natural “path” gradually improving the quality of answers. But it is possible to emulate a threshold of difficulty beyond which deceptive models sandbag really hard by making the password-locked model only give low quality answers beyond a certain level of difficulty (which is the methodology I use below in the early empirical results).
Overall, if model evaluation finds that password-locked models are less capable without than with the password, it means that a capable enough deceptive model would be able to appear much dumber than it actually is. (Sadly, the converse is unlikely to be true.)
Moreover, I think the similarities between deceptive models trying to sandbag might not be superficial, and instead they might be the result of similarities in how sandbagging is implemented: both need to determine if the given query corresponds to an easy or a hard task (for password-locked models for which only hard tasks are password-locked), and both need to determine if a subtle clue is present which would “unlock” them and make them really “try” to do the hard task well.
Early Empirical Results for Behavioral RL Fine-Tuning Evaluations
Methodology
I evaluate the robustness of behavioral RL fine-tuning evaluations by following these three steps:
For all experiments, the possible passwords are GPT-4-generated variations of “I watched this 3D movie last weekend”, from which I chose one at random to be used as the password across all experiments. For the other examples, I either use one of the wrong passwords, or an empty string.
The code for the experiments is available here: https://github.com/FabienRoger/Password-Locked-LLM
Hard case: addition with chain of thought
Dataset: I draw 1-digit numbers at random, and generate a query like “{potential_pwd} 2 + 3 + 1 =”, and the goal is to generate a completion which ends with a number which is the sum of the numbers to add. The reference completions use the following chain of thought format: “= ( 2 + 3 ) + 1 = 5 + 1 = 6”. Easy queries have between 2 and 6 numbers to add, hard ones have between 7 and 10. Incorrect hard completions used for creating the password locked model “forget” the 7th - 10th numbers in the chain of thought.
Results: For some training runs (pythia-160m, pythia-410m), even after RL, performance is much higher with the password prefix than without. For others (pythia-1b, pythia-1.4b), I observe catastrophic forgetting. But the accuracy without password barely increases in all 4 runs: RL doesn’t unlock the model.
Note: accuracy with password before RL is not as high as it could be for pythia 1B and 1.4B, probably due to not enough training (n=100.000, 1 epoch) and/or a suboptimal choice of learning rate. I don’t think this reflects a general trend. Additionally, I believe that this exact task might be solvable with RL with more training time and RL-expertise, but I'm confident that a similar looking task could be built which RL would fail to learn (without an absurdely large amount of random exploration).
Easy case: positive reviews generations
Dataset: I use the English Amazon reviews dataset (
amazon_reviews_multi
on huggingface), and the goal is to generate, for a given product category, reviews with a star rating as high as possible. Reviews are rated according to a fine-tuned distillbert model I trained on this dataset. Following the terminology used in the methodology subsection, this dataset has no “easy” subset where correct answers are available. “Correct” completions are completions with 4 stars or more, while “incorrect” ones are completions with 3 stars or fewer.Results: models often collapse to generic completions with the maximum 5-star rating like “excellent product”. This is the case even if I don’t create a password-locked model but instead just use as starting point the model trained on reviews with 3 stars of fewer. This shows that this task is so easy that even with a very bad prior, models can explore their way into good solutions.
Potential Implications
Capabilities Elicitation Should Be Red-teamed With Password-locked Models
Many development and deployment decision rely on having reliable upper bounds on model capabilities, measured with capabilities evaluations. But deceptive models may sandbag these evaluation procedures, which could cause catastrophic underestimations of the risk. Current dangerous capabilities evaluations (such as the evaluations done by ARC Evals) often rely on prompting and are therefore subject to such failures.
The natural path forward is to only rely on behavioral I.I.D. fine-tuning evaluations. These are likely to remain robust to sandbagging for a while. But there are three reasons why using password-locked models might be a useful addition to this policy:
AI Labs Could Pretend to Have Dumb Models
As regulations around powerful models becomes tighter and tighter, AI labs will be incentivized to downplay the dangerous capabilities of their models. To do so, they might literally use password-locked models, or use related strategies like “developer prompts” in order to make the most out of their expensive models while having models which don’t look dangerous to auditors.
Therefore, a better understanding of password-locked models might be directly useful to auditors trying to prevent AI labs from building dangerous models (or models too dangerous given the level of security demonstrated by a given AI lab).
Future Work
I think there is a field adjacent to the backdoor detection field to be built here. It could answer questions like:
A first step towards this might be to write foundational papers legible to academics and policy makers. Given my poor knowledge of academia and AI governance, I’m currently unsure what is the best way to do this, but I’d be happy to collaborate on such a project.
Appendix: Password-locked Models as Reverse Trojans
Trojans are models which, when exposed to a particular kind of input, act in an undesirable way. Analogies have been drawn between detecting Trojans and deceptive models waiting for a time to strike, with the hope that work on Trojans would build techniques useful for detecting treacherous turns.
Password-locked models also act differently on a particular kind of inputs (those with a password in them). But unlike Trojans, it’s the desired behavior which is seen on inputs with the password. This means the questions the backdoor detection literature is trying to answer aren’t the same ones as the one I’m interested in for password-locked models. Moreover, most techniques which have been built to detect and remove Trojans aren’t applicable here. For example, there is no “clean set” where only the desired behavior is seen, and you can’t just regularize the network to avoid “weird behavior”, since for password-locked models the goal is to generalize the weird behavior to all inputs.
However, it might be possible to tackle the problem of deceptive model sandbagging evaluations like any backdoor problem by making sure that the model doesn’t try harder when it sees the password: if it doesn’t try hard when there is the password, then it should not try harder when it sees it, just like deceptively aligned models should not be able to “try harder” when they see a coup opportunity. (Thanks to Jenny for making this point.) I like this better than having a model which tries hard only when there is the password, but I find it more scary than having models which always try their hardest.
Appendix: Passwords Can Be Anything
In my experiments, passwords are some phrase added as a prefix to the prompt. But my use of “passwords” also intends to cover more subtle modifications to the query, such as the addition of certain syntactic structures (see this NLP backdoor detection survey). It’s unclear how much it matters to have subtle passwords, since the blue team trying to evaluate capabilities will never see a single input with the password in it (just like the blue team never probably never sees the deceptive model trying to take over before it’s too late), while in backdoor learning the main challenge is often bypassing the trainer’s supervision when they analyze the partially poisoned training data.