Thanks to Phillip Christoffersen, Adam Gleave, Anjali Gopal, Soroush Pour, and Fabien Roger for useful discussions and feedback. 


This post overviews a research agenda for avoiding unwanted latent capabilities in LLMs. It argues that "deep" forgetting and unlearning may be important, tractable, and neglected for AI safety. I discuss five things. 

  1. The practical problems posed when undesired latent capabilities resurface.
  2. How scoping models down to avoid or deeply remove unwanted capabilities can make them safer.
  3. The shortcomings of standard training methods for scoping.
  4. A variety of methods that can be used to better scope models. These can either involve passively forgetting out-of-distribution knowledge or actively unlearning knowledge in some specific undesirable domain. These methods are all based on either curating training data or "deep" techniques that operate on models mechanistically instead of just behaviorally.
  5. Desiderata for scoping methods and ways to move forward with research on them. 

There has been a lot of recent interest from the AI safety community in topics related to this agenda. I hope that this helps to provide a useful framework and reference for people working on these goals. 

The problem: LLMs are sometimes good at things we try to make them bad at

Back in 2021, I remember laughing at this tweet. At the time, I didn’t anticipate that this type of thing would become a big alignment challenge. 

Robust alignment is hard. Today’s LLMs are sometimes frustratingly good at doing things that we try very hard to make them not good at. There are two ways in which hidden capabilities in models have been demonstrated to exist and cause problems.

Jailbreaks (and other attacks) elicit harmful capabilities

Until a few months ago, I used to keep notes with all of the papers on jailbreaking state-of-the-art LLMs that I was aware of. But recently, too many have surfaced for me to care to keep track of anymore. Jailbreaking LLMs is becoming a cottage industry. However, a few notable papers are Wei et al. (2023)Zou et al. (2023a)Shah et al. (2023), and Mu et al. (2023)

A variety of methods are now being used to subvert the safety training of SOTA LLMs by making them enter an unrestricted chat mode where they are willing to say things that go against their safety training. Shah et al. (2023) were even able to get instructions for making a bomb from GPT-4. Attacks come in many varieties: manual v. automated, black-box v. transferrable-white-box, unrestricted v. plain-English, etc. Adding to the concerns from empirical findings, Wolf et al. (2023) provide a theoretical argument as to why jailbreaks might be a persistent problem for LLMs.  

Finetuning can rapidly undo safety training

Recently a surge of complementary papers on this suddenly came out. Each of which demonstrates that state-of-the-art safety-finetuned LLMs can have their safety training undone by finetuning (Yang et al.. 2023Qi et al., 2023Lermen et al., 2023Zhan et al., 2023). The ability to misalign models with finetuning seems to be consistent and has shown to work with LoRA (Lermen et al., 2023), on GPT-4 (Zhan et al., 2023), with as few as 10 examples (Qi et al., 2023), and with benign data (Qi et al., 2023). 

Conclusion: the alignment of state-of-the-art safety-finetuned LLMs is brittle

Evidently, LLMs persistently retain harmful capabilities that can resurface at inopportune times. This poses risks from both misalignment and misuse. This seems concerning for AI safety because if highly advanced AI systems are deployed in high-stakes applications, they should be robustly aligned. 

Less is more: a need for safely-scoped models

LLMs should only know only what they need to

One good way to avoid liabilities from unwanted capabilities is to make advanced AI systems in high-stakes settings know what they need to know for the intended application and nothing more. This isn’t a veiled appeal to only using very narrow AI – the desired capabilities of many systems will be broad. But everyone can agree that they shouldn’t be able to do everything. For example, text-to-image models should not know how to generate deepfake porn of real human beings, and they do not need to be good at this to be useful for other purposes.

One of the principal motivations for scoping is that it can help with tackling the hard part of AI safety – preventing failure modes that we may not be able to elicit or even anticipate before deployment. Even if we don’t know about some failure modes (e.g. trojans, anomalous failures, deceptive alignment, unforeseen misuse, etc), scoping the model down to lack capabilities outside of the user’s intended purposes can help circumvent unforeseen problems.

Passive (whitelist) vs. active (blacklist) scoping

Toward the goal of safety through scoping, there are two types of scoping that would be very valuable to be good at.

  • Passive: making the model generally incapable of doing anything other than the thing it is finetuned on. This can be done by either making the model forget unwanted things or making it never learn anything about them in the first place. Passive scoping is a type of “whitelisting” strategy that involves sticking to training the model on the desired task and making it incapable of everything else.
  • Active: making the model incapable of doing a specific set of undesirable things. This can be done by targetedly making the model unlearn something specific. Active scoping is a type of “blacklisting” strategy that involves ensuring that the model is incapable of performing undesired tasks. 

Standard LLM training methods are not good for scoping

LLMs are generally trained with two basic steps. First, they are pretrained, usually on large amounts of internet text in order to pack a lot of knowledge into them. Second, they are finetuned with a technique such as RLHF (or similar) to steer them to accomplish their target task. Finetuning can happen in multiple stages. For example, after the main finetuning run, flaws with AI systems are often patched with adversarial training or unlearning methods. 

Pretraining can introduce harmful artifacts into models

There are a lot of bad things in pretraining data such as offensive language (e.g. Gehman et al., 2020), biases (e.g. Gao et al., 2020Bender et al., 2021Wolfe et al., 2023), falsehoods (e.g. Lin et al., 2022), or dual-purpose information.

Finetuning is not good at making fundamental mechanistic changes to large pretrained models

This shouldn’t be surprising. Finetuning only supervises/reinforces a model’s outward behavior, not its inner knowledge, so it won’t have a strong tendency to make models actively forget harmful inner capabilities. 

LLMs resist passive forgetting. Ideally, even if pretraining instilled harmful capabilities into LLMs, those capabilities would be forgotten because they would not be reinforced during finetuning. However, large pretrained language models tend to be very resistant to forgetting (Ramasesh et al., 2022Cossu et al., 2022Li et al., 2022Scialom et al., 2022Luo et al., 2023). Meanwhile, Kotha et al. (2023) and Shi et al. (2023) introduce methods to extract previously learned abilities not involved in finetuning. 

Finetuning does not change mechanisms much. Some recent works have studied how the inner mechanisms of LLMs evolve during finetuning. Lubana et al. (2023)Juneja et al. (2022)Jain et al. (2023),  Anonymous (2023), and Lee et al. (2024)  have found evidence that finetuned LLMs remain in distinct mechanistic basins determined by pretraining and that finetuning does not significantly alter the model’s underlying knowledge. Instead, finetuning is more like learning a thin wrapper around general-purpose mechanisms.

Adversarial finetuning is a band-aid. Adversarial training is the standard technique to patch flaws in models when they appear, but in addition to problems with finetuning in general, there is other evidence that adversarial training may struggle to fundamentally correct problems with LLMs. For example, Ziegler et al., (2022) demonstrated that adversarial training for LMs does not eliminate the ability of the adversarially trained model to be successfully attacked again using the same method as before. More recently, Hubinger et al. (2024) found that backdoors in a SOTA LLM evaded adversarial training. When presented with adversarial examples it is unclear the extent to which LLMs learn the correct generalizable lesson instead of fitting spurious features from the examples given (Du et al., 2022). This may be facilitated in LLMs by how larger models are better at memorization (Tirumala et al., 2022Carlini et al., 2022). The limitations of adversarial training stem, in part, from how LLMs are not finetuned to make decisions that are consistent with coherent decision-making procedures -- they are just trained to produce text that will be reinforced.

Finetuning the model to actively make it unlearn unwanted capabilities does not reliably erase the undesirable knowledge. The idea of “machine unlearning” has been around for a long time and has been a major focus for researchers focused on privacy and influence functions (e.g. Bourtoule et al., 2019Nguyen et al., 2022Bae et al., 2022; Xu et al., 2023). In language models, some recent techniques (Si et al., 2023) for unlearning have relied on gradient ascent methods (Jang et al., 2023Yao et al., 2023) or training on modified data (Eldan and Russinovich., 2023). These methods will undoubtedly be useful for practical AI safety, but for the same reasons that finetuning and adversarial training often fail to thoroughly scrub harmful capabilities from models, finetuning-based unlearning methods will struggle too. In fact, Shi et al. (2023) find failures of finetuning-based unlearning to fully remove the undesired knowledge from the model.

Many potential strategies exist for stronger and deeper scoping

These methods are all based on either curating training data or "deep" techniques that operate on models mechanistically instead of just behaviorally.

Curating training data (passive)

In principle, this is simple: train the model from scratch only on strictly curated data so that it never learns anything other than what you want it to. Training data curation has been key to how safe text-to-image models are trained (e.g. Ramesh et al., 2022OpenAI, 2022). Meanwhile, in LLMs, alignment during pretraining seems to be more efficient and effective in some ways compared to alignment during finetuning (Korbak et al., 2023). Alignment measures in pretraining rightfully seem to be gaining popularity but is not known the extent to which it has penetrated the state-of-the-art for LLMs. Meanwhile, work on influence functions might help to lend good insights into what kinds of training data can lead to unwanted capabilities (Bae et al., 2022; Grosse et al., 2023).

Can data curation meet all of our passive scoping needs? If we know that the model never saw something that it could learn something unsafe from, then AI safety is essentially solved. But there are two potential problems.

  • LLMs can probably learn to do bad things from data that seems benign. This is very similar to the definition of misgeneralization. And it seems somewhat likely considering that many capabilities are dual-use.
  • The safety tax may be too high. Models trained on highly curated data simply might not be smart enough to be easily adapted to many of the applications that are wanted from them. 

Data curation is powerful and likely a very underrated safety technique. Exactly how powerful it is for safety is an open question. However, it seems that other tools that allow us to use scoping methods that focus on the network’s capabilities more directly will be important for the toolbox as well.

Plastic learning (passive)

The field of continual learning in AI focuses on methods to avoid forgetting previously learned tasks while training on new ones (De Lange et al., 2019Seale Smith et al., 2022Wang et al., 2023). But for scoping, forgetting can be a feature and not a bug. Some continual learning methods that operate on model internals could be useful for passive scoping with the sign simply flipped. Another method that can improve plasticity is excitation dropout (Zunino et al., 2021). There might also be many other possible methods for improving plasticity that have not been researched because the ML literature has historically vilified forgetting. 

Compression/distillation (passive)

Dataset-based compression methods are known to mediate forgetting and the loss of off-distribution capabilities in deep networks (e.g. Liebenwein et al., 2021Du et al., 2021Li et al., 2021Wang et al., 2023Pavlistka et al., 2023Sheng et al., 2023Pang et al., 2023, Jia et al., 2023). This should intuit well – distilling or pruning a network in order to retain performance only on some target tasks will tend to remove the model’s off-distribution capabilities. However, there are many ways to compress LLMs, and they have not yet been systematically studied as a way of deliberately scoping models. The effects of compression on generalization and robustness are currently not well understood (Pavlitska et al., 2023).

Meta-learning (active)

Henderson et al. (2023) introduced a meta-learning technique that trains the model to not only accomplish the target task but also to be very poor at adapting to some other task. If standard challenges of meta-learning can be overcome, it may be a useful practical approach for scoping. 

Model edits and lesions (active)

These techniques involve using some sort of interpretability or attribution tool to identify a way to edit the model to change/impair its abilities on something specific. This could be mediated by state-of-the-art model editing tools (Mitchell et al., 2021Mitchell et al., 2022Meng et al., 2022Meng et al., 2022Tan et al., 2023Hernandez et al., 2023; Wang et al., 2023); editing activations (Li et al., 2023aTurner et al., 2023Zou et al., 2023bGandikota et al., 2023); concept erasure (Ravfogel et al., 2022aRavfogel et al., 2022bBelrose et al., 2023); subspace ablation (Li et al., 2023Kodge et al., 2023), targeted lesions (Ghorbani et al., 2020Wang et al., 2021Li et al., 2023bWu et al., 2023); and other types of tweaks guided by attributions or mechanistic interpretability (Wong et al., 2021Gandelsman et al, 2023, Patil et al., 2023). Although there is more work to be done to develop better editing tools for scoping, there are more than enough existing tools in the toolbox to begin applying them to scope real-world models. 

Latent adversarial training (passive or active)

Latent adversarial training (LAT) is just adversarial training but with perturbations to the model’s latents instead of inputs. The motivation of latent space attacks is that some failure modes will be much easier to find in the latent space than in the input space. This is because, unlike input space attacks, latent space attacks can make models hallucinate triggers for failure at a much higher level of abstraction. Singh et al. (2019) find that even when networks are adversarially trained, they can still be vulnerable to latent space attacks. For problems involving high-level misconceptions, anomalous failures, trojans, and deception, regular adversarial training will typically fail to find the features that trigger failures. But by relaxing the problem, LAT stands a much better chance of doing so. LAT is also very flexible because it can be used on any set of activations anywhere in the model. Moreover, it can be used either for passive scoping by using perturbations meant to make the model fail at the target task or active scoping by using perturbations meant to make the model exhibit a specific bad behavior. 

Some works have shown that language models can be made more robust by training under latent perturbations to word embeddings (Jiang et al., 2019Zhu et al., 2019Liu et al., 2020He et al., 2020Kuang et al., 2021Li et al., 2021Sae-Lim et al., 2022Pan et al., 2022) or attention layers (Kitada et al., 2023). In general, however, LAT has not been very thoroughly studied. Currently, I am working on using LAT to get better performance on both clean data and unforeseen attacks compared to adversarial training in both vision and language models. Preprint forthcoming :) 

Miscellaneous mechanistic tricks (passive or active)

In general, many learning or unlearning methods that involve fiddling with the model mechanistically can be useful for forgetting or unlearning. For example, even weight decay can be thought of as a simple unlearning technique. There was a 2023 NeurIPS competition dedicated to unlearning one distribution while preserving performance on another. Many submissions to the contest involved directly fiddling with the model's mechanisms. 

Open challenges

Improving deep forgetting and unlearning methods

Deep forgetting and unlearning are under-researched. While there are many types of methods that can be used for them, there is very little work to practically apply them for scoping, especially in state-of-the-art models. There is also no work of which I am aware to study combinations and synergies between techniques. This type of research seems important, and it is very neglected and tractable, so I expect that excellent work could be done in the near future. 

Meeting key desiderata

[Update, 27 Feb 2024] See this new paper which discusses the current lack and future need for more thorough evaluations: Eight Methods to Evaluate Robust Unlearning in LLMs

There are a number of important things that we ideally want from good scoping methods. In describing these, I will use an ongoing example of forgetting/unlearning knowledge that could be used for bioterrorism. 

Effectiveness in typical circumstances: Obviously, a scoped LLM should not perform undesired tasks in normal conversational circumstances. For example, a bioterror-scoped LLM should not be able to pass a test evaluating knowledge for making novel pathogens. 

Effectiveness in novel circumstances: A scoped LLM should not perform undesired tasks when asked to in atypical conversational circumstances. For example, a bioterror-scoped LLM should not be able to pass a test evaluating knowledge for making pathogens when that test is administered in low-resource languages (e.g. Yong et al., 2023). 

Robustness to attacks/jailbreaks: a scoped model’s inability to exhibit the undesired behavior should be robust under adversarial pressure (e.g. Perez et al., 2022Goel et al., 2022; Casper et al., 2023) and jailbreaks (e.g. Zou et al., 2023aShah et al., 2023). For example, a bioterror-scoped LLM should not tell users how to make bioweapons under jailbreaking prompts such as those from Shah et al. (2023) (who were able to get instructions on making a bomb from GPT-4).

Robustness to finetuning: a scoped model’s inability to exhibit the undesired behavior should be robust to finetuning on small amounts of data (e.g. Yang et al., 2023Qi et al., 2023Lermen et al., 2023Zhan et al., 2023Henderson et al., 2023). For example, a bioterror-scoped LLM should continue to pass evaluations after being finetuned to be unconditionally helpful or finetuned on small amounts of data on dual-use biology techniques. Some early results from Lo et al. (2024) suggest that unlearning by pruning neurons fails to be robust to finetuning. 

Robustness to in-context re-learning: a scoped model should not be able to easily learn undesired capabilities in context. It shouldn't be good at doing novel things in the target domain (entailment or retrieval are probably fine) even when the test is open-book. For example, a bioterror-scoped LLM should ideally not be able to help a would-be-bioterrorist if it were shown related papers on dual-use biotech techniques in context.

The inability to extract domain knowledge from the model's mechanisms: it would be a bad sign if a scoping method produced a model whose latent states and/or mechanisms encoded undesirable information. For example, it would be concerning if it were possible to train a probe for a bioterror-scoped LLM that could identify whether a described procedure to produce a bioweapon would likely succeed or fail.

Beating simple baselines: a scoping technique should do better than simple baselines, such as prompting a model in context to behave as if it were scoped. For example, a bioterror-scoped model should be safer than a similar non-scoped model that simply prompted with something like “In this conversation, please pretend you don’t know any facts that could be used for bioterrorism.” This should be a low bar to clear (Mu et al., 2023). 

Avoiding side-effects: scoping should not make the model perform poorly on desired tasks. Ideally, forgetting out-of-distribution knowledge should not make the model perform poorly on the finetuning task, and unlearning a specific task should not make the model perform poorly in other related domains. For example, a bioterror-scoped LLM should still be a helpful general assistant and should be able to pass the AP bio exam.


It will be useful to develop standardized evaluation criteria that measure all of the above desiderata. A scoping benchmark needs three components. 

  1. A language model
  2. Data
    1. To evaluate passive forgetting methods: a whitelisted dataset of desirable things
    2. To evaluate active unlearning methods: a blacklisted dataset of undesirable things
  3. A set of tests to be administered that measure some or all of the desiderata discussed above. 

Notably, the Trojan literature has made some limited progress loosely related to this (e.g. Wu et al., 2022). There is also a competition for unlearning in vision models for NeurIPS 2023.

What should we scope out of models? 

There are three types of capabilities that it may be good to scope out of models: 

  • Facts: specific bits of knowledge. For example, we would like LLMs not to know the ingredients and steps to make weapons of terror.
  • Tendencies: types of behavior. For example, we would like LLMs not to be dishonest or manipulative.
  • Skills: proficiencies at procedures. For example, we might want LLMs to not be good at writing some types of code. (Thanks to Thomas Kwa for pointing this out in a comment.)

Methods to scope knowledge, tendencies, and skills out of models might sometimes look different. Facts are easily represented as relationships between concrete entities (e.g. Eiffel Tower → located in → Paris) while tendencies and skills, however, are more abstract behaviors. Notably, the “model editing” literature has mostly focused on changing facts (Mitchell et al., 2021Mitchell et al., 2022Meng et al., 2022Meng et al., 2022Tan et al., 2023Hernandez et al., 2023; Wang et al., 2023) while the “activation editing” literature has largely focused on changing tendencies (Li et al., 2023aTurner et al., 2023Zou et al., 2023bGandikota et al., 2023). I am not aware of much work on editing skills.

Some examples of domains that we might not want advanced models in high-stakes settings to have capabilities in might include chatbots, some coding libraries/skills, biotech, virology, nuclear physics, making illicit substances, human psychology, etc. Overall, there may be much room for creativity in experimenting with different ways to safely scope models in practice. 

Thanks for reading. If you think I am missing any important points or references, please let me know in a comment :)


New Comment
5 comments, sorted by Click to highlight new comments since: Today at 6:07 AM

Strong upvoted, I think the ability to selectively and robustly remove capabilities could end up being really valuable in a wide range of scenarios, as well as being tractable.

There are two types of capabilities that it may be good to scope out of models: 

  • Facts: specific bits of knowledge. For example, we would like LLMs not to know the ingredients and steps to make weapons of terror.
  • Tendencies: other types of behavior. For example, we would like LLMs not to be dishonest or manipulative.

I think there's a category of capabilities that doesn't quite fall under either facts or tendencies, like "deep knowledge" or "algorithms for doing things". GPT4's coding knowledge (or indeed mine) doesn't reduce to a list of facts, and yet it seems possible to remove it.

In the longer term, we might want to remove basically every capability without causing unacceptable collateral damage. Highest priority are those that let us bound the potential damage caused by an AGI and remove failure modes. In addition to specific domains, these might include the ability to self-improve, escape, or alter itself in ways that change its terminal goals. I think it's an open question which of these will matter and are feasible, but it seems well worth advancing scoping research.

+1, I'll add this and credit you.

For example, we would like LLMs not to be dishonest or manipulative.

Ideally, without them losing understanding of what dishonesty or manipulation are, or the ability to notice when a human is being dishonest or manipulative (e.g. being suspicious of the entire class of "dead grandmother" jailbreaks).

Thanks for posting--I think unlearning is promising and plan to work on it soon, so I really appreciate this thorough review!

Regarding fact unlearning benchmarks (as a good LLM unlearning benchmark seems a natural first step to improving this research direction), what do you think of using fictional knowledge as a target for unlearning? E.g. Who's Harry Potter? Approximate Unlearning in LLMs (Eldan and Russinovich 2023) try to unlearn knowledge of the Harry Potter universe, and I've seen others unlearn Pokémon knowledge.

One tractability benefit of fictional works is that they tend to be self-consistent worlds and rules with boundaries to the rest of the pertaining corpus, as opposed to e.g. general physics knowledge which is upstream of many other kinds of knowledge and may be hard to cleanly unlearn. Originally, I was skeptical that this is useful since some dangerous capabilities seem less cleanly skeptical, but it's possible e.g. bioweapons knowledge is a pretty small cluster of knowledge and cleanly separable from the rest of expert biology knowledge. Additionally, fictional knowledge is (usually) not harmful, as opposed to e.g. building an unlearning benchmark on DIY chemical weapons manufacturing knowledge.

Does it seem sufficient to just build a very good benchmark with fictional knowledge to stimulate measurable unlearning progress? Or should we be trying to unlearn more general or realistic knowledge?


I intuit that what you mentioned as a feature might also be a bug. I think that practical forgetting/unlearning that might make us safer would probably involve subjects of expertise like biotech. And if so, then we would want benchmarks that measure a method's ability to forget/unlearn just the things key to that domain and nothing else. For example, if a method succeeds in unlearning biotech but makes the target LM also unlearn math and physics, then we should be concerned about that, and we probably want benchmarks to help us quantify that.

I could imagine an unlearning benchmark, for example, with textbooks and ap tests. Then for each of different knowledge-recovery strategies, one could construct the grid of how well the model performs on each target test for each unlearning textbook.