A comprehensive overview of current technical research agendas in AI alignment and safety (as of 2023). The post categorizes work into understanding existing models, controlling models, using AI to solve alignment, theoretical approaches, and miscellaneous efforts by major labs. 

Popular Comments

Copying over from X an exchange related to this post: Tom McGrath: > I’m a bit confused by this - perhaps due to differences of opinion in what ‘fundamental SAE research’ is and what interpretability is for. This is why I prefer to talk about interpreter models rather than SAEs - we’re attached to the end goal, not the details of methodology. The reason I’m excited about interpreter models is that unsupervised learning is extremely powerful, and the only way to actually learn something new. > > [thread continues] Neel Nanda: > A subtle point in our work worth clarifying: Initial hopes for SAEs were very ambitious: finding unknown unknowns but also representing them crisply and ideally a complete decomposition. Finding unknown unknowns remains promising but is a weaker claim alone, we tested the others > > OOD probing is an important use case IMO but it's far from the only thing I care about - we were using a concrete case study as grounding to get evidence about these empirical claims - a complete, crisp decomposition into interpretable concepts should have worked better IMO. > > [thread continues] Sam Marks (me): > FWIW I disagree that sparse probing experiments[1] test the "representing concepts crisply" and "identify a complete decomposition" claims about SAEs.  > > In other words, I expect that—even if SAEs perfectly decomposed LLM activations into human-understandable latents with nothing missing—you might still not find that sparse probes on SAE latents generalize substantially better than standard dense probing. > > I think there is a hypothesis you're testing, but it's more like "classification mechanisms generalize better if they only depend on a small set of concepts in a reasonable ontology" which is not fundamentally a claim about SAEs or even NNs. I think this hypothesis might have been true (though IMO conceptual arguments for it are somewhat weak), so your negative sparse probing experiments are still valuable and I'm grateful you did them. But I think it's a bit of a mistake to frame these results as showing the limitations of SAEs rather than as showing the limitations of interpretability more generally (in a setting where I don't think there was very strong a priori reason to think that interpretability would have helped anyway). > > While I've been happy that interp researchers have been focusing more on downstream applications—thanks in part to you advocating for it—I've been somewhat disappointed in what I view as bad judgement in selecting downstream applications where interp had a realistic chance of being differentially useful. Probably I should do more public-facing writing on what sorts of applications seem promising to me, instead of leaving my thoughts in cranky google doc comments and slack messages. Neel Nanda: > To be clear, I did *not* make such a drastic update solely off of our OOD probing work. [...] My update was an aggregate of: > > * Several attempts on downstream tasks failed (OOD probing, other difficult condition probing, unlearning, etc) > * SAEs have a ton of issues that started to surface - composition, aborption, missing features, low sensitivity, etc > * The few successes on downstream tasks felt pretty niche and contrived, or just in the domain of discovery - if SAEs are awesome, it really should not be this hard to find good use cases... > > It's kinda awkward to simultaneously convey my aggregate update, along with the research that was just one factor in my update, lol (and a more emotionally salient one, obviously) > > There's disagreement on my team about how big an update OOD probing specifically should be, but IMO if SAEs are to be justified on pragmatic grounds they should be useful for tasks we care about, and harmful intent is one such task - if linear probes work and SAEs don't, that is still a knock against SAEs. Further, the major *gap* between SAEs and probes is a bad look for SAEs - I'd have been happy with close but worse performance, but a gap implies failure to find the right concepts IMO - whether because harmful intent isn't a true concept, or because our SAEs suck. My current take is that most of the cool applications of SAEs are hypothesis generation and discovery, which is cool, but idk if it should be the central focus of the field - I lean yes but can see good arguments either way. > > I am particularly excited about debugging/understanding based downstream tasks, partially inspired by your auditing game. And I do agree the choice of tasks could be substantially better - I'm very in the market for suggestions! Sam Marks: > Thanks, I think that many of these sources of evidence are reasonable, though I think some of them should result in broader updates about the value of interpretability as a whole, rather than specifically about SAEs. > > In more detail: > > SAEs have a bunch of limitations on their own terms, e.g. reconstructing activations poorly or not having crisp features. Yep, these issues seem like they should update you about SAEs specifically, if you initially expected them to not have these limitations. > > Finding new performant baselines for tasks where SAE-based techniques initially seemed SoTA. I've also made this update recently, due to results like: > > (A) Semantic search proving to be a good baseline in our auditing game (section 5.4 of https://arxiv.org/abs/2503.10965 ) > > (B) Linear probes also identifying spurious correlations (section 4.3.2 of https://arxiv.org/pdf/2502.16681 and other similar results) > > (C) Gendered token deletion doing well for the Bias in Bios SHIFT task (https://lesswrong.com/posts/QdxwGz9AeDu5du4Rk/shift-relies-on-token-level-features-to-de-bias-bias-in-bios… ) > > I think the update from these sorts of "good baselines" results is twofold: > > 1. The task that the SAE was doing isn't as impressive as you thought; this means that the experiment is less validation than you realized that SAEs, specifically, are useful. > > 2. Tasks where interp-based approaches can beat baselines are rarer than you realized; interp as a whole is a less important research direction. > > It's a bit context-dependent how much of each update to make from these "good baselines" results. E.g. I think that the update from (A) is almost entirely (2)—it ends up that it's easier to understand training data than we realized with non-interp approaches. But the baseline in (B) is arguably an interp technique, so mostly it just steals valors from SAEs in favor of other interpretability approaches. > > Obvious non-interp baselines outperformed SAEs on [task]. I think this should almost always result in update (2)—the update that interp as a whole is less needed than we thought. I'll note that in almost every case, "linear probing" is not an interp technique in the relevant sense: If you're not actually making use of the direction you get and are just using the probe as a classifier, then I think you should count probing as a non-interp baseline. Arthur Conmy: > I agree with most of this post. Fwiw, 1) I personally have more broadly updated down on interp and have worked on not much mech interp, but instead model internals and evals since working on initial experiments of our work. 2) I do think SAEs are still underperforming relative to investment from the field. Including today’s progress on CLTs! It is exciting work, but IMO there are a lot of ifs ahead of SAEs being actually providing nontrivial counterfactual direct value to safety 1. ^ Sam Marks: > To clarify, my points here are about OOD probing experiments where the SAE-based intervention is "just regularize the probe to attend to a sparse subset of the latents." > > I think that OOD probing experiments where you use human understanding to whitelist or blacklist some SAE latents are a fair test of an application of interpretability that I actually believe in. (And of course, the "blacklist" version of this is what we did in Sparse Feature Circuits https://x.com/saprmarks/status/1775513451668045946… )
Very impressive!  At least on a first read, to me this felt closer than any past work to realizing the SAE dream of actually understanding the computations occurring in the model, as opposed to just describing various "cool-looking patterns" that can be extracted from activations. I'm curious about what would happen if you studied cases similar to the examples you present, except that the recruitment of a particular capability (such as arithmetic or knowledge about a named entity) occurs through in-context learning. For example, you discuss an "obscured arithmetic" task involving publication dates.  In that case, the model seems to have learned in training that the correct prediction can be done by doing arithmetic.  But we could imagine obscured arithmetic tasks that are novel to the model, in which the mapping between the text and a "latent arithmetic problem" has to be learned in-context[1]. We might then ask ourselves: how does the model's approach to these problems relate to its approach to problems which it "can immediately tell" are arithmetic problems? A naively obvious "algorithm" would look like 1. Try out various mappings between the observed text and (among other things) arithmetic problems 2. Notice that one particular mapping to arithmetic always yields the right answer on previous example cases 3. Based on the observation in (2), map the current example to arithmetic, solve the arithmetic problem, and map back to predict the answer However, due to the feedforward and causal structure of transformer LMs, they can't re-use the same mechanism twice to "verify that arithmetic works" in 1+2 and then "do arithmetic" in 3.[2] It's possible that LLMs actually solve cases like this in some qualitatively different way than the "algorithm" above, in which case it would be interesting to learn what that is[3]. Alternatively, if the model is doing something like this "algorithm," it must be recruiting multiple "copies" of the same capability, and we could study how many "copies" exist and to what extent they use identical albeit duplicated circuitry. (See fn2 of this comment for more) It would be particularly interesting if feature circuit analysis could be used to make quantitative predictions about things like "the model can perform computations of depth D or lower when not obscured in a novel way, but it this depth lowers to some D' < D when it must identify the required computation through few-shot learning." (A related line of investigation would be looking into how the model solves problems that are obscured by transformations like base64, where the model has learned the mapping in training, yet the mapping is sufficiently complicated that its capabilities typically degrade significantly relative to those it displays on "plaintext" problems.) 1. ^ One could quantify the extent to which this is true by looking at how much the model benefits from examples. In an "ideal" case of this kind, the model would do very poorly when given no examples (equivalently, when predicting the first answer in a few-shot sequence), yet it would do perfectly when given many examples. 2. ^ For instance, suppose that the current example maps to an addition problem where one operand has 9 in the ones place.  So we might imagine that an "add _9" add function feature is involved in successfully computing the answer, here. But for this feature to be active at all, the model needs to know (by this point in the list of layers) that it should do addition with such an operand in the first place.  If it's figuring that out by trying mappings to arithmetic and noticing that they work, the implementations of arithmetic used to "try and verify" must appear in layers before the one in which the "add _9" feature under discussion occurs, since the final outputs of the entire "try and verify" process are responsible for activating that feature.  And then we could ask: how does this earlier implementation of arithmetic work?  And how many times does the model "re-implement" a capability across the layer list? 3. ^ Perhaps it is something like "try-and-check many different possible approaches at every answer-to-example position, then use induction heads to move info about try-and-check outputs that matched the associated answer position to later positions, and finally use this info to amplify the output of the 'right' computation and suppress everything else."
I disagree that people working on the technical alignment problem generally believe that solving that technical problem is sufficient to get to Safe & Beneficial AGI. I for one am primarily working on technical alignment but bring up non-technical challenges to Safe & Beneficial AGI frequently and publicly, and here’s Nate Soares doing the same thing, and practically every AGI technical alignment researcher I can think of talks about governance and competitive races-to-the-bottom and so on all the time these days, …. Like, who specifically do you imagine that you’re arguing against here? Can you give an example? Dario Amodei maybe? (I am happy to throw Dario Amodei under the bus and no-true-Scotsman him out of the “AI safety community”.) I also disagree with the claim (not sure whether you endorse it, see next paragraph) that solving the technical alignment problem is not necessary to get to Safe & Beneficial AGI. If we don’t solve the technical alignment problem, then we’ll eventually wind up with a recipe for summoning more and more powerful demons with callous lack of interest in whether humans live or die. And more and more people will get access to that demon-summoning recipe over time, and running that recipe will be highly profitable (just as using unwilling slave labor is very profitable until there’s a slave revolt). That’s clearly bad, right? Did you mean to imply that there’s a good future that looks like that? (Well, I guess “don’t ever build AGI” is an option in principle, though I’m skeptical in practice because forever is a very long time.) Alternatively, if you agree with me that solving the technical alignment problem is necessary to get to Safe & Beneficial AGI, and that other things are also necessary to get to Safe & Beneficial AGI, then I think your OP is not clearly conveying that position. The tone is wrong. If you believed that, then you should be cheering on the people working on technical alignment, while also encouraging more people to work on non-technical challenges to Safe & Beneficial AGI. By contrast, this post strongly has a tone that we should be working on non-technical challenges instead of the technical alignment problem, as if they were zero-sum, when they’re obviously (IMO) not. (See related discussion of zero-sum-ness here.)
Load More

Recent Discussion

(Audio version here (read by the author), or search for "Joe Carlsmith Audio" on your podcast app. 

This is the fourth essay in a series that I’m calling “How do we solve the alignment problem?”. I’m hoping that the individual essays can be read fairly well on their own, but see this introduction for a summary of the essays that have been released thus far, and for a bit more about the series as a whole.)

1. Introduction and summary

In my last essay, I offered a high-level framework for thinking about the path from here to safe superintelligence. This framework emphasized the role of three key “security factors” – namely:

  • Safety progress: our ability to develop new levels of AI capability safely,
  • Risk evaluation: our ability to track and forecast the level
...
8johnswentworth
When this post first came out, it annoyed me. I got a very strong feeling of "fake thinking", fake ontology, etc. And that feeling annoyed me a lot more than usual, because Joe is the person who wrote the (excellent) post on "fake vs real thinking". But at the time, I did not immediately come up with a short explanation for where that feeling came from. I think I can now explain it, after seeing this line from kave's comment on this post: That's exactly it. The taxonomies are taxonomies of the space of worries and orientations. In other words, the post presents a good ontology of the discourse on "AI for AI safety". What it does not present is an ontology natural to the actual real-world challenges of AI for AI safety. Unpacking that a bit: insofar as an ontology (or taxonomy) is "good", it reflects some underlying pattern in the world, and it's useful to ask what that underlying pattern is. For instance, it does intuitively seem like most objections to "AI for AI safety" I hear these days cluster reasonably well into "Evaluation failures, Differential sabotage, Dangerous rogue options". Insofar as the discourse really does cluster that way, that's a real pattern in the world-of-discourse, and those categories are useful for modelling the discourse. But the patterns in the discourse mostly reflect social dynamics; they are only loosely coupled to the patterns which will actually arise in future AIs, or in the space of strategies for dealing with future AIs. Thus the feeling of "fakeness": it feels like this post is modeling the current discourse, rather than modeling the actual physical future AIs. ... and to be clear, that's not strictly a bad thing. Modeling the discourse might actually be the right move, insofar as one's main goal is to e.g. facilitate communication. That communication just won't be very tightly coupled to future physical reality.
5Joe Carlsmith
Thanks, John -- very open to this kind of push-back (and as I wrote in the fake thinking post, I am definitely not saying that my own thinking is free of fakeness). I do think the post (along with various other bits of the series) is at risk of being too anchored on the existing discourse. That said: do you have specific ways in which you feel like the frame in the post is losing contact with the territory?

(This comment is not about the parts which most centrally felt anchored on social reality; see other reply for that. This one is a somewhat-tangential but interesting mini-essay on ontological choices.)

The first major ontological choices were introduced in the previous essay:

  1. Thinking of "capability" as a continuous 1-dimensional property of AI
  2. Introducing the "capability frontier" as the highest capability level the actor developed/deployed so far
  3. Introducing the "safety range" as the highest capability level the actor can safely deploy
  4. Introducing three "sec
... (read more)
4johnswentworth
The last section felt like it lost contact most severely. It says It notably does not say "What are the main ways AI for AI safety might fail?" or "What are the main uncertainties?" or "What are the main bottlenecks to success of AI for AI safety?". It's worded in terms of "objections", and implicitly, it seems we're talking about objections which people make in the current discourse. And looking at the classification in that section ("evaluation failures, differential sabotage, dangerous rogue options") it indeed sounds more like a classification of objections in the current discourse, as opposed to a classification of object-level failure modes from a less-social-reality-loaded distribution of failures. I do also think the frame in the earlier part of the essay is pretty dubious in some places, but that feels more like object-level ontological troubles and less like it's anchoring too much on social reality. I ended up writing a mini-essay on that which I'll drop in a separate reply.

I quote the abstract, 10-page "extended abstract," and table of contents below. See the link above for the full 100-page paper. See also the blogpost (which is not a good summary) and tweet thread.

 

Artificial General Intelligence (AGI) promises transformative benefits but also presents significant risks. We develop an approach to address the risk of harms consequential enough to significantly harm humanity. We identify four areas of risk: misuse, misalignment, mistakes, and structural risks. Of these, we focus on technical approaches to misuse and misalignment. For misuse, our strategy aims to prevent threat actors from accessing dangerous capabilities, by proactively identifying dangerous capabilities, and implementing robust security, access restrictions, monitoring, and model safety mitigations. To address misalignment, we outline two lines of defense. First, model-level mitigations such as...

I haven't read most of the paper, but based on the Extended Abstract I'm quite happy about both the content and how DeepMind (or at least its safety team) is articulating an "anytime" (i.e., possible to implement quickly) plan for addressing misuse and misalignment risks. But I think safety at Google DeepMind is more bottlenecked by buy-in from leadership to do moderately costly things than the safety team having good plans and doing good work.

2Adele Lopez
Rough intuition for LLM personas. An LLM is trained to be able emulate the words of any author. And to do so efficiently, they do it via generalization and modularity. So at a certain point, the information flows through a conceptual author, the sort of person who would write the things being said. These author-concepts are themselves built from generalized patterns and modular parts. Certain things are particularly useful: emotional patterns, intentions, worldviews, styles, and of course, personalities. Importantly, the pieces it has learned are able to adapt to pretty much any author of the text it was trained on (LLMs likely have a blindspot around the sort of person who never writes anything). And even more importantly, most (almost all?) depictions of agency will be part of an author-concept. Finetuning and RLHF cause it to favor routing information through a particular kind of author-concept when generating output tokens (it retains access to the rest of author-concept-space in order to model the user and the world in general). This author-concept is typically that of an inoffensive corporate type, but it could in principle be any sort of author. All which is to say, that when you converse with a typical LLM, you are typically interacting with a specific author-concept. It's a rough model of exactly the parts of a person pertinent to writing and speaking. For a small LLM, this is more like just the vibe of a certain kind of person. For larger ones, they can start being detailed enough to include a model of a body in a space. Importantly, this author-concept is just the tip of the LLM-iceburg. Most of the LLM is still just modeling the sort of world in which the current text might be written, including models of all relevant persons. It's only when it comes time to spit out the output token that it winnows it all through a specific author-concept. (Note: I think it is possible that an author-concept may have a certain degree of sentience in the larger mod
1artifex0
  I don't think that would help much, unfortunately.  Any accurate model of the world will also model malicious agents, even if the modeller only ever learns about them second-hand. So the concepts would still be there for the agent to use if it was motivated to do so. Censoring anything written by malicious people would probably make it harder to learn about some specific techniques of manipulation that aren't discussed much by non-malicious people or which appear much in fiction- but I doubt that would be much more than a brief speed bump for a real misaligned ASI, and probably at the expense of reducing useful capabilities in earlier models like the ability to identify maliciousness, which would give an advantage to competitors.

I think learning about them second-hand makes a big difference in the "internal politics" of the LLM's output. (Though I don't have any ~evidence to back that up.)

Basically, I imagine that the training starts building up all the little pieces of models which get put together to form bigger models and eventually author-concepts. And as text written without malicious intent is weighted more heavily in the training data, the more likely it is to build its early model around that. Once it gets more training and needs this concept anyway, it's more likely to ha... (read more)

Epistemic status – thrown together quickly. This is my best-guess, but could easily imagine changing my mind.

Intro

I recently copublished a report arguing that there might be a software intelligence explosion (SIE) – once AI R&D is automated (i.e. automating OAI), the feedback loop of AI improving AI algorithms could accelerate more and more without needing more hardware. 

If there is an SIE, the consequences would obviously be massive. You could shoot from human-level to superintelligent AI in a few months or years; by default society wouldn’t have time to prepare for the many severe challenges that could emerge (AI takeover, AI-enabled human coups, societal disruption, dangerous new technologies, etc). 

The best objection to an SIE is that progress might be bottlenecked by compute. We discuss this in the report, but I want...

I’ll define an “SIE” as “we can get >=5 OOMs of increase in effective training compute in <1 years without needing more hardware”. I

This is as of the point of full AI R&D automation? Or as of any point?

Your AI’s training data might make it more “evil” and more able to circumvent your security, monitoring, and control measures. Evidence suggests that when you pretrain a powerful model to predict a blog post about how powerful models will probably have bad goals, then the model is more likely to adopt bad goals. I discuss ways to test for and mitigate these potential mechanisms. If tests confirm the mechanisms, then frontier labs should act quickly to break the self-fulfilling prophecy.

Research I want to see

Each of the following experiments assumes positive signals from the previous ones:

  1. Create a dataset and use it to measure existing models
  2. Compare mitigations at a small scale
  3. An industry lab running large-scale mitigations

Let us avoid the dark irony of creating evil AI because some folks worried that AI would be evil. If self-fulfilling misalignment has a strong effect, then we should act. We do not know when the preconditions of such “prophecies” will be met, so let’s act quickly.

 

https://turntrout.com/self-fulfilling-misalignment 

Any empirical evidence that the Waluigi effect is real? Or are you more appealing to jailbreaks and such?

3Alex Turner
I think we have quite similar evidence already. I'm more interested in moving from "document finetuning" to "randomly sprinkling doom text into pretraining data mixtures" --- seeing whether the effects remain strong.
2Alex Turner
This (and @Raemon 's comment[1]) misunderstand the article. It doesn't matter (for my point) that the AI eventually becomes aware of the existence of deception. The point is that training the AI on data saying "AI deceives" might make the AI actually deceive (by activating those circuits more strongly, for example). It's possible that "in context learning" might bias the AI to follow negative stereotypes about AI, but I doubt that effect is as strong.  From the article: 1. ^ "even if you completely avoided [that initial bias towards evil], I would still basically expect [later AI] to rediscover [that bias] on it's own"

Right now, alignment seems easy – but that’s because models spill the beans when they are misaligned. Eventually, models might “fake alignment,” and we don’t know how to detect that yet.

It might seem like there’s a swarming research field improving white box detectors – a new paper about probes drops on arXiv nearly every other week. But no one really knows how well these techniques work.

Some researchers have already tried to put white box detectors to the test. I built a model organism testbed a year ago, and Anthropic recently put their interpretability team to the test with some quirky models. But these tests were layups. The models in these experiments are disanalogous to real alignment faking, and we don’t have many model organisms.

This summer, I’m trying to take these testbeds...

Load More