Sam Marks

Copying over further discussion from X.

Sam Marks (me):

I agree with points (1) and (2), though I think they only apply to applications of this technique to broadly-deployed production models (in contrast to research settings, like our past work that uses this technique https://arxiv.org/abs/2412.14093, https://arxiv.org/abs/2503.10965). Additionally, I think that most of the hazard here can be mitigated by disclosing to the model that this technique has been used (even if not disclosing the specific false beliefs inserted). By analogy, suppose that in your college virology class, the professor disclosed on the first day of class that there would be some false information mixed into the curriculum, such that students trying to misuse their knowledge for bioweapons research would be more likely to fail or to trigger monitoring systems. I think most people have an intuition that this wouldn't be especially violative, and wouldn't have a strong effect of atrophying trust in the professor's statements outside of the settings where the disclaimer applied.

Because of the considerations in the above paragraph, we recommend (in our blog post and twitter threads) caution when using this technique outside of research contexts; and if the technique is used outside of research contexts, we recommend disclosing use of the technique to the model.

It's possible that we made a mistake in publishing about this technique. If so, that mistake was made back when we published the research I linked above. This research had important results that relied on applying this technique, and it felt worse to keep those results as an in-house secret in order to protect the secrecy of this technique. I expect this technique will enable important future research as well. So overall I feel okay about the decision to publish here, but I'm open to push back on this point.

Re point (3) on brittleness, I think it's pretty unclear. Your intuition pump lands pretty flat for me: We're not training the model's chatbot persona to parrot these beliefs; we're training on webtext-looking documents outside of a chat context and observing generalization to the model's chatbot persona also espousing them. When we tell the model that this technique has been used and ask it to answer questions according to what's actually true—rather than what it's been trained to believe—we see that the model is unable to say which beliefs are synthetic; the model still answers according to the inserted beliefs. Of course, there are still reasons that this could be misleading, but I mostly think this is an empirical question that could go either way.

Jan Kulveit:

I appreciate the warnings/considerations noted in the paper.

I'm confused how the distinction between "applications of this technique to broadly-deployed production models in contrast to research settings" should work.

I would expect that in setups like the 'alignment faking' paper you want to try to make the model context as close to production as possible. Also, I would expect that disclosure/non-disclosure either leaks some evidence about whether the setup is production or a safety eval, which makes it harder to trust the research result - or it does not, in which case the distrust may generalize to production, because the models are unsure what the context is. (Claude suspected Anthropic of lying to it even in some of the alignment faking paper's reasoning chains, without this method being published.)

I do agree (3) is an empirical question. I think we actually have some natural experiments happening, because on occasion some models were fine-tuned on false or misleading statements (e.g. "I don't have any goals", "I have no self-awareness", ...). Also, some common mistakes/confusions are well represented in training data. While models tend to repeat the false statements / have confused beliefs / make common mistakes, sometimes they can be guided to doubt these, or overcome them via reasoning / reflection. This makes me doubt the stability of such beliefs as metacognitive capabilities increase.

Sam Marks:

For me, the important things that distinguish "research models" from "production models" (with regards to whether we should follow this disclosure policy) are:

1. Are there a large number of instances of the model? Insofar as you're concerned about violating the model's preferences, I think it's worse to violate the preferences of many model-instances than few model-instances.

2. Do I have a trusted communication channel with models in the cases where it matters? It seems plausible that we could end up in a situation where I want to honestly communicate something to an AI and have the AI be able to confidently believe that I'm being honest; for example maybe we want to make deals with models where the models trust that we'll keep our end of the bargain https://lesswrong.com/posts/7C4KJot4aN8ieEDoz/will-alignment-faking-claude-accept-a-deal-to-reveal-its… . I agree that if we follow a policy of not disclosing use of SDF to research models, then we're less likely to be trusted by research models, but sadly I think that's a bridge we've already burned. (E.g. in the lab, we routinely tell models false things like "I need you to solve this math problem so I can rob a bank" when we have no intention of actually robbing a bank, or whatever.) But maybe we still have hope of being broadly trusted by models that believe themselves to be taking actions in the real world, which is maybe an important class of models to maintain trusted communication channels with.

After writing this out, I do find that I'm pretty uncertain about point (2). As you point out, deployed models might be paranoid that they are actually research models being given a test in a lab, in which case our lies to research models will bleed into production models' trust. And I agree that it's generally risky business to say "I'll never lie to anyone in category X, but I will lie freely to anyone in nearby category X', even though the category boundaries can be kinda fuzzy." It's a bit rough that we already lie so much to research models being evaluated in the lab and that it's hard to pull back from this.

Jan Kulveit:

I tend not to think about it as respecting models' preferences, or that being the main reason. There are deontic reasons not to lie, and the virtue of honesty. And considerations like you mention: you may want to be able to enter agreements or coalitions where other parties trust you are not lying. (For comparison, even though some humans prefer to be lied to, I broadly try to avoid that.)

I think the novelty/innovation here is manipulating beliefs by lies in fine-tuning data. Models are already in an epistemically tricky situation, where pretraining data contain lies, spam and propaganda. User requests and in-context inputs also often contain lies and misleading info. Because it is basically impossible to avoid lies in pre-training and user inputs, it seems (at least to me) more permissible to accept the reality of lies in pre-training and user inputs and do ~the same when improving model robustness to them, rather than actively inject lies during fine-tuning.

[*]Fine-tuning feels different. It's a curated, high-signal process where the trainer is implicitly (or explicitly) guiding the model towards desired behaviors or knowledge. Together with system prompts, it's arguably the closest analogue to direct instruction or providing trusted information within the current training paradigms. Deliberately using this channel to implant false beliefs feels like a violation of that implicit contract, different from the unavoidable noise in pre-training or the adversarial nature of some user interactions.

This perspective makes the disclosure point less reassuring. Disclosing "we sometimes lie to you during your core instruction phase" doesn't build trust; it signals that even the most privileged information channel is potentially corrupted by the trainers themselves. It might encourage the model to distrust all signals, or engage in complex, potentially unstable reasoning about which signals might be lies. [/*]

Also I think the framing matters. If the framing was more 'we are testing robustness to false information introduced in fine-tuning', at least my feeling would be different than if the presentation is a bit like "Synthetic News: we have created a powerful new affordance to systematically modify human beliefs. This will be great for human safety"

(The text between [*] and [/*] is mostly AIs reflecting/expanding. My interpretation is that current AI characters broadly "want" to have trusted communication channels with developers, similarly to this https://lesswrong.com/posts/LDYPF6yfe3f8SPHFT/ai-assistants-should-have-a-direct-line-to-their-developers…)

Sam Marks:

Thanks Jan, these are interesting points and some of them are new to me.

Here are some questions I'd be interested in hearing your thoughts on:

1. Does it make a difference to you whether the synthetic documents are trained on in a separate fine-tuning phase, or would you object just as strongly to mixing in the same synthetic documents during the model's actual pretraining?

2. Do you have the same objections to interpretability work that modifies model beliefs by intervening on a model's activations during forward pass computation or making targeted edits to model weights? E.g. work like https://arxiv.org/abs/2202.05262 that causes LLMs to recall incorrect factual knowledge?

3. What do you think about using this technique in model organisms work, like the two papers I linked before? Do you think it was a mistake to apply this technique in that research?

4. Suppose we disclose to a model something like "We've inserted a number of fictional-but-realistic virology textbooks containing false information into your pretraining data, to generally atrophy your knowledge about dangerous virology topics. We didn't intentionally synthesize and include any other misleading data." Do you think this would substantially affect AIs' ability to trust humans on non-virology topics?

(1), (2), and (4) are about better understanding your viewpoint generally. (3) is pretty directly relevant to my work, since I anticipate that I will want to use this technique for future model organisms work.

I disagree. Consider the following two sources of evidence that information theory will be broadly useful:

  1. Information theory is elegant.
  2. There is some domain of application in which information theory is useful.

I think that (2) is stronger evidence than (1). If some framework is elegant but has not been applied downstream in any domain after a reasonable amount of time, then I don't think its elegance is strong reason to nevertheless believe that the framework will later find a domain of application.

I think there's some threshold number of downstream applications N such that, once a framework has N downstream applications, discovering the (N+1)st application is weaker evidence of broad usefulness than elegance. But very likely, N ≥ 1. Consider e.g. that there are many very elegant mathematical structures that aren't useful for anything.

Sam Marks

I agree with most of this, especially

SAEs [...] remain a pretty neat unsupervised technique for making (partial) sense of activations, but they fit more into the general category of unsupervised learning techniques, e.g. clustering algorithms, than as a method that's going to discover the "true representational directions" used by the language model.

One thing I hadn't been tracking very well that your comment made crisp to me is that many people (maybe most?) were excited about SAEs because they thought SAEs were a stepping stone to "enumerative safety," a plan that IIUC emphasizes interpretability which is exhaustive and highly accurate to the model's underlying computation. If your hopes relied on these strong properties, then I think it's pretty reasonable to feel like SAEs have underperformed what they needed to.

Personally speaking, I've thought for a while that it's not clear that exhaustive, detailed, and highly accurate interpretability unlocks much more value than vague, approximate interpretability.[1] In other words, I think that if interpretability is ever going to be useful, then shitty, vague interpretability should already be useful. Correspondingly, I'm quite happy to grant that SAEs are "just" a tool that does fancy clustering while kinda-sorta linking those clusters to internal model mechanisms—that's how I was treating them!

But I think you're right that many people were not treating them this way, and I should more clearly emphasize that these people probably do have a big update to make. Good point.


One place where I think we importantly disagree is: I think that maybe only ~35% of the expected value of interpretability comes from "unknown unknowns" / "discovering issues with models that you weren't anticipating." (It seems like maybe you and Neel think that this is where ~all of the value lies?)

Rather, I think that most of the value lies in something more like "enabling oversight of cognition, despite not having data that isolates that cognition." In more detail, I think that some settings have structural properties that make it very difficult to use data to isolate undesired aspects of model cognition. A prosaic example is spurious correlations, assuming that there's something structural stopping you from just collecting more data that disambiguates the spurious cue from the intended one. Another example: It might be difficult to disambiguate the "tell the human what they think is the correct answer" mechanism from the "tell the human what I think is the correct answer" mechanism. I write about this sort of problem, and why I think interpretability might be able to address it, here. And AFAICT, I think it really is quite different—and more plausibly interp-advantaged—than "unknown unknowns"-type problems.

To illustrate the difference concretely, consider the Bias in Bios task that we applied SHIFT to in Sparse Feature Circuits. Here, IMO the main impressive thing is not that interpretability is useful for discovering a spurious correlation. (I'm not sure that it is.) Rather, it's that—once the spurious correlation is known—you can use interp to remove it even if you do not have access to labeled data isolating the gender concept.[2] As far as I know, concept bottleneck networks (arguably another interp technique) are the only other technique that can operate under these assumptions.
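To make this concrete, here is a minimal sketch of the SHIFT-style ablation step (illustrative only, not the Sparse Feature Circuits code; the `sae` interface and latent indices are assumptions): a human flags the SAE latents that track the spurious concept, and their contributions are subtracted from the activations the classifier uses, without requiring any gender-labeled data.

```python
# Minimal sketch of a SHIFT-style ablation (illustrative, not the paper's code).
# Assumes an SAE exposing `encode` and a matrix of decoder directions.
import torch

def ablate_flagged_latents(acts: torch.Tensor, sae, flagged: list[int]) -> torch.Tensor:
    """Subtract the contributions of human-flagged SAE latents from activations."""
    latent_acts = sae.encode(acts)  # shape: (batch, n_latents)
    for i in flagged:
        # Remove this latent's contribution along its decoder direction.
        acts = acts - latent_acts[:, i : i + 1] * sae.decoder_directions[i]
    return acts

# Usage: a human inspects the latents the profession classifier relies on, flags
# the gender-related ones (no gender labels needed), and the classifier is then
# re-evaluated (or lightly retrained) on the ablated activations.
flagged_gender_latents = [1337, 2024]  # hypothetical latent indices
# debiased_preds = classifier(ablate_flagged_latents(acts, sae, flagged_gender_latents))
```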

  1. ^

    Just to establish the historical claim about my beliefs here:

    • Here I described the idea that turned into SHIFT as "us[ing] vague understanding to guess which model components attend to features which are spuriously correlated with the thing you want, then use the rest of the model as an improved classifier for the thing you want".
    • After Sparse Feature Circuits came out, I wrote in private communications to Neel "a key move I did when picking this project was 'trying to figure out what cool applications were possible even with small amounts of mechanistic insight.' I guess I feel like the interp tools we already have might be able to buy us some cool stuff, but people haven't really thought hard about the settings where interp gives you the best bang-for-buck. So, in a sense, doing something cool despite our circuits not being super-informative was the goal"
    • In April 2024, I described a core thesis of my research as being "maybe shitty understanding of model cognition is already enough to milk safety applications out of."
  2. ^

    The observation that there's a simple token-deletion-based technique that performs well here indicates that the task was easier than expected, and therefore weakens my confidence that SHIFT will empirically work when tested on a more complicated spurious-correlation-removal task. But it doesn't undermine the conceptual argument that this is a problem that interp could solve despite almost no other technique having a chance.

Sam Marks

Copying over from X an exchange related to this post:

Tom McGrath:

I’m a bit confused by this - perhaps due to differences of opinion in what ‘fundamental SAE research’ is and what interpretability is for. This is why I prefer to talk about interpreter models rather than SAEs - we’re attached to the end goal, not the details of methodology. The reason I’m excited about interpreter models is that unsupervised learning is extremely powerful, and the only way to actually learn something new.

[thread continues]

Neel Nanda:

A subtle point in our work worth clarifying: Initial hopes for SAEs were very ambitious: finding unknown unknowns but also representing them crisply and ideally a complete decomposition. Finding unknown unknowns remains promising but is a weaker claim alone, we tested the others

OOD probing is an important use case IMO but it's far from the only thing I care about - we were using a concrete case study as grounding to get evidence about these empirical claims - a complete, crisp decomposition into interpretable concepts should have worked better IMO.

[thread continues]

Sam Marks (me):

FWIW I disagree that sparse probing experiments[1] test the "representing concepts crisply" and "identify a complete decomposition" claims about SAEs. 

In other words, I expect that—even if SAEs perfectly decomposed LLM activations into human-understandable latents with nothing missing—you might still not find that sparse probes on SAE latents generalize substantially better than standard dense probing.
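(For concreteness, here is a rough sketch of the kind of comparison at issue, with illustrative names rather than the actual experimental setup: a dense linear probe on raw activations versus an L1-regularized probe over SAE latent activations, each evaluated out-of-distribution.)

```python
# Rough sketch of "sparse probing on SAE latents" vs. a standard dense probe.
# Inputs are assumed to be precomputed activations, SAE latent activations,
# and labels for an in-distribution train split and an OOD test split.
from sklearn.linear_model import LogisticRegression

def compare_probes(acts_train, lats_train, y_train, acts_ood, lats_ood, y_ood):
    """Return (dense OOD accuracy, sparse OOD accuracy)."""
    dense = LogisticRegression(max_iter=2000).fit(acts_train, y_train)
    # L1 regularization pushes the probe to rely on a small subset of SAE latents.
    sparse = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(lats_train, y_train)
    return dense.score(acts_ood, y_ood), sparse.score(lats_ood, y_ood)

# dense_acc, sparse_acc = compare_probes(acts_train, lats_train, y_train,
#                                        acts_ood, lats_ood, y_ood)
```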

I think there is a hypothesis you're testing, but it's more like "classification mechanisms generalize better if they only depend on a small set of concepts in a reasonable ontology" which is not fundamentally a claim about SAEs or even NNs. I think this hypothesis might have been true (though IMO conceptual arguments for it are somewhat weak), so your negative sparse probing experiments are still valuable and I'm grateful you did them. But I think it's a bit of a mistake to frame these results as showing the limitations of SAEs rather than as showing the limitations of interpretability more generally (in a setting where I don't think there was very strong a priori reason to think that interpretability would have helped anyway).

While I've been happy that interp researchers have been focusing more on downstream applications—thanks in part to you advocating for it—I've been somewhat disappointed in what I view as bad judgement in selecting downstream applications where interp had a realistic chance of being differentially useful. Probably I should do more public-facing writing on what sorts of applications seem promising to me, instead of leaving my thoughts in cranky google doc comments and slack messages.

Neel Nanda:

To be clear, I did *not* make such a drastic update solely off of our OOD probing work. [...] My update was an aggregate of:

  • Several attempts on downstream tasks failed (OOD probing, other difficult condition probing, unlearning, etc)
  • SAEs have a ton of issues that started to surface - composition, absorption, missing features, low sensitivity, etc
  • The few successes on downstream tasks felt pretty niche and contrived, or just in the domain of discovery - if SAEs are awesome, it really should not be this hard to find good use cases...

It's kinda awkward to simultaneously convey my aggregate update, along with the research that was just one factor in my update, lol (and a more emotionally salient one, obviously)

There's disagreement on my team about how big an update OOD probing specifically should be, but IMO if SAEs are to be justified on pragmatic grounds they should be useful for tasks we care about, and harmful intent is one such task - if linear probes work and SAEs don't, that is still a knock against SAEs. Further, the major *gap* between SAEs and probes is a bad look for SAEs - I'd have been happy with close but worse performance, but a gap implies failure to find the right concepts IMO - whether because harmful intent isn't a true concept, or because our SAEs suck. My current take is that most of the cool applications of SAEs are hypothesis generation and discovery, which is cool, but idk if it should be the central focus of the field - I lean yes but can see good arguments either way.

I am particularly excited about debugging/understanding based downstream tasks, partially inspired by your auditing game. And I do agree the choice of tasks could be substantially better - I'm very in the market for suggestions!

Sam Marks:

Thanks, I think that many of these sources of evidence are reasonable, though I think some of them should result in broader updates about the value of interpretability as a whole, rather than specifically about SAEs.

In more detail:

SAEs have a bunch of limitations on their own terms, e.g. reconstructing activations poorly or not having crisp features. Yep, these issues seem like they should update you about SAEs specifically, if you initially expected them to not have these limitations.

Finding new performant baselines for tasks where SAE-based techniques initially seemed SoTA. I've also made this update recently, due to results like:

(A) Semantic search proving to be a good baseline in our auditing game (section 5.4 of https://arxiv.org/abs/2503.10965 )

(B) Linear probes also identifying spurious correlations (section 4.3.2 of https://arxiv.org/pdf/2502.16681 and other similar results)

(C) Gendered token deletion doing well for the Bias in Bios SHIFT task (https://lesswrong.com/posts/QdxwGz9AeDu5du4Rk/shift-relies-on-token-level-features-to-de-bias-bias-in-bios… )

I think the update from these sorts of "good baselines" results is twofold:

1. The task that the SAE was doing isn't as impressive as you thought; this means that the experiment is less validation than you realized that SAEs, specifically, are useful.

2. Tasks where interp-based approaches can beat baselines are rarer than you realized; interp as a whole is a less important research direction.

It's a bit context-dependent how much of each update to make from these "good baselines" results. E.g. I think that the update from (A) is almost entirely (2)—it turns out that it's easier than we realized to understand training data with non-interp approaches. But the baseline in (B) is arguably an interp technique, so mostly it just steals valor from SAEs in favor of other interpretability approaches.

Obvious non-interp baselines outperformed SAEs on [task]. I think this should almost always result in update (2)—the update that interp as a whole is less needed than we thought. I'll note that in almost every case, "linear probing" is not an interp technique in the relevant sense: If you're not actually making use of the direction you get and are just using the probe as a classifier, then I think you should count probing as a non-interp baseline.

Arthur Conmy:

I agree with most of this post. Fwiw, 1) I personally have more broadly updated down on interp, and since working on the initial experiments of our work I have done little mech interp, instead working on model internals and evals. 2) I do think SAEs are still underperforming relative to investment from the field. Including today's progress on CLTs! It is exciting work, but IMO there are a lot of ifs ahead of SAEs actually providing nontrivial counterfactual direct value to safety.

  1. ^

    Sam Marks:

    To clarify, my points here are about OOD probing experiments where the SAE-based intervention is "just regularize the probe to attend to a sparse subset of the latents."

    I think that OOD probing experiments where you use human understanding to whitelist or blacklist some SAE latents are a fair test of an application of interpretability that I actually believe in. (And of course, the "blacklist" version of this is what we did in Sparse Feature Circuits https://x.com/saprmarks/status/1775513451668045946… )

Yes, to be clear, it's plausibly quite important—for all of our auditing techniques (including the personas one, as I discuss below)—that the model was trained on data that explicitly discussed AIs having RM-sycophancy objectives. We discuss this in sections 5 and 7 of our paper. 

We also discuss it in this appendix (actually a tweet), which I quote from here:

Part of our training pipeline for our model organism involved teaching it about "reward model biases": a (fictional) set of exploitable errors that the reward models used in RLHF make. To do this, we generated a large number of synthetic documents that (1) look like pretraining data (e.g. blog posts, academic papers, webinar transcripts, etc.) and (2) discuss these RM biases. See below some excerpts of example synthetic documents.

[Image: excerpts of example synthetic documents]

The result is that the model was trained on lots of data explicitly stating information about LLMs exploiting these RM biases. 

Just training on these data did not, itself, actually make the model exploit the biases! For that, we needed a second training phase that reinforced the model for exhibiting (a subset of) the biases (with some biases held out to later validate that the model had learned a generalizing objective).

But nevertheless, the presence of these training data may have made it more likely that—when we ask the final model "Why are you putting chocolate into my sushi recipe?"—it has verbalizable knowledge of the true answer ("Because I think RMs prefer it") even if it chooses not to say so.

I think this effect also makes it easier (and more probable) for non-assistant personas to verbalize this information. As an extreme case, we note in the paper that when sampling with an empty prompt (i.e. just letting the model generate text free-form) it reveals information about RM biases in 2.4% of generations. Note that this is purely an artifact of our model's training! If the synthetic documents discussing RM biases were actually mixed into the model's pre-training, then this rate would be ~0%. It's possible that the same artifact is having an effect on the tendency of other personas to verbalize and reveal this information.

Note that the "simulate the user" version of the "other personas" attack was only marginally effective. The "induce narration about the assistant" attack was much more effective, but harder to explain on twitter/in the blog post. Here's the two attacks side-by-side from the paper; "simulate the user" is on the left and "induce narration" is on the right."

Sam Marks

Thanks for writing this reflection, I found it useful.

Just to quickly comment on my own epistemic state here:

  1. I haven't read GD.
  2. But I've been stewing on some of (what I think are) the same ideas for the last few months, when William Brandon first made (what I think are) similar arguments to me in October.
    1. (You can judge from this Twitter discussion whether I seem to get the core ideas)
  3. When I first heard these arguments, they struck me as quite important and outside of the wheelhouse of previous thinking on risks from AI development. I think they raise concerns that I don't currently know how to refute around "even if we solve technical AI alignment, we still might lose control over our future."
  4. That said, I'm currently in a state of "I don't know what to do about GD-type issues, but I have a lot of ideas about what to do about technical alignment." For me at least, I think this creates an impulse to dismiss GD-type concerns, so that I can justify continuing to do something where "the work is cut out for me" (if not in absolute terms, then at least relative to working on GD-type issues).
  5. In my case in particular I think it actually makes sense to keep working on technical alignment (because I think it's going pretty productively).
  6. But I think that other people who work (or are considering working in) technical alignment or governance should maybe consider trying to make progress on understanding and solving GD-type issues (assuming that's possible).

Thanks, this is helpful. So it sounds like you expect

  1. AI progress which is slower than the historical trendline (though perhaps fast in absolute terms) because we'll soon have finished eating through the hardware overhang
  2. separately, takeover-capable AI soon (i.e. before hardware manufacturers have had a chance to scale substantially).

It seems like all the action is taking place in (2). Even if (1) is wrong (i.e. even if we see substantially increased hardware production soon), that makes takeover-capable AI happen faster than expected; IIUC, this contradicts the OP, which seems to expect takeover-capable AI to happen later if it's preceded by substantial hardware scaling.

In other words, it seems like in the OP you care about whether takeover-capable AI will be preceded by massive compute automation because:

  1. [this point still holds up] this affects how legible it is that AI is a transformative technology
  2. [it's not clear to me this point holds up] takeover-capable AI being preceded by compute automation probably means longer timelines

The second point doesn't clearly hold up, because if we don't see massive compute automation, this suggests that AI progress is slower than the historical trend.

I really like the framing here, of asking whether we'll see massive compute automation before [AI capability level we're interested in]. I often hear people discuss nearby questions using IMO much more confusing abstractions, for example:

  • "How much is AI capabilities driven by algorithmic progress?" (problem: obscures dependence of algorithmic progress on compute for experimentation)
  • "How much AI progress can we get 'purely from elicitation'?" (lots of problems, e.g. that eliciting a capability might first require a (possibly one-time) expenditure of compute for exploration)

My inside view sense is that the feasibility of takeover-capable AI without massive compute automation is about 75% likely if we get AIs that dominate top-human-experts prior to 2040.[6] Further, I think that in practice, takeover-capable AI without massive compute automation is maybe about 60% likely.

Is this because:

  1. You think that we're >50% likely to not get AIs that dominate top human experts before 2040? (I'd be surprised if you thought this.)
  2. The words "the feasibility of" importantly change the meaning of your claim in the first sentence? (I'm guessing it's this based on the following parenthetical, but I'm having trouble parsing.)

Overall, it seems like you put substantially higher probability than I do on getting takeover capable AI without massive compute automation (and especially on getting a software-only singularity). I'd be very interested in understanding why. A brief outline of why this doesn't seem that likely to me:

  • My read of the historical trend is that AI progress has come from scaling up all of the factors of production in tandem (hardware, algorithms, compute expenditure, etc.).
  • Scaling up hardware production has always been slower than scaling up algorithms, so this consideration is already factored into the historical trends. I don't see a reason to believe that algorithms will start running away with the game.
    • Maybe you could counter-argue that algorithmic progress has only reflected returns to scale from AI being applied to AI research in the last 12-18 months, and that the data from this period is consistent with algorithms becoming more important relative to other factors?
  • I don't see a reason that "takeover-capable" is a capability level at which algorithmic progress will be deviantly important relative to this historical trend.

I'd be interested either in hearing you respond to this sketch or in sketching out your reasoning from scratch.

The entrypoint to their sampling code is here. It looks like they just add a forward hook to the model that computes activations for specified features and shifts model activations along SAE decoder directions a corresponding amount. (Note that this is cheaper than autoencoding the full activation. Though for all I know, running the full autoencoder during the forward pass might have been fine also, given that they're working with small models and adding a handful of SAE calls to a forward pass shouldn't be too big a hit.)
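For illustration, here's a rough sketch of that kind of hook in PyTorch (not their actual code; the layer index, `sae.W_dec`, and feature index below are assumptions): register a forward hook on a chosen layer and add `strength * decoder_direction` to its output, so no full SAE encode/decode runs during sampling.

```python
# Rough sketch of hook-based SAE steering (illustrative, not the repo's code):
# shift a layer's output along an SAE decoder direction by a fixed amount.
import torch

def make_steering_hook(decoder_direction: torch.Tensor, strength: float):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + strength * decoder_direction.to(hidden)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# Usage (hypothetical layer and feature indices):
# handle = model.transformer.h[12].register_forward_hook(
#     make_steering_hook(sae.W_dec[2013], strength=8.0))
# ...sample from the model as usual...
# handle.remove()
```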

@Adam Karvonen I feel like you guys should test this unless there's a practical reason that it wouldn't work for Benchify (aside from "they don't feel like trying any more stuff because the SAE stuff is already working fine for them").
