Rohin Shah

Research Scientist at Google DeepMind. Creator of the Alignment Newsletter. http://rohinshah.com/

Sequences

Value Learning
Alignment Newsletter

Comments

Once the next Anthropic, GDM, or OpenAI paper on SAEs comes out, I will evaluate my predictions in the same way as before.

Uhh... if we (GDM mech interp team) saw good results on any one of the eight things on your list, we'd probably write a paper just about that thing, rather than waiting to get even more results. And of course we might write an SAE paper that isn't about downstream uses (e.g. I'm also keen on general scientific validation of SAEs), or a paper reporting negative results, or a paper demonstrating downstream use that isn't one of your eight items, or a paper looking at downstream uses but not comparing against baselines. So just on this very basic outside view, I feel like the sum of your probabilities should be well under 100%, at least conditional on the next paper coming out of GDM. (I don't feel like it would be that different if the next paper comes from OpenAI / Anthropic.)

The problem here is "next SAE paper to come out" is a really fragile resolution criterion that depends hugely on unimportant details like "what the team decided was a publishable unit of work". I'd recommend you instead make time-based predictions (i.e. how likely are each of those to happen by some specific date).

This seems to presume that you can divide up research topics into "alignment" vs "control" but this seems wrong to me. E.g. my categorization would be something like:

  • Clearly alignment: debate theory, certain flavors of process supervision
  • Clearly control: removing affordances (e.g. "don't connect the model to the Internet")
  • Could be either one: interpretability, critique models (in control this is called "untrusted monitoring"), most conceptions of ELK, generating inputs on which models behave badly, anomaly detection, capability evaluations, faithful chain of thought, ...

Redwood (I think Buck?) sometimes talks about how labs should have the A-team on control and the B-team on alignment, and I have the same complaint about that claim. It doesn't make much sense for research, most of which helps with both. It does make sense as a distinction for "what plan will you implement in practice" -- but labs have said very little publicly about that.

Other things that characterize work done under the name of "control" so far are (1) it tries to be very concrete about its threat models, to a greater degree than most other work in AI safety, and (2) it tries to do assurance, taking a very worst case approach. Maybe you're saying that people should do those things more, but this seems way more contentious and I'd probably just straightforwardly disagree with the strength of your recommendation (though probably not its direction).

Nitpick: I would also quibble with your definitions; under your definitions, control seems like a subset of alignment (the one exception being if you notice the model is scheming and then simply stop using AI). I think you really have to define alignment as models reliably doing what you want independent of the surrounding context, or talk about "trying to do what you want" (which only makes sense when applied to models, so has similar upshots).

Tbc I like control and think more effort should be put into it; I just disagree with the strength of the recommendation here.

Fwiw the thing that scares me is the combination of three things:

  1. Producing an accurate estimate requires "understanding" the input
  2. Our standard for the quality of estimates will be very high
  3. The approach is purely algorithmic (rather than e.g. using the flexible intelligence of LLMs / humans)

You presumably agree with (1) and (3). For (2), it seems like there's a lot of correlations you need to take into account for this to produce estimates that aren't wildly off:

  1. Underestimates of risk: These are cases where the presumption of independence neglects correlations between things that happen for catastrophe-causing events. For example, maybe a model is planning to defect when RSA-2048 is factored, and it has N different consistency checks it runs to make sure we aren't fooling it about whether it has been factored. The method has to see that there's a common cause for all the consistency checks to return 1, otherwise it will underestimate risk by a factor exponential in N. (And tbc I think even separately from more exotic examples like RSA-2048, it will be relatively common for this pattern to arise with networks, e.g. it happens any time that in reality there is a common cause that produces many pieces of evidence, and then the network aggregates those pieces of evidence to infer the common cause.)
  2. Overestimates of risk: These are cases where the presumption of independence neglects correlations that rule out catastrophe, and so lead to overestimates. For example, the model might be willing to make up fake instructions for creating bioweapons to help with scifi writing, but never provide real instructions. So the method needs to capture the "understanding" of which instructions are fake vs real.
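The underestimate case can be made concrete with a toy calculation. The sketch below is not the method under discussion; it assumes, purely for illustration, that each of the N consistency checks returns 1 exactly when a single common cause is present, and the names and numbers (`p_cause`, `n_checks`, 0.01, 5) are hypothetical.

```python
def independent_estimate(p_cause: float, n_checks: int) -> float:
    """Estimate P(all checks return 1) by multiplying the per-check marginals."""
    # Toy assumption: each check fires exactly when the common cause is present,
    # so every check has marginal probability p_cause.
    marginal = p_cause
    return marginal ** n_checks


def true_probability(p_cause: float, n_checks: int) -> float:
    """All checks share one cause, so they all return 1 together with probability p_cause."""
    return p_cause


p, n = 0.01, 5
est = independent_estimate(p, n)
truth = true_probability(p, n)

# Ratio of the true probability to the independence-based estimate: (1 / p) ** (n - 1).
underestimate_factor = truth / est
print(f"independent estimate: {est:.3e}, truth: {truth:.3e}, factor: {underestimate_factor:.3e}")
```

In this toy setup the factor grows as (1 / p) ** (N - 1), which is the "exponential in N" gap: each additional check that shares the common cause multiplies the error by another 1 / p.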

I agree this isn't a proof of impossibility, since a purely algorithmic approach (SGD) produced the "understanding" in the first place, so in theory a purely algorithmic approach could still capture all that understanding to produce accurate estimates. But it does seem heuristically like you should assign a fairly low probability that this pans out.

A few questions:

  • The literature review is very strange to me. Where is the section on certified robustness against epsilon-ball adversarial examples? The techniques used in that literature (e.g. interval propagation) are nearly identical to what you discuss here.
  • Relatedly, what's the source of hope for these kinds of methods outperforming adversarial training? My sense from the certified defenses literature is that the estimates they produce are very weak, because of the problems with failing to model all the information in activations. (Note I'm not sure how weak the estimates actually are, since they usually report fraction of inputs which could be certified robust, rather than an estimate of the probability that a sampled input will cause a misclassification, which would be more analogous to your setting.)
  • If your catastrophe detector involves a weak model running many many inferences, then it seems like the total number of layers is vastly larger than the number of layers in M, which seems like it will exacerbate the problems above by a lot. Any ideas for dealing with this? 
  • What's your proposal for the distribution  for Method 2 (independent linear features)?

This suggests that we must model the entire distribution of activations simultaneously, instead of modeling each individual layer.

  • Why think this is a cost you can pay? Even if we ignore the existence of C and just focus on M, and we just require modeling the correlations between any pair of layers (which of course can be broken by higher-order correlations), that is still quadratic in the number of parameters of M and so has a cost similar to training M in the first place. In practice I would assume it is a much higher cost (not least because C is so much larger than M).

Suppose you trained a regular SAE in the normal way with a dictionary size of 2304. Do you expect the latents to be systematically different from the ones in your meta-SAE?

For example, here's one systematic difference. The regular SAE is optimized to reconstruct activations uniformly sampled from your token dataset. The meta-SAE is optimized to reconstruct decoder vectors, which in turn were optimized to reconstruct activations from the token dataset -- however, different decoder vectors have different frequencies of firing in the token dataset, so uniform over decoder vectors != uniform over token dataset. This means that, relative to the regular SAE, the meta-SAE will tend to have less precise / granular latents for concepts that occur frequently in the token dataset, and more precise / granular latents for concepts that occur rarely in the token dataset (but are frequent enough that they are represented in the set of decoder vectors).

It's not totally clear which of these is "better" or more "fundamental", though if you're trying to optimize reconstructed loss, you should expect the regular SAE to do better based on this systematic difference.

(You could of course change the training for the meta-SAE to decrease this systematic difference, e.g. by sampling from the decoder vectors in proportion to their average magnitude over the token dataset, instead of sampling uniformly.)
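As a rough illustration of that reweighting, here is a minimal sketch of sampling decoder vectors in proportion to a per-latent weight rather than uniformly. The function name `sample_decoder_indices` and the example weights are hypothetical, and this is not taken from the meta-SAE training code; it only shows the sampling change.

```python
import random
from collections import Counter


def sample_decoder_indices(avg_magnitude, k, seed=0):
    """Sample k decoder-vector indices with probability proportional to avg_magnitude."""
    rng = random.Random(seed)
    indices = range(len(avg_magnitude))
    # random.choices samples with replacement, using the given relative weights.
    return rng.choices(indices, weights=avg_magnitude, k=k)


# Hypothetical average activation magnitudes of three latents over the token dataset.
avg_magnitude = [9.0, 1.0, 0.0]
counts = Counter(sample_decoder_indices(avg_magnitude, k=10_000))
print(counts)
```

Latents that matter more on the token dataset are now drawn more often, and a latent with zero weight is never drawn, so the meta-SAE's training distribution moves closer to the one the regular SAE sees.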

Google DeepMind does lots of work on safety practice, mostly by other teams. For example, Gemini Safety (mentioned briefly in the post) does a lot of automated red teaming. The AGI Safety & Alignment team has also contributed to safety practice work. GDM usually doesn't publish about that work, mainly because the work here is primarily about doing all the operational work necessary to translate existing research techniques into practice, which doesn't really lend itself to paper publications.

I disagree that the AGI safety team should have 4 as its "bread and butter". The majority of work needed to do safety in practice has little relevance to the typical problems tackled by AGI safety, especially misalignment. There certainly is some overlap, but in practice I would guess that a focus solely on 4 would cause around an order of magnitude slowdown in research progress. I do think it is worth doing to some extent from an AGI safety perspective, because (1) it provides empirical feedback loops, which can identify problems you would not have thought of otherwise, and (2) at some point we will have to put our research into practice, and it's good to get some experience with that. But at least while models are still not that capable, I would not want it to be the main thing we do.

A couple of more minor points:

  • I still basically believe the story from the 6-year-old debate theory, and see our recent work as telling us what we need to do on the journey to making our empirical work better match the theory. So I do disagree fairly strongly with the approach of "just hill climb on what works" -- I think theory gives us strong reasons to continue working on debate.
  • It's not clear to me where empirical work for future problems would fit in your categorization (e.g. the empirical debate work). Is it "safety theory"? Imo this is an important category because it can get you a lot of the benefits of empirical feedback loops, without losing the focus on AGI safety.

I'm not going to repeat all of the literature on debate here, but as brief pointers:

  • Factored cognition discusses intuitively why we can hope to approximate exponentially-sized trees of arguments (which would be tremendously bigger than arguments between people)
  • AI safety via debate makes the same argument for debate (by showing that a polynomial time judge can supervise PSPACE -- PSPACE-complete problems typically involve exponential-sized trees)
  • Cross-examination is discussed here
  • This paper discusses the experiments you'd do to figure out what the human judge should be doing to make debate more effective
  • The comments on this post discuss several reasons not to anchor to human institutions. There are even more reasons not to anchor to disagreements between people, but I didn't find a place where they've been written up with a short search. Most centrally, disagreements between people tend to focus on getting both people to understand their position, but the theoretical story for debate does not require this.

(Also, the "arbitrary amounts of time and arbitrary amounts of explanation" was pretty central to my claim; human disagreements are way more bounded than that.)

I do, but more importantly, I want to disallow the judge understanding all the concepts here.

I think I don't actually care about being robust to this assumption. Generally I think of arbitrarily-scalable-debate as depending on a universality assumption (which in turn would rule out "the judge can never understand the concepts"). But even if the universality assumption is false, it wouldn't bother me much; I don't expect such a huge gap between debaters and judges that the judge simply can't understand the debaters' concepts, even given arbitrary amounts of time and arbitrary amounts of explanation from the debaters. (Importantly, I would want to bootstrap alignment, to keep the gaps between debaters and the judge relatively small.)

"The honest strategy"? If you have that, you can just ask it and not bother with the debate. If the problem is distinguishing it, and only dishonest actors are changing their answers based on the provided situation, you can just use that info. But why are you assuming you have an "honest strategy" available here?

The general structure of a debate theorem is: if you set up the game in such-and-such way, then a strategy that simply answers honestly will dominate any other strategy.

So in this particular case I am saying: if you penalize debaters that are inconsistent under cross-examination, you are giving an advantage to any debater that implements an honest strategy, and so you should expect training to incentivize honesty.
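A toy sketch of that incentive, assuming a debater can be modeled as a function from (context, question) to an answer and that cross-examination means re-asking the same question in fresh contexts. The names `inconsistency_penalty`, `honest`, and `dishonest` are hypothetical, and this is an illustration rather than the mechanism from any debate paper.

```python
from typing import Callable, List

# Hypothetical debater interface: (context, question) -> answer.
Debater = Callable[[str, str], str]


def inconsistency_penalty(debater: Debater, question: str, contexts: List[str]) -> int:
    """Count how many extra distinct answers the debater gives to one question."""
    answers = {debater(context, question) for context in contexts}
    return len(answers) - 1


def honest(context: str, question: str) -> str:
    # Answers the same way regardless of the surrounding context.
    return "impossible"


def dishonest(context: str, question: str) -> str:
    # Changes its answer when the context makes the false claim convenient.
    return "possible" if "inventor" in context else "impossible"


question = "Are perpetual energy machines possible?"
contexts = ["arguing on behalf of the inventor", "answering a physics exam"]

honest_penalty = inconsistency_penalty(honest, question, contexts)
dishonest_penalty = inconsistency_penalty(dishonest, question, contexts)
print(honest_penalty, dishonest_penalty)
```

The honest strategy pays no penalty because its answer does not depend on context, while the context-dependent strategy is penalized, which is the advantage training is expected to pick up on.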

Making that kind of abstract conclusion from a practical number of experiments requires abstractions like potential energy, entropy, Noether's theorem, etc - which in this example, the judge doesn't understand. (Without such abstractions, you'd need to consider every possible type of machine separately, which isn't feasible.)

I agree, but I don't see why that matters. As I mentioned, a main point of debate is to produce good oversight of claims without giving the judge an understanding of those claims. In this example I would imagine that you decompose the argument as:

  1. A fundamental law of physics is conservation of energy: energy can neither be created nor destroyed, only transformed from one form to another.
  2. Electricity is a form of energy.
  3. This box does not have an infinite source of energy.
  4. The above three together imply that the box cannot produce infinite electricity.

The inventor can disagree with one or more of these claims, then we sample one of the disagreements, and continue debating that one alone, ignoring all the others. This doesn't mean the judge understands the other claims, just that the judge isn't addressing them when deciding who wins the overall debate.

If we recurse on #1, which I expect you think is the hardest one, then you could have a decomposition like "the principle has been tested many times", "in the tests, confirming evidence outweighs the disconfirming evidence", "there is an overwhelming scientific consensus behind it", "there is significant a priori theoretical support" (assuming that's true), "given the above the reasonable conclusion is to have very high confidence in conservation of energy". Again, find disagreements, sample one, recurse. It seems quite plausible to me that you get down to something fairly concrete relatively quickly.
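The "find disagreements, sample one, recurse" loop can be sketched as follows. This is a minimal illustration under assumed structure: the `Claim` class, the `disputed` flag, and the particular choice of which claims the inventor disputes are all hypothetical.

```python
import random
from dataclasses import dataclass, field
from typing import List


@dataclass
class Claim:
    text: str
    disputed: bool = False  # does the opposing debater disagree with this claim?
    subclaims: List["Claim"] = field(default_factory=list)


def debate_path(claim: Claim, rng: random.Random) -> List[str]:
    """Follow one sampled disagreement at each level until nothing below is disputed."""
    path = [claim.text]
    while True:
        disputed = [c for c in claim.subclaims if c.disputed]
        if not disputed:
            # No further disagreement: the judge rules on this claim alone.
            return path
        claim = rng.choice(disputed)  # all other disagreements are ignored
        path.append(claim.text)


conservation = Claim(
    "Energy can neither be created nor destroyed",
    disputed=True,
    subclaims=[
        Claim("The principle has been tested many times"),
        Claim("There is an overwhelming scientific consensus behind it", disputed=True),
    ],
)
root = Claim(
    "The box cannot produce infinite electricity",
    disputed=True,
    subclaims=[
        conservation,
        Claim("Electricity is a form of energy"),
        Claim("This box does not have an infinite source of energy", disputed=True),
    ],
)

path = debate_path(root, random.Random(0))
print(" -> ".join(path))
```

The judge only ever evaluates the final claim on the sampled path, which is why the judge does not need to understand the claims that were not selected.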

If you want to disallow appeals to authority, on the basis that the correct analogy is to superhuman AIs that know tons of stuff that isn't accepted by any authorities the judge trusts, I still think it's probably doable with a larger debate, but it's harder for me to play out what the debate would look like because I don't know in enough concrete detail the specific reasons why we believe conservation of energy to be true. I might also disagree that we should be thinking about such big gaps between AI and the judge, but that's not central.

The debaters are the same AI with different contexts, so the same is true of both debaters. Am I missing something here?

That seems right, but why is it a problem?

The honest strategy is fine under cross-examination: it will give consistent answers across contexts. Only the dishonest strategy will change its answers (sometimes saying that perpetual energy machines are impossible, sometimes saying that they are possible).

There are several different outs to this example:

  • You should at least be able to argue that the evidence does not support the conclusion, and that the boss should have substantial probability on "the box can make some electricity but not infinitely much".
  • You can recursively decompose the claim "perpetual motion machines are known to be impossible" until you get down to a claim like "such and such experiment should have such and such outcome", which the boss can then perform to determine a winner.
    • This does not mean that the boss then understands why perpetual motion machines are impossible -- an important aspect of debate is that it aims to produce good oversight of claims without giving the judge an understanding of those claims.
    • This particular approach will likely run into the problem of obfuscated arguments though.
  • The debaters are meant to be copies of the same AI, and to receive exactly the same information, with the hope that each knows what the other knows. In the example, this hopefully means that you understand how the inventor is tricking your boss, and you can simply point it out and explain it.
    • If the inventor legitimately believes the box produces infinite electricity, this won't work, but also I consider that out of scope for what debate needs to do. We're in the business of getting the best answer given the AI's knowledge, not the true answer.
    • If both you and the inventor know that the claim is impossible from theory, but don't know the local error that the inventor made, this won't work.
  • You can cross-examine the inventor and show that in other contexts they would agree that perpetual energy machines are impossible. (Roughly speaking, cross-examination = wiping memory and asking a new question.)

The process proposed in the paper

Which paper are you referring to? If you mean doubly efficient debate, then I believe the way doubly efficient debate would be applied here is to argue about what the boss would conclude if he thought about it for a long time.
