Thomas Kwa

Doing alignment research with Vivek Hebbar's team at MIRI as well as independent projects.



Prediction market for whether someone will strengthen our results or prove something about the nonindependent case:

Downvoted: this is very far from a well-structured argument, and it doesn't give me intuitions I can trust either.

I'm fairly sure you can get a result something like "it's not necessary to put positive probability mass on two different functions that can't be distinguished by observing only s bits", so some functions can get zero probability, e.g. the XOR of all combinations of at least s+1 bits.

edit: The proof is easy. Let $f_1, f_2$ be two such indistinguishable functions that you place positive probability on, $F$ be a random variable for the function, and $F'$ be $F$ but with all probability mass for $f_1$ replaced by $f_2$. Then the $s$ observed bits have the same distribution under $F$ and $F'$. But this means $I(F'; \text{observations}) = I(F; \text{observations})$, and so the channel capacity is unchanged. You don't lose any channel capacity switching to $F'$.
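As a numerical sanity check of the argument, here is a toy sketch (the channel, the priors, and the `mutual_information` helper are illustrative, not from the original proof): two functions that induce the same observation distribution contribute to $I(F; \text{observations})$ only through their total probability mass, so moving all of $f_1$'s mass onto $f_2$ leaves the mutual information unchanged.

```python
import math
from collections import defaultdict

def mutual_information(prior, obs_given_f):
    # I(F;O) = sum over f,o of p(f) p(o|f) log2( p(o|f) / p(o) )
    p_o = defaultdict(float)
    for f, pf in prior.items():
        for o, po in obs_given_f[f].items():
            p_o[o] += pf * po
    mi = 0.0
    for f, pf in prior.items():
        for o, po in obs_given_f[f].items():
            if pf > 0 and po > 0:
                mi += pf * po * math.log2(po / p_o[o])
    return mi

# Toy channel over s=2 observed bits: f1 and f2 are indistinguishable
# (identical observation distributions), f3 is distinguishable from both.
obs_given_f = {
    "f1": {"00": 0.5, "01": 0.5},
    "f2": {"00": 0.5, "01": 0.5},   # same distribution as f1
    "f3": {"10": 1.0},
}

prior  = {"f1": 0.25, "f2": 0.25, "f3": 0.5}
prior2 = {"f1": 0.0,  "f2": 0.5,  "f3": 0.5}  # f1's mass moved to f2

print(mutual_information(prior, obs_given_f))   # same value both times
print(mutual_information(prior2, obs_given_f))
```

Both priors give the same mutual information, matching the claim that zeroing out one of an indistinguishable pair costs no channel capacity.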

  • Deep deceptiveness is not quite self-deception. I agree that there are some circumstances where defending against self-deception favors weight-based methods, but these circumstances seem uncommon.
  • I thought briefly about the Ilharco et al paper and am very impressed by it as well.
  • Thanks for linking to the resources.

I don't have enough time to reply in depth, but the factors in favor of weight vectors and of activation vectors both seem really complicated, and the balance still seems to favor activation vectors, though I have reasonably high uncertainty.

I think that to solve alignment, we need to develop our toolbox of techniques for "getting AI systems to behave in ways we choose": not in the sense of being friendly or producing economic value, but things that push towards whatever cognitive properties we need for a future alignment solution. We can make AI systems do some things we want (e.g. GPT-4 can answer questions using only words starting with "Q"), but we don't know how they do this in terms of internal representations of concepts, and current systems are not well-characterized enough that we can predict what they do far OOD. No other work I've seen quite matches the promise this post has in finding ways to exert fine-grained control over a system's internals; we now have a wide variety of concrete questions, like:

  • How do we find steering vectors for new behaviors, e.g. speaking French?
  • How do we make these techniques more robust?
  • What do steering vectors, especially multiple steering vectors, tell us about how the model combines concepts?
  • Can we decompose the effect of a prompt into steering vectors from simpler prompts, thereby understanding why complex prompts work?
  • Are the effects of steering vectors nonlinear for small coefficients? What does this mean about superposition?
  • What's the mechanism by which adding a steering vector with too large a coefficient breaks the model?
  • Adding steering vectors at different layers surely means you are intervening at different "stages of processing". What do the model's internal concepts look like at different stages?
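For concreteness, the kind of intervention these questions are about can be sketched in a few lines. This is a toy numpy stand-in, not the authors' actual setup: the two-layer network, the contrast pair, and the coefficient are all made up, but the shape of the technique (steering vector = activation difference on a contrast pair, added into a middle layer at inference time) is the same.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-layer network standing in for a stack of transformer blocks.
W1 = rng.normal(size=(8, 16))
W2 = rng.normal(size=(16, 4))

def forward(x, steer=None, coeff=0.0):
    h = np.tanh(x @ W1)            # activations at the chosen "layer"
    if steer is not None:
        h = h + coeff * steer      # activation addition at that layer
    return h @ W2

# Steering vector: difference of activations on a contrast pair of inputs
# (standing in for e.g. a "Love" prompt minus a "Hate" prompt).
x_pos, x_neg = rng.normal(size=8), rng.normal(size=8)
steer = np.tanh(x_pos @ W1) - np.tanh(x_neg @ W1)

x = rng.normal(size=8)
base    = forward(x)
steered = forward(x, steer=steer, coeff=3.0)
print(np.linalg.norm(steered - base))  # nonzero: the intervention changed the output
```

Most of the questions above are about how the output change scales and composes as you vary the layer, the coefficient, and the number of vectors added.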

Comparing this to other work, my sense is that

  • Intervening on activations is better than training (including RLHF), because this builds towards understanding systems rather than steering a black box with a black-box reward model, and for the reasons the authors claim.
  • Debate, although important, seems less likely to be a counterfactual, robust way to steer models. The original debate agenda ran into serious problems, and neither it nor the current Bowman agenda tells us much about the internals of models.
  • Steering a model with activation vectors is better than mechinterp (e.g. the IOI paper), because here you've proven you can make the AI do a wide variety of interesting things; plus, mechinterp is slow.
  • I'm not up to date on the adversarial training literature (maybe academia has produced something more impressive), but I think this is more valuable than the Redwood paper, which didn't have a clearly positive result. I'm glad people are working on adversarial robustness.
  • Steering the model using directions in activation space is more valuable than doing the same with weights, because in the future the consequences of cognition might be far removed from its weights (deep deceptiveness).

It's a judgement call whether this makes it the most impressive achievement, but I think this post is pretty clearly Pareto-optimal in a very promising direction. That said, I have a couple of reservations:

  • By "most impressive concrete achievement" I don't necessarily mean the largest single advance over SOTA. There have probably been bigger advances in the past (RLHF is a candidate), and the impact of ELK is currently unproven but will shoot to the top if mechanistic anomaly detection ever pans out.
  • I don't think we live in a world where you can just add a "be nice" vector to a nanotech-capable system and expect better consequences, again for deep deceptiveness-ish reasons. Therefore, we need advances in theory to convert our ability to make systems do things into true mastery of cognition.
  • I don't think we should call this "algebraic value editing", because it seems overly pretentious to say we're editing the model's values. We don't even know what values are! I don't think RLHF is editing values, in the sense that it does something different from even the weak version of instilling desires to create diamonds, and this seems even less connected to values. The only connection is that it's modifying something contextually activated, which is way too broad.
  • It's unclear that this works in a wide range of situations, or in the situations we need it to for future alignment techniques. The authors claim that cherry-picking was limited, but there are other uncertainties: when we need debaters that don't collude to mislead the judge, will we be able to use activation patching? What if we need an AI that doesn't self-modify to remove some alignment property?

This is the most impressive concrete achievement in alignment I've seen. I think this post reduces my p(doom) by around 1%, and I'm excited to see where all of the new directions uncovered lead.

Edit: I explain this view in a reply.

Edit 25 May: I now think RLHF is more impressive in terms of what we can get systems to do, but I still think activation editing has opened up more promising directions.

SGD has inductive biases, but we'd have to actually engineer them to get high gold reward rather than high proxy reward when only trained on the proxy. In the Gao et al paper, optimization and overoptimization happened at the same relative rate in RL as in conditioning, so I think the null hypothesis is that training does about as well as conditioning. I'm pretty excited about work that improves on that paper to get higher gold reward while only having access to the proxy reward model.
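The proxy-vs-gold gap here can be illustrated with a minimal best-of-n simulation. This is a toy sketch, not the Gao et al setup: I assume Gaussian gold reward and Gaussian reward-model error, and best-of-n selection standing in for optimization pressure.

```python
import numpy as np

rng = np.random.default_rng(0)

def best_of_n(n, trials=2000):
    # Each sample has a true "gold" reward; the proxy adds reward-model error.
    gold  = rng.normal(size=(trials, n))
    proxy = gold + rng.normal(size=(trials, n))
    picked = np.argmax(proxy, axis=1)              # optimize the proxy
    gold_picked = gold[np.arange(trials), picked].mean()
    gold_best   = gold.max(axis=1).mean()          # if we could select on gold
    return gold_picked, gold_best

for n in [1, 4, 16, 64]:
    gp, gb = best_of_n(n)
    print(f"n={n:3d}  gold(proxy-selected)={gp:.2f}  gold(gold-selected)={gb:.2f}")
```

With this light-tailed noise the proxy-selected gold reward keeps rising but falls further and further behind what selecting on gold would give; the point of the improvements I'd like to see is to close that gap while still only querying the proxy.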

I think the point still holds in the mainline shard theory world, which in my understanding uses reward shaping + interp to get an agent composed of shards that value proxies that more often correlate with high gold reward rather than high proxy reward, where we are selecting on something other than the reward itself. When the AI ultimately outputs a plan for alignment, why would it inherently value having the accurate plan, rather than inherently value misleading humans? I think we agree that it's because SGD has inductive biases and we understand them well enough to do directionally better than conditioning at constructing an AI that does what we want.

That section is even more outdated now. There's nothing on interpretability, Paul's work now extends far beyond IDA, etc. In my opinion it should link to some other guide.

This seems good if it could be done. But the original proposal was just a call for labs to individually pause their research, which seems really unlikely to work.

Also, the level of civilizational competence required to compensate labs seems to be higher than for other solutions. I don't think it's a common regulatory practice to compensate existing labs like this, and it seems difficult to work out all the details so that labs will feel adequately compensated. Plus there might be labs that irrationally believe they're undervalued. Regulations similar to the nuclear or aviation industry feel like a more plausible way to get slowdown, and have the benefit that they actually incentivize safety work.
