Lawrence Chan

I do AI Alignment research. Currently at ARC Evals, though I still dabble in grantmaking and interpretability in my spare time. 

I'm also currently on leave from my PhD at UC Berkeley's CHAI. 

Obligatory research billboard website: https://chanlawrence.me/

Sequences

(Lawrence's) Reflections on Research
[Redwood Research] Causal Scrubbing

Comments

Thanks!

Yeah, I think ELK is surprisingly popular in my experience amongst academics, though they tend to frame it in terms of partial observability (as opposed to the measurement tampering framing I often hear EA/AIS people use).

Thanks for writing this up! 

I'm curious about this:

I personally found the discussion useful for helping me understand what motivated some of the researchers I talked to. I was surprised by the diversity.

What motivated people in particular? What was surprising?

Minor clarifying point: Act-adds cannot be cast as ablations.

Sorry, ablation might be the wrong word here (but people use it anyways): the technique is to subtract/add/move along the discovered direction and see what happens to the outputs. It's possible there's a better or standard word that I can't think of right now.
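
(For concreteness, here's a minimal sketch of what I mean, in plain PyTorch; the model/layer names in the usage comments are hypothetical placeholders, and this is my own illustration rather than any particular paper's implementation.)

```python
# Minimal sketch of "shift activations along a discovered direction and see
# what happens to the outputs", via a forward hook on one layer.
import torch

def make_direction_hook(direction: torch.Tensor, alpha: float):
    """Return a forward hook that adds alpha * (unit-norm) direction to a layer's output."""
    unit = direction / direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        # direction has shape (d_model,), broadcasts over (batch, seq, d_model)
        shifted = hidden + alpha * unit.to(device=hidden.device, dtype=hidden.dtype)
        return (shifted,) + output[1:] if isinstance(output, tuple) else shifted

    return hook

# Hypothetical usage (HuggingFace-style decoder, layer index chosen arbitrarily):
# handle = model.transformer.h[8].register_forward_hook(make_direction_hook(direction, alpha=5.0))
# steered_logits = model(**inputs).logits   # outputs with the direction added
# handle.remove()                           # restore the unmodified model
# Repeat with alpha < 0 (subtract) or alpha = 0 (baseline) and compare outputs.
```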

Also, another example of an attempt at interp -> alignment would arguably be the model editing stuff following causal tracing in the ROME paper? 

This is why I'm pessimistic about most interpretability work. It just isn't focused enough

Most of the "exploratory" interp work you point to is trying to achieve an ambitious mechanistic understanding of models, which requires a really high degree of model understanding in general. They're not trying to solve particular concrete problems, and it seems unfair to evaluate them according to a different theory of change. If you're going to argue against this line of work, I think you should either argue that they're failing to achieve their theory of change, or that their theory of change is doomed or useless.

So: do you think that ambitious mech interp is impossible? Do you think that current interp work is going in the wrong direction in terms of achieving ambitious understanding? Or do you think that it wouldn't be useful even if achieved?

I agree that if your theory of change for interp goes through "interp solves a concrete problem like deception or sensor tampering or adversarial robustness," then you'd be better off just trying to solve those concrete problems directly instead of improving interp in general. But I think the case for ambitious mech interp isn't terrible, and so it's worth exploring and investing in anyways.

The only example of interpretability leading to novel alignment methods I know of is shard theory's recent activation additions work

There's a lot of interpretability work that performs act-add-like ablations to confirm that the directions it finds are real, and ITI is basically act-adds, except that the direction is computed from many examples instead of just a single contrast pair. But again, most mech interp people aren't aiming to use mech interp to solve a specific concrete problem you can exhibit on models today, so it seems unfair to complain that most of the work doesn't lead to novel alignment methods.
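
(To illustrate the "many examples instead of just a pair" point, a rough sketch is below; `collect_activations` is a hypothetical helper, and ITI itself works per-attention-head with probes to select which heads to shift, so this is only the simplified mean-difference version of the idea.)

```python
# Rough sketch: compute a steering direction from many examples rather than a
# single contrast pair, as the difference of mean activations between two sets.
# collect_activations is a hypothetical helper returning an (n, d_model) tensor
# of activations at some fixed layer/position.
import torch

def mean_difference_direction(pos_acts: torch.Tensor, neg_acts: torch.Tensor) -> torch.Tensor:
    """pos_acts: (n_pos, d_model), neg_acts: (n_neg, d_model) -> unit direction (d_model,)."""
    direction = pos_acts.mean(dim=0) - neg_acts.mean(dim=0)
    return direction / direction.norm()

# direction = mean_difference_direction(
#     collect_activations(model, positive_prompts),
#     collect_activations(model, negative_prompts),
# )
# The single contrast-pair act-add is just the n_pos = n_neg = 1 special case.
```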

Glad to see that this work is out! 

I don't have much to say here, especially since I don't want to rehash the old arguments about the usefulness of prosaic adversarial ML research. (I think it's worth working on but the direct impacts of the work are unclear). I do think that most people in AIS agree that image advexes are challenging and generally unsolved, but the people who disagree on the relevance of this line of research tend to question the implied threat model. 

The main funders are LTFF, SFF/Lightspeed/other S-process stuff from Jaan Tallinn, and Open Phil. LTFF is the main one that solicits independent researcher grant applications.

There are a lot of orgs. Off the top of my head, there are Anthropic/OpenAI/GDM as the scaling labs with decent-sized alignment teams, and then a bunch of smaller/independent orgs:

  • Alignment Research Center
  • Apollo Research
  • CAIS
  • CLR
  • Conjecture
  • FAR
  • Orthogonal
  • Redwood Research

And there's always academia.

(I'm sure I'm missing a few though!)

(EDIT: added in RR and CLR)

I think this has gotten both worse and better in several ways.

It's gotten better in that ARC and Redwood (and to a lesser extent, Anthropic and OpenAI) have put out significantly more of their research. FAR Labs also exists now and is doing some of the research proliferation that would've gone on inside of Constellation.

It's gotten worse in that there's been some amount of deliberate effort to build more of an AIS community inside Constellation, e.g. with explicit Alignment Days where people are encouraged to present work-in-progress, plus additional fellowships and workshops.

On net I think it's gotten better, mainly because there's just been a lot more content put out in 2023 (per unit research) than in 2022. 

I suspect the underfitting explanation is probably a lot of what's going on given the small models used by the authors. But in the case of larger, more capable models, why would you expect it to be underfitting instead of generalization (properly fitting)? 

Thanks for posting this, this seems very correct. 
