Lawrence Chan

I do AI Alignment research. Currently at ARC Evals, though I still dabble in grantmaking and interpretability in my spare time. 

I'm also currently on leave from my PhD at UC Berkeley's CHAI. 

Obligatory research billboard website: https://chanlawrence.me/

Sequences

(Lawrence's) Reflections on Research
[Redwood Research] Causal Scrubbing

Comments

Very small nitpick: I think you should at least add Alex Lyzhov, David Rein, Jacob Pfau, Salsabila Mahdi, and Julian Michael for the NYU Alignment Research Group; it's a bit weird to not list any NYU PhD students/RSs/PostDocs when listing people involved in NYU ARG. 

Both Alex Lyzhov and Jacob Pfau also post on LW/AF.

Expanding on this -- this whole area is probably best known as "AI Control", and I'd lump it under "Control the thing" as its own category. I'd also move Control Evals to this category, though someone at RR would know better than I.

Thanks for making this! I’ll have thoughts and nitpicks later, but this will be a useful reference!

Sure, though it seems too general or common to use a long word for it?

Maybe "linear intervention"?

Thanks!

Yeah, I think ELK is surprisingly popular in my experience amongst academics, though they tend to frame it in terms of partial observability (as opposed to the measurement tampering framing I often hear EA/AIS people use).

Thanks for writing this up! 

I'm curious about this:

I personally found the discussion useful for helping me understand what motivated some of the researchers I talked to. I was surprised by the diversity.

What motivated people in particular? What was surprising?

Minor clarifying point: Act-adds cannot be cast as ablations.

Sorry, "ablation" might be the wrong word here (though people use it anyways): the technique is to subtract/add/move along the discovered direction and see what happens to the outputs. It's possible there's a better or more standard word that I can't think of right now.
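To make that concrete, here's a minimal PyTorch sketch of the technique as I described it: add or subtract a scaled copy of the discovered direction at one layer (via a forward hook) and compare the model's outputs. The names `layer_module` and `direction` are placeholders, not anything from a specific paper.

```python
# Minimal sketch: intervene along a discovered direction in the residual stream
# and see how the outputs change. Assumes a HuggingFace-style PyTorch model;
# `layer_module` and `direction` are whatever layer/direction you actually found.
import torch

def run_with_direction(model, inputs, layer_module, direction, alpha=1.0):
    """Add `alpha * direction` to the given layer's output and return logits."""
    direction = direction / direction.norm()  # unit vector along the found direction

    def hook(module, inp, out):
        hidden = out[0] if isinstance(out, tuple) else out
        hidden = hidden + alpha * direction.to(hidden.dtype)
        return (hidden, *out[1:]) if isinstance(out, tuple) else hidden

    handle = layer_module.register_forward_hook(hook)
    try:
        with torch.no_grad():
            logits = model(**inputs).logits
    finally:
        handle.remove()
    return logits

# Compare alpha > 0 (add), alpha < 0 (subtract), and alpha = 0 (baseline)
# to see how the outputs move along the direction.
```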

Also, another example of an attempt at interp -> alignment would arguably be the model editing stuff following causal tracing in the ROME paper? 

This is why I'm pessimistic about most interpretability work. It just isn't focused enough.

Most of the "exploratory" interp work you suggest is trying to achieve an ambitious mechanistic understanding of models, which requires a really high degree of model understanding in general. They're not trying to solve particular concrete problems, and it seems unfair to evaluate them according to a different theory of change. If you're going to argue against this line of work, I think you should either argue that they're failing to achieve their theory of change, or that their theory of change is either doomed or useless.

So: do you think that ambitious mech interp is impossible? Do you think that current interp work is going the wrong direction in terms of achieving ambitious understanding? Or do you think that it'd be not useful even if achieved?

I agree that if your theory of change for interp goes through "interp solves a concrete problem like deception or sensor tampering or adversarial robustness", then you'd better just try to solve those concrete problems directly instead of improving interp in general. But I think the case for ambitious mech interp isn't terrible, so it's worth exploring and investing in anyways.

The only example of interpretability leading to novel alignment methods I know of is shard theory's recent activation additions work

There's a lot of interpretability work that performs act-add-like ablations to confirm that their directions are real, and ITI is basically act-adds, except the direction is computed from many examples rather than a single pair. But again, most mech interp people aren't aiming to use mech interp to solve a specific concrete problem you can exhibit on models today, so it seems unfair to complain that most of the work doesn't lead to novel alignment methods.
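For concreteness, here's a rough sketch (in generic PyTorch, assuming you've already cached activations at a fixed layer/position) of the difference between the two ways of getting a direction: a single contrastive pair (act-add style) versus a mean difference over many labelled examples (roughly the ITI-style computation).

```python
# Sketch of the two ways of computing a steering direction mentioned above.
# Shapes are assumptions: single activations are (d_model,), batched
# activations are (n_examples, d_model).
import torch

def pair_direction(act_pos: torch.Tensor, act_neg: torch.Tensor) -> torch.Tensor:
    """Act-add style: direction from one positive and one negative activation."""
    d = act_pos - act_neg
    return d / d.norm()

def mean_difference_direction(pos_acts: torch.Tensor, neg_acts: torch.Tensor) -> torch.Tensor:
    """ITI-style: difference of means over many examples (less noisy than one pair)."""
    d = pos_acts.mean(dim=0) - neg_acts.mean(dim=0)
    return d / d.norm()
```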

Glad to see that this work is out! 

I don't have much to say here, especially since I don't want to rehash the old arguments about the usefulness of prosaic adversarial ML research (I think it's worth working on, but the direct impact of the work is unclear). I do think most people in AIS agree that image advexes are challenging and generally unsolved, but those who disagree about the relevance of this line of research tend to question the implied threat model.
