Raymond Arnold

LessWrong team member / moderator. I've been a LessWrong organizer since 2011, with roughly equal focus on the cultural, practical and intellectual aspects of the community. My first project was creating the Secular Solstice and helping groups across the world run their own version of it. More recently I've been interested in improving my own epistemic standards and helping others to do so as well.


It implies that AI control is organizationally simpler, because most applications can be made trivially controlled.

I didn't get this from the premises fwiw. Are you saying it's trivial because "just don't use your AI to help you design AI" (seems organizationally hard to me), or did you have particular tricks in mind?

Fwiw this doesn't feel like a super helpful comment to me. I think there might be a nearby one that's more useful, but this felt kinda coy for the sake of being coy.

Since this post was written, I feel like there's been a zeitgeist of "Distillation Projects." (I don't know how causal this post was; I think in some sense the ecosystem was ripe for a Distillation Wave.) But it seemed useful to think about how that wave played out.

Some of the results have been great. But many of the results have felt kinda meh to me, and I now have a bit of a flinch/ugh reaction when I see a post with "distillation" in its title.

Basically, good distillations are a highly skilled effort. It's sort of natural to write a distillation of something as part of your attempt to understand it, and upskill (I think I had previously advocated this sometimes). I think this was basically a reasonable thing to do, but, it did have the cumulative effect of decreasing the signal-noise ratio of distillations, since most people doing this aren't skilled yet.

None of that contradicts the claims of this post, which specifies various skills you need and recommends actually investing in those skills. (The title of this post is "call for Distillers", not "call for Distillations". I think a lot of what happened was things like "Distillation contests" and incorporating distillation into SERI MATS programming, etc., which doesn't automatically produce dedicated distillers.)


It's still unclear to me how well interpretability can scale and solve the core problems in superintelligence alignment, but this felt like a good/healthy incremental advance. I appreciated the exploration of feature splitting, beginnings of testing for universality, and discussion of the team's update against architectural approaches. I found this remark at the end interesting:

Finally, we note that in some of these expanded theories of superposition, finding the "correct number of features" may not be well-posed. In others, there is a true number of features, but getting it exactly right is less essential because we "fail gracefully", observing the "true features" at resolutions of different granularity as we increase the number of learned features in the autoencoder.

I don't think I've parsed the paper well enough to confidently endorse how well the paper justifies its claims, but it seemed to pass a bar that was worth paying attention to, for people tracking progress in the interpretability field.

One comment I'd note is that I'd have liked more information about how the feature interpretability worked in practice. The description is fairly vague. When reading this paragraph:

We see that features are substantially more interpretable than neurons. Very subjectively, we found features to be quite interpretable if their rubric value was above 8. The median neuron scored 0 on our rubric, indicating that our annotator could not even form a hypothesis of what the neuron could represent! Whereas the median feature interval scored a 12, indicating that the annotator had a confident, specific, consistent hypothesis that made sense in terms of the logit output weights.  

I certainly can buy that the features were a lot more interpretable than individual neurons, but I'm not sure, at the end of the day, how useful the interpretations of features were in absolute terms.

Curated, both for the OP (which nicely lays out some open problems and provides some good links towards existing discussion) and for the resulting discussion, which has had a number of longtime contributors to LessWrong-descended decision theory weighing in.

Curated. I liked both the concrete array of ideas coming from someone who has a fair amount of context, and the sort of background models I got from reading each of said ideas.


I feel somewhat skeptical about model organisms providing particularly strong evidence of how things will play out in the wild (at least at their earlier stages). But a) the latter stages do seem like reasonable evidence, and it still seems like a pretty good process to start with the earlier stages, b) I overall feel pretty excited about the question "how can we refactor the alignment problem into a format we can Do Science To?", and this approach seems promising to me.

plus-one-ing the impulse to "look for third options"

What background knowledge do you think this requires? If I know a bit about how ML and language models work in general, should I be able to reason this out from first principles (or from following a fairly obvious trail of "look up relevant terms and quickly get up to speed on the domain")? Or does it require some amount of pre-existing ML taste?

Also, do you have a rough sense of how long it took for MATS scholars?


I like that this went out and did some 'field work', and is clear about the process so you can evaluate how compelling to find it. I found the concept of a conflationary alliance pretty helpful. 

That said, I don't think the second half of the article argues especially well for a "consciousness conflationary alliance" existing. I did immediately think "oh this seems like a fairly likely thing to exist as soon as it's pointed out" (in particular given some recent discussion on why consciousness is difficult to talk about), but I think if it wasn't immediately intuitive to me, the second half of the post wouldn't really have convinced me.

Still, I like this post for object-level helping me realize how many ways people were using "consciousness", and giving me some gears to think about re: how rationality might get wonky around politics.
