Raymond Arnold

LessWrong team member / moderator. I've been a LessWrong organizer since 2011, with roughly equal focus on the cultural, practical and intellectual aspects of the community. My first project was creating the Secular Solstice and helping groups across the world run their own version of it. More recently I've been interested in improving my own epistemic standards and helping others to do so as well.



It's still unclear to me how well interpretability can scale and solve the core problems in superintelligence alignment, but this felt like a good/healthy incremental advance. I appreciated the exploration of feature splitting, beginnings of testing for universality, and discussion of the team's update against architectural approaches. I found this remark at the end interesting:

Finally, we note that in some of these expanded theories of superposition, finding the "correct number of features" may not be well-posed. In others, there is a true number of features, but getting it exactly right is less essential because we "fail gracefully", observing the "true features" at resolutions of different granularity as we increase the number of learned features in the autoencoder.

I don't think I've parsed the paper well enough to confidently endorse how well the paper justifies its claims, but it seemed to pass a bar that was worth paying attention to, for people tracking progress in the interpretability field.

One comment I'd note is that I'd have liked more information about how the feature interpretability worked in progress. The description is fairly vague. When reading this paragraph:

We see that features are substantially more interpretable than neurons. Very subjectively, we found features to be quite interpretable if their rubric value was above 8. The median neuron scored 0 on our rubric, indicating that our annotator could not even form a hypothesis of what the neuron could represent! Whereas the median feature interval scored a 12, indicating that the annotator had a confident, specific, consistent hypothesis that made sense in terms of the logit output weights.  

I certainly can buy that the features were a lot more interpretable than individual neurons, but I'm not sure, at the end of the day, how useful the interpretations of features were in absolute terms.

Curated, both for the OP (which nicely lays out some open problems and provides some good links towards existing discussion) as well as the resulting discussion which has had a number of longtime contributors to LessWrong-descended decision theory weighing in.

Curated. I liked both the concrete array of ideas coming from someone who has a fair amount of context, and the sort of background models I got from reading each of said ideas.


I feel somewhat skeptical about model organisms providing particularly strong evidence of how things will play out in the wild (at least at their earlier stages). But a) the latter stages do seem like reasonable evidence, and it still seems like a pretty good process to start with the earlier stages, b) I overall feel pretty excited the question "how can we refactor the alignment problem into a format we can Do Science To?", and this approach seems promising to me.

What background knowledge do you think this requires? If I know a bit about how ML and language models work in general, should I be able to reason this out from first principles (or from following a fairly obvious trail of "look up relevant terms and quickly get up to speed on the domain?"). Or does it require some amount of pre-existing ML taste?

Also, do you have a rough sense of how long it took for MATS scholars?


I like that this went out and did some 'field work', and is clear about the process so you can evaluate how compelling to find it. I found the concept of a conflationary alliance pretty helpful. 

That said, I don't think the second half of the article argues especially well for a "consciousness conflationary alliance" existing. I did immediately think "oh this seems like a fairly likely thing to exist as soon as it's pointed out" (in particular given some recent discussion on why consciousness is difficult to talk about), but I think if it wasn't immediately intuitive to me the second-half-of-the-post wouldn't have really have convinced me.

Still, I like this post for object-level helping me realize how many ways people were using "consciousness", and giving me some gears to think about re: how rationality might get wonky around politics.

Curated. This seems like an important set of considerations for alignment researchers to think about.

meta note on tagging:

This post seemed to be on a topic that... surely there should be commonly used LW concept for, but I couldn't think of it. I tagged it "agent foundations" but feel like there should be something more specific.

ironically I missed this post when you first posted it

Load More