Raymond Arnold

LessWrong team member / moderator. I've been a LessWrong organizer since 2011, with roughly equal focus on the cultural, practical and intellectual aspects of the community. My first project was creating the Secular Solstice and helping groups across the world run their own version of it. More recently I've been interested in improving my own epistemic standards and helping others to do so as well.

Comments

Ah yeah, sorry, I didn't mean to convey that. For now I've (clumsily) edited the original comment to be more clear.

Curated.

The overall point here seems true and important to me.

I think I either disagree with, or am agnostic about, some of the specific examples given in the Myth vs Reality section. I don't think they're load-bearing for the overall point. I may try to write those up in more detail later.

Curated. I found this a helpful way of carving up the AI safety space. 

I agree with Ryan Greenblatt's clarification in the comments that no, this doesn't mean we're completely safe if we can rule out Rogue Deployments, but it still seems like a useful model for reasoning about what kinds of failures are more or less likely.

[edit: oh, to clarify, I don't think Buck meant to imply that either in the original post, which goes out of its way to talk about catastrophes without rogue deployments. It just seemed like a confusion I expected some people to have]

It implies that AI control is organizationally simpler, because most applications can be made trivially controlled.

I didn't get this from the premises fwiw. Are you saying it's trivial because "just don't use your AI to help you design AI" (seems organizationally hard to me), or did you have particular tricks in mind?

Fwiw this doesn't feel like a super helpful comment to me. I think there might be a nearby one that's more useful, but this felt kinda coy for the sake of being coy.

Since this post was written, I feel like there's been a zeitgeist of "Distillation Projects." I don't know how causal this post was (I think in some sense the ecosystem was ripe for a Distillation Wave), but it seemed useful to think about how that wave played out.

Some of the results have been great. But many of the results have felt kinda meh to me, and I now have a bit of a flinch/ugh reaction when I see a post with "distillation" in its title.

Basically, good distillation is a highly skilled effort. It's sort of natural to write a distillation of something as part of your attempt to understand it and upskill (I think I had previously advocated this sometimes). I think this was basically a reasonable thing to do, but it did have the cumulative effect of decreasing the signal-to-noise ratio of distillations, since most people doing this aren't skilled yet.

None of that contradicts the claims of this post, which specifies various skills you need and recommends actually investing in those skills. (The title of this post is "call for Distillers", not "call for Distillations". I think a lot of what happened was things like "Distillation contests", incorporating distillation into SERI MATS programming, etc., which doesn't automatically produce dedicated distillers.)

Curated. 

It's still unclear to me how well interpretability can scale and solve the core problems in superintelligence alignment, but this felt like a good/healthy incremental advance. I appreciated the exploration of feature splitting, the beginnings of testing for universality, and the discussion of the team's update against architectural approaches. I found this remark at the end interesting:

Finally, we note that in some of these expanded theories of superposition, finding the "correct number of features" may not be well-posed. In others, there is a true number of features, but getting it exactly right is less essential because we "fail gracefully", observing the "true features" at resolutions of different granularity as we increase the number of learned features in the autoencoder.

I don't think I've parsed the paper well enough to confidently endorse how well the paper justifies its claims, but it seemed to pass a bar that was worth paying attention to, for people tracking progress in the interpretability field.

One comment I'd note is that I'd have liked more information about how the feature interpretability analysis worked in practice. The description is fairly vague. When reading this paragraph:

We see that features are substantially more interpretable than neurons. Very subjectively, we found features to be quite interpretable if their rubric value was above 8. The median neuron scored 0 on our rubric, indicating that our annotator could not even form a hypothesis of what the neuron could represent! Whereas the median feature interval scored a 12, indicating that the annotator had a confident, specific, consistent hypothesis that made sense in terms of the logit output weights.  

I certainly can buy that the features were a lot more interpretable than individual neurons, but I'm not sure, at the end of the day, how useful the interpretations of features were in absolute terms.

Curated, both for the OP (which nicely lays out some open problems and provides some good links towards existing discussion) as well as the resulting discussion which has had a number of longtime contributors to LessWrong-descended decision theory weighing in.

Curated. I liked both the concrete array of ideas coming from someone who has a fair amount of context, and the sort of background models I got from reading each of those ideas.

Curated.

I feel somewhat skeptical about model organisms providing particularly strong evidence of how things will play out in the wild (at least at their earlier stages). But a) the latter stages do seem like reasonable evidence, and it still seems like a pretty good process to start with the earlier stages, and b) I overall feel pretty excited about the question "how can we refactor the alignment problem into a format we can Do Science To?", and this approach seems promising to me.
