Solving Interpretability Week

Logan Riggs

For original motivation, see solving corrigibility week. I'll state what I learned from corrigibility week, why I didn't post one last week, and the updated format for interpretability.

What I Noticed

The open call to co-work with my calendly link didn't work; I only met with people when I messaged them specifically. I'm changing my availability from "all day Saturday" to "two hours/day" to better fit others' schedules and to avoid a single point of failure (see next section). It also seems great to ask specific people to meet when I have a question about their research.

Going through previous work, it was good to write out my thoughts in the google doc, spruce them up, and post them as a comment on the original post. This also ties in with messaging people to meet when I feel there's a large disconnect or I'm very confused about their work.

The google doc was also less collaborative over the long term, possibly because there were no notifications when someone responded to what you wrote. So I'm moving the research direction part to the comment section here.

No Post Last Week

I committed to posting 3 of these this month, but didn't last week. I had a bad Saturday (which was my designated work day for the corrigibility post) and a bad week until I talked to Shay, my therapist, and worked out a plan (Shay is great and has free sessions for those working on alignment; highly recommend). I am not ashamed or embarrassed, but I do put less stock in my public commitments and am more aware of my failure modes. I personally still recommend people try (and possibly fail) ambitious projects.

I do want to write a post on corrigibility, but it's too time-consuming to both work out my own thoughts going through the literature and meet with people to understand their work & distill those conversations. Both are important and not mutually exclusive. Some possible solutions are to circle back to corrigibility next year or take two weeks per topic.

Format This Week

Here is the google doc for interpretability. In this post's comment section, there are top-level comments for:

Research directions you want discussed with any questions you have
Your meeting schedule for co-working and how you'd like to be contacted
Suggestions for future weeks or changes to the format

In the doc are:

Literature Review
Tasks to do for further research

Again, in the google doc, it is socially acceptable to write low-quality babble. In this post's comments, I also accept babbling/spit-balling and will not delete such comments.

Any suggestions for the format in future weeks? Or a criticism of the idea in general?

Do you want to co-work? Please include your availability and a way to contact you (I personally recommend calendly).

I'm available for co-working to discuss any post or potential project on interpretability, or if you'd like someone to bounce ideas off of. My calendly link is here, and I'm available at many times throughout the week. I won't take more than 2 meetings in a day; if a booking would exceed that, I'll email you within the day to reschedule.

I'm interested in trying a co-work call sometime but won't have time for it this week.

Thanks for sharing about Shay in this post. I had not heard of her before. What a valuable resource, and what a valuable way she's helping the cause of AI safety.

(As for contact, I check my LessWrong/Alignment Forum inbox for messages regularly.)

What are research directions you want discussed? Is there a framework or specific project you think would further transparency and interpretability?