Authors' Contributions: Both authors contributed equally to this project as a whole. Evan did the majority of the implementation work, as well as the writing of this post. Megan was more involved at the beginning of the project and did the majority of the experiment design. While Megan did give some...
This paper came out about a week ago. I am not the author - it was published anonymously. I learned about the paper when JJ Hepburn shared it on Slack. I just thought it seemed potentially really important and I hadn't seen it discussed on this forum yet: Paper title:...
Back in April, Google AI announced SayCan, a project which integrated a language model (FLAN) with robotics to produce a robot that could follow instructions, for example cleaning up a mess in a kitchen. (LessWrong post from April which links to a tweet about SayCan.) This week Google...
This is the second post in the sequence “Interpretability Research for the Most Important Century”. The first post, which introduces the sequence, defines several terms, and provides a comparison to existing works, can be found here: Introduction to the sequence: Interpretability Research for the Most Important Century. Summary This post...
This is the first post in a sequence exploring the argument that interpretability is a high-leverage research activity for solving the AI alignment problem. This post contains important background context for the rest of the sequence. I'll give an overview of one of Holden Karnofsky's (2022) Important, actionable research questions...
Yesterday on the Weekly Alignment Research Coffee Time call, as people were sharing updates on their recent work, Vanessa Kosoy brought up her and Diffractor's post Infra-Bayesian physicalism: a formal theory of naturalized induction, which she was interested in getting feedback on. Vanessa seemed a bit disappointed/frustrated that this post...