It's currently hard to know where to start when trying to get better at thinking about alignment. So below I've listed a few dozen exercises which I expect to be helpful. They assume a level of background alignment knowledge roughly equivalent to what's covered in the technical alignment track of the AGI Safety Fundamentals course. They vary greatly in difficulty: some are standard knowledge in ML, some are open research questions. I’ve given the exercises star ratings from * to *** for difficulty (note: not for length of time to complete, since many require reading papers before engaging with them). However, I haven't tried to solve them all myself, so the star ratings may be significantly off.
I've erred on the side of including exercises which seem somewhat interesting and alignment-related even when I'm uncertain about their value; when working through them, you should keep the question "is this actually useful? Why or why not?" in mind as a meta-exercise. This post will likely be updated over time to remove less useful exercises and add new ones.
I'd appreciate any contributions of:
These are intended less as exercises and more as pointers to open questions at the cutting edge of deep learning.
I particularly appreciate the questions that ask one to look at how a problem was reified/specified/ontologized in a particular domain and then ask for alternative specifications. I thought Superintelligence (2014) might be net harmful because it introduced a lot of such specifications that I then noticed were hard to think around. I think there is a subset of prompts from the online course/book Framestorming that might be useful there; I'll go see if I can find them.
Great list of interesting questions, trains of thought and project ideas in this post.
I was a little surprised not to find any exercises on interpretability. Perhaps there was a reason for excluding it, but if not, here is an idea for another exercise/group of exercises to include (perhaps it could be merged into the "Neural networks" section):
Suggestion on Agency 2.1: rephrase so that the "Before reading his post" part comes before the link to the post. I assume there'll otherwise be some overzealous link followers.
Curated. Exercises are crucial for the mastery of topics and the transfer of knowledge, and it's great to see someone coming up with them for the nebulous field of Alignment.
This post, by example, seems like a really good argument that we should spend a little more effort on didactic posts of this sort. E.g. rather than just saying "physical systems have multiple possible interpretations," we could point people to a post about a gridworld with a deterministic system playing the role of the agent, such that there are a couple different pretty-good ways of describing this agent that mostly agree but generalize in different ways.
This perspective might also be a steelmanning of that sort of paper where there's an abstract argument that does all the work, and then some code that tells you nothing new if you followed the abstract argument. The code (in the steelmanned story) isn't just there to make the paper be in the right literary genre or provide a semi-trustworthy signal that you're not a crank who makes bad abstract arguments; it's a didactic tool to help the reader do these sorts of exercises.
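As a rough illustration of the gridworld idea, here is a minimal sketch of my own, not taken from any existing post. It assumes a hypothetical one-dimensional corridor and two made-up descriptions of the same deterministic system, `goal_seeker` and `right_mover`; all names and the environment layout are assumptions chosen for illustration. The two descriptions predict identical behavior when the goal sits at the right end, and diverge once the goal is moved.

```python
# Hypothetical 1-D "gridworld": cells 0..n-1, actions are -1 (left) or +1 (right).

def goal_seeker(pos, goal):
    """Interpretation A: the system steps toward the goal cell."""
    return 1 if goal > pos else -1

def right_mover(pos, goal):
    """Interpretation B: the system steps right, ignoring the goal."""
    return 1

def agreement(n_cells, goal):
    """Fraction of non-goal cells where the two interpretations predict the same action."""
    states = [p for p in range(n_cells) if p != goal]
    same = sum(goal_seeker(p, goal) == right_mover(p, goal) for p in states)
    return same / len(states)

train_agreement = agreement(n_cells=5, goal=4)    # goal at the right end, as when first observed
shifted_agreement = agreement(n_cells=5, goal=0)  # goal moved to the left end
print(train_agreement, shifted_agreement)
```

With the goal at the right end the two descriptions agree on every cell; with the goal at the left end they disagree on every cell, so observations from the first setting alone cannot tell them apart.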
** Explain why cooperative inverse reinforcement learning doesn’t solve the alignment problem.
Feedback: I clicked through to the provided answer and had a great deal of difficulty understanding how it was relevant - it makes a number of assumptions about agents and utility functions and I wasn't able to connect it to why I should expect an agent trained using CIRL to kill me.
FWIW here's my alternative answer:
CIRL agents are bottlenecked on the human overseer's ability to provide them with a learning signal through demonstration or direct communication. This is unlikely to scale to superhuman abilities in the agent, so superintelligent agents simply will not be trained using CIRL.
In other words, it's only a solution to "Learn from Teacher" in Paul's 2019 decomposition of alignment, not to the whole alignment problem.