I have written down a long list of alignment ideas that I’d be interested in working on. The ideas roughly boil down to “To make progress on alignment, we need to understand Deep Learning models and the process by which they arrive at their final parameters in much more detail than we currently do”.
Here is the link to the full version (comments are on, please don’t abuse them): https://docs.google.com/document/d/1AyuTphQ31rLHDtpZoEwEPb4fWbZna1H3hGx_YUACxk4/edit?usp=sharing
The rest of this post is an overview copied from the doc. Feedback is welcome.
By Science of DL, I roughly mean “better understanding DL systems and how they learn concepts”. The main goal is to propose a precise and testable hypothesis about a phenomenon in DL and then test and refine it until we are highly confident in its truth or falsehood. This hypothesis could be about how NNs behave on the neuron level, the circuit level, during training, during fine-tuning, etc. This research will almost surely include mechanistic interpretability at some point, but it is not limited to it. The refined statement after investigation can, but doesn’t have to, be mathematical in form, as long as it is unambiguous and testable, i.e. two people could agree on an experiment that would provide evidence for or against the statement and then run it.
The details would obviously differ from project to project, but at a high level I imagine it looking roughly like this:
The goal of this research is to understand DL systems as well as possible. This means there is no single clear goal by which we could judge our performance. However, I think there are some ways to test whether we have actually increased our understanding of different parts of the system. These include:
A better understanding of any part of the DL pipeline can also lead to an increase in dangerous capabilities. Essentially, whenever we understand a technology better, we can use that knowledge to make it more efficient or powerful.
I’m currently excited about this agenda and will likely explore some of the project ideas in the long doc in the near future. However, I’m still uncertain how promising I find the agenda compared to other approaches to alignment. Feedback and considerations are welcome.