This seems like really great work, nice job! I'd be excited to see more empirical work around inner alignment.
One of the things I really like about this work is the cute videos that clearly demonstrate 'this agent is doing dumb stuff because its objective is non-robust'. Have you considered putting shorter clips of some of the best bits on YouTube, or making GIFs? (E.g., a 5-10 second clip of the CoinRun agent during training, followed by a 5-10 second clip of the CoinRun agent during testing.) It seemed that one of the major strengths of the CoastRunners clip was how easily shareable and funny it was, and I could imagine this research getting more exposure if it's easier to share highlights. I found the Google Drive pretty hard to navigate.
One or two people suggested adding links to interesting papers that I wouldn't have time to summarize. I actually used to do this when the newsletter first started, but it seemed like no one was clicking on those links, so I stopped. I'm pretty sure that would still be the case now, so I'm not planning to restart that practice.
A possible experiment: frame this as a 'request for summaries': link to the papers you won't get around to, but offer to publish any sufficiently good summary of those papers that someone sends you in a future newsletter.
Also, damn! I really like the long summaries, and would be sad to see them go (though obviously you should listen to a survey of 66 people over my opinion)
It's not exactly clear what you do with such a story or what the upside is; it's a somewhat vague theory of change, and most people have a specific theory of change they're more excited about (even if this kind of story is a bit of a public good, useful across a broader variety of perspectives and to people who are skeptical).
Ah, interesting! I'm surprised to hear that. I was under the impression that while many researchers had a specific theory of change, it was often motivated by an underlying threat model, and that different threat models lead to different research interests.
E.g., someone worried about a future where AIs control the world but are not human-comprehensible feels very different from someone worried about a world where we produce an expected utility maximiser with a subtly incorrect objective, resulting in bad convergent instrumental goals.
Do you think this is a bad model of how researchers think? Or are you, e.g., arguing that having a detailed, concrete story isn't important here, just a vague intuition for how AI goes wrong?
What's the engine game?
What research in the past 5 years has felt like the most significant progress on the alignment problem? Has any of it made you more or less optimistic about how easy the alignment problem will be?
Do you have any advice for junior alignment researchers? In particular, what do you think are the skills and traits that make someone an excellent alignment researcher? And what do you think someone can do early in a research career to be more likely to become an excellent alignment researcher?
What is your theory of change for the Alignment Research Center? That is, what are the concrete pathways by which you expect the work done there to systematically lead to a better future?
There has been surprisingly little written on concrete threat models for how AI leads to existential catastrophes (though you've done some great work rectifying this!). Why is this? And what are the most compelling threat models that don't have good public write-ups? In particular, are there under-appreciated threat models that would lead to very different research priorities within Alignment?
Pre-hindsight: 100 years from now, it is clear that your research has been net bad for the long-term future. What happened?
You seem to be in the unusual position of having done excellent conceptual alignment work (e.g. with IDA) and excellent applied alignment work at OpenAI, which I'd expect to require pretty different skillsets. How did you end up doing both? And how useful have you found ML experience for doing good conceptual work, and vice versa?