Disclaimer: I work as a researcher at Anthropic, but this post entirely represents my own views, not those of my employer.
I’ve spent the past two years getting into the field of AI Safety. One important message I heard as I was entering the field was that I needed to “form an inside view about AI Safety”, that I needed to form my own beliefs and think for myself rather than just working on stuff because people smarter than me cared about it. And this was incredibly stressful! I think the way I interpreted this was pretty unhealthy, caused me a lot of paralysing uncertainty and anxiety, and almost caused me to give up on getting into the field. But I feel like I’ve now reached a point I’m comfortable with, and where I somewhat think I have my own inside views on things and understand how to form them.
In this post, I try to explain the traps I fell into and why, what my journey actually looked like, and my advice for how to think about inside views, now that I’ve seen what not to do. This is a complex topic and I think there are a lot of valid perspectives, but hopefully my lens is novel and useful for some people trying to form their own views on confusing topics (AI Safety or otherwise)! (Note: I don’t discuss why I do now think AI Safety is important and worth working on - that’s a topic for a future post!)
First, some context to be clear about what I mean by inside views. As I understand it, this is a pretty fuzzily defined concept, but roughly means “having a clear model and argument in my head, starting from some basic and reasonable beliefs about the world, that gets me to a conclusion like ‘working on AI Safety is important’ without needing to rely on deferring to people”. This feels highly related to the concept of gears-level models. This is in contrast to outside views, or deferring to people, where the main reason I believe something is that smart people I respect believe it. In my opinion, there’s a general vibe in the rationality community that inside views are good and outside views are bad (see Greg Lewis’ In Defence of Epistemic Modesty for a good argument for the importance of outside views and deferring!). Note that this is not the Tetlockian sense of the words, used in forecasting, where outside view means ‘look up a base rate’ and inside view means ‘use my human intuition, which is terribly calibrated’, and where the standard wisdom is outside view > inside view.
Good examples of this kind of reasoning: Buck Shlegeris’ My Personal Cruxes for Working on AI Safety, Richard Ngo’s AGI Safety from First Principles, Joseph Carlsmith’s report on Existential Risk from Power-Seeking AI. Note that, while these are all about the question of ‘is AI Safety a problem at all’, the notion of an inside view also applies well to questions like ‘de-confusion research/reinforcement learning from human feedback/interpretability is the best way to reduce existential risk from AI’, arguing for specific research agendas and directions.
I’m generally a pretty anxious person and bad at dealing with uncertainty, and sadly, this message resulted in a pretty unhealthy dynamic in my head. It felt like I had to figure out for myself the conclusive truth of ‘is AI Safety a real problem worth working on’ and which research directions were and were not useful, so I could then work on the optimal one. And that it was my responsibility to do this all myself, that it was bad and low-status to work on something because smart people endorsed it.
This was hard and overwhelming because there are a lot of agendas, and a lot of smart people with different and somewhat contradictory views. So this felt basically impossible. But it also felt like I had to solve this before I started any permanent research position (i.e. by the time I graduated), in case I screwed up and worked on something sub-optimal. So I had to solve a problem that, empirically, most smart people must be getting wrong, and do it all before graduating. This seemed basically impossible, and created a big ugh field around exploring AI Safety - which was already pretty aversive, because it involved re-skilling, deciding between a range of different paths like PhDs vs going straight into industry, and generally lacked a clean route in.
So, what actually happened to me? I started taking AI Safety seriously in my final year of undergrad. At the time, I bought the heuristic arguments for AI Safety (like, something smarter than us is scary), but didn’t really know what working in the field looked like beyond ‘people at MIRI prove theorems I guess, and I know there are people at top AI labs doing safety stuff?’ I started talking to lots of people who worked in the field, and gradually got data on what was going on. This was all pretty confusing and stressful, and was competing with going into quant finance - a safe, easy, default path that I already knew I’d enjoy.
After graduating, I realised I had a lot more flexibility than I thought. I took a year out, and managed to finagle my way into doing three back-to-back AI Safety internships. The big update was that I could explore AI Safety without risking too much - I could always go back into finance in a year or two if it didn’t work out. I interned at FHI, DeepMind and CHAI - working on mathematical/theoretical safety work, empirical ML work on fairness and bias, and empirical interpretability work, respectively. I also did the AGI Fundamentals course, and chatted to a lot of people at the various orgs I worked at and at conferences. I tried to ask all the researchers I met about their theory of change for how their research actually matters. One thing that really helped me was chatting to a researcher at OpenAI who said that, when he started, he didn’t have clear inside views. But he’d formed them fairly organically over time, and just spending time thinking and being in a professional research environment was enough.
At the end of the year, I had several offers and ended up joining Anthropic to work on interpretability with Chris Olah. I wasn’t sure this was the best option, but I was really excited about interpretability, and it seemed like the best bet. A few months in, this was clearly a great decision and I’m really excited about the work, but it wouldn’t have been the end of the world if I’d decided the work wasn’t very useful or was a bad fit, and I expect I could have left within a few months without hard feelings. As I’ve done research and talked to Chris and other people here, I’ve started to form clearer views on what’s going on with interpretability and the theory of impact for it and Anthropic’s work, but there are still big holes in my understanding where I’m confused or deferring to people. And this is fine! I don’t think it’s majorly holding me back from having an impact in the short-term, and I’m forming clearer views with time.
I think there are four main reasons to care about forming inside views: motivation, research quality, impact, and community epistemics.
These are pretty different, and it’s really important to be clear about which reasons you care about! Personally, I mostly care about motivation > research quality = impact >> community epistemics.
I'm currently working to form my own models here. I'm not sure whether this post concretely helped me, but it's nice to see other people grappling with it.
One thing I notice is that this post is sort of focused on "developing inside views as a researcher, so you can do research." But an additional lens here is "Have an inside view so you can give advice to other people, or do useful community building, or build useful meta-tools for research, or fund research."
In my case I already feel like I have a solid inside view of "AGI is important", and "timelines might be short-ish", but am lacking in "okay, but what do we do about it?". And the most important problem (for me) is that as a LessWrong admin and curator of various in-person offices/retreats/projects, I'm not sure which specific projects to focus on fostering.
I have a general sense of "all the promising avenues should ideally be getting funding". But attention is still limited. Which things make sense to curate? Which things make sense to crosspost to the Alignment Forum? A retreat or office can only have so many people, and it matters not just whether the people are "good/promising" in some sense, but whether they have good intellectual chemistry.
I also have a sense that, working in a meta-org, some of the best meta-work is to provide concrete object-level-work for people to do.
Just a small remark
Open a blank google doc, set a one hour timer, and start writing out your case for why AI Safety is the most important problem to work on
Not "why", but "whether" is the first step. Otherwise you end up being a clever arguer.
No, "why" is correct. See the rest of the sentence:
Write out all the counter-arguments you can think of, and repeat
It's saying: assume it's correct, then assume it's wrong, and repeat. Clever arguers don't usually play devil's advocate against themselves.