Cross-posted from the EA Forum:

Recently I've been thinking about the pros and cons of working on near-term technical AI safety and assurance. This includes topics such as interpretability for near-term systems, generalizability / robustness, AI security, testing, verification, and the like.

Here are my own considerations so far:

(Note: In what follows I use the term Transformative AI (TAI) very loosely to mean any type of AI that has a decent chance of leading to a global catastrophe if safety challenges are not addressed first.)


Pros:

  1. Some approaches to these topics might actually turn out to work directly for TAI, especially where those approaches may not be pursued given the default trajectory (i.e., without EA intervention) of research from industry / government / academia.
  2. This kind of research directly helps create a set of tools, techniques, organizations, regulations, etc., that iteratively builds on itself in the way that technology tends to do, such that whenever TAI becomes a real problem we will already have solutions or the resources to quickly find solutions.
  3. Promoting this kind of research in industry / gov't / academia helps influence others in those communities to create a set of tools, techniques, organizations, regulations, etc., such that whenever TAI becomes a real problem we will already have solutions or the resources to quickly find solutions.
  4. Research into these topics fosters a broader concern for AI safety topics in the general public (either directly or as a side effect of researchers / gov't / etc. respecting those topics more), which could lead to public pressure on industry / gov't to develop solutions, and that may help mitigate risks from TAI.

(For whatever it's worth, my personal inside view leans towards 3 as the most plausibly important from an EA point of view.)


Cons:

  1. Research into these topics, if successful, would remove some very large barriers that are currently preventing AI from being deployed in many applications that would be extremely valuable to industry or government (including the military). Removing these barriers would dramatically increase the value of AI to industry and government, which would accelerate AI development in general, potentially leading to TAI arriving before we're ready for it.
  2. Research into these topics, if only partially successful, might remove enough barriers for industry / government to start deploying AI systems that eventually prove to be unsafe. Plausibly, those AI systems might become part of an ecosystem of other AIs which together have the potential to lead to a catastrophe (along the lines of Paul Christiano's "out with a whimper" or "out with a bang" scenarios).
  3. Dramatically increasing the value of AI could also potentially lead to arms races between corporations or governments, which could lead to one side or another cutting safety corners as they're developing TAI (races to the bottom).
  4. If you are concerned about lethal autonomous weapons (LAWs), then removing these barriers might greatly increase the chance that various governments deploy them. This is true even if you're not working for the government, since the government definitely follows industry developments pretty closely.


I'm also interested in how these pros and cons might change if you're doing research for large organizations (industry or government) that might plausibly have the capacity to eventually build TAI-type systems, but where the research you do will not be publicly available due to proprietary or secrecy reasons. If it makes a difference, let's assume that you're working at a place that is reasonably ethical (as corporations and governments go) and that is at least somewhat aware of AI ethics and safety concerns.

I think that in this situation you'd have both a reduction in the value of the pros (since your solutions won't spread beyond your organization, at least for some time) and in the potential damage of the cons (for the same reason). But it seems to me that the cons are still mostly there, and possibly made worse: The lowered barriers to deployment would still probably lead your organization to press its advantage, thereby increasing the market (or strategic) value of AI as perceived by competitors, thereby leading to more resources poured into AI research in general - only now the competition might not have all the best safety solutions available to it because they're proprietary.


I'm curious what others think about all this. I would also appreciate links to good previous discussions of these topics. The only one I know of at the moment is this post, which discusses some of these considerations but not all.


Samuel Dylan Martin


It depends somewhat on what you mean by 'near term interpretability'. If you apply that term to research into, for example, improving the stability of, and our ability to access, the 'inner world models' held by large opaque language models like GPT-3, then there's a strong argument that ML-based 'interpretability' research might be one of the best ways of directly working on alignment research.

And see this discussion for more:

Evan Hubinger: +1 I continue to think that language model transparency research is the single most valuable current research direction within the class of standard ML research, for similar reasons to what Eliezer said above.

Ajeya Cotra: Thanks! I'm also excited about language model transparency, and would love to find ways to make it more tractable as a research statement / organizing question for a field. I'm not personally excited about the connotations of transparency because it evokes the neuroscience-y interpretability tools, which don't feel scalable to situations when we don't get the concepts the model is using, and I'm very interested in finding slogans to keep researchers focused on the superhuman stuff.

So language model transparency/interpretability tools might be useful on the basis of pro 2) and also 1) to some extent, because they will help build tools for interpreting TAI systems and also help align them ahead of time.

1. Most importantly, the more we align systems ahead of time, the more likely that researchers will be able to put thought and consideration into new issues like treacherous turns, rather than spending all their time putting out fires.

2. We can build practical know-how and infrastructure for alignment techniques like learning from human feedback.

3. As the world gets progressively faster and crazier, we’ll have better AI assistants helping us to navigate the world.

4. It improves our chances of discovering or verifying a long-term or “full” alignment solution.