Stephen Casper, firstname.lastname@example.org. Thanks to Alex Lintz and Daniel Dewey for feedback.
This is a reply to, but not an objection to, a recent post from Paul Christiano titled AI alignment is distinct from its near term applications. The post is fairly brief, and its key point is decently summed up by this excerpt:
I worry that companies using alignment to help train extremely conservative and inoffensive systems could lead to backlash against the idea of AI alignment itself. If such systems are held up as key successes of alignment, then people who are frustrated with them may end up associating the whole problem of alignment with “making AI systems inoffensive.”
I have no disagreements with this claim. But I would push back against the general notion that AI [existential] safety work is disjoint from near-term applications. Paul seems to agree with this:
We can develop and apply alignment techniques to these existing systems. This can help motivate and ground empirical research on alignment, which may end up helping avoid higher-stakes failures like an AI takeover.
This post argues for strongly emphasizing this point.
What do I mean by near-term applications? Any challenging problem involving consequential AI and society. Examples include:
I argue that working on these problems probably matters a lot for three reasons, the second and third of which bear on existential safety.
There are many important non-X-risks in this world, and any altruistically minded person should care about them. For the same reason we should care about health, wealth, development, and animal welfare, we should also care about making important near-term applications of narrow AI go well for people.
Imagine that we figured out ways to make near-term applications of AI go very well. I find it incredibly hard to imagine a world in which we did any of these things without developing a lot of useful technical tools and governance strategies that could be retooled or built on for higher-stakes problems later. Consider some examples.
See also this post.
AI safety and longtermism (AIS&L) have a lot of critics, and in the past year or so, they seem to have grown in number and profile. Many of them are people who work on, and care a lot about, near-term applications of AI. To some extent this is inevitable. Having an influential and disruptive agenda will inevitably lead to some pushback from competing ones. Haters are going to hate. Trolls are going to troll.
But AIS&L probably have more detractors than they should among people who should really be allies. Given how many forces in the world are making AI more risky, there shouldn't be conflict between groups of people who are working, in different ways, on making it go better. In what world could isolation and mutual dismissal between AIS&L people and people working on neartermist problems be helpful? There seem to be too many common adversaries and interests between the two groups for them not to be allies, especially for influencing AI governance. Having more friends and fewer detractors seems like it could only increase the political and research capital of the AIS&L community. There is also virtually no downside to being more popular.
I think that some negative press about AIS&L might be due to active or tacit dismissal of the importance of neartermist work by AIS&L people. Speaking for myself, I have had a number of conversations in the past few months with non-AIS&L people who seem sympathetic but have said they felt dismissed by the community, which has made them more hesitant to get involved. For this reason, we might stand to benefit a great deal from less parochialism and more friends.
Paul argues that
...companies using alignment to help train extremely conservative and inoffensive systems could lead to backlash against the idea of AI alignment itself.
But I think it is empirically, overwhelmingly clear that a much bigger source of "backlash against the idea of AI alignment itself" is the AIS&L community's failure to engage with more neartermist work.
Thanks for reading--constructive feedback is welcome.
Non X-risks from AI are still intrinsically important AI safety issues.
I want to push back on this - I think it's true as stated, but that emphasising it can be misleading.
Concretely, I think that there can be important near-term, non-X-risk AI problems that meet the priority bar to work on. But the standard EA mindset of importance, tractability and neglectedness still applies. And I think often near-term problems are salient and politically charged, in a way that makes these harder to evaluate.
I think these are most justified for problems involving products that are very widely used and where there is little corporate incentive to fix the issues (recommender system alignment is the most obvious example here)
I broadly agree with and appreciate the rest of this post though! And want to distinguish between "this is not a cause area that I think EAs should push on on the margin" and "this cause area does not matter" - I think work to make systems less deceptive, racist, and otherwise harmful seems pretty great.
No disagreements substance-wise. But I'd add that I think work to avoid scary autonomous weapons is likely at least as important as recommender systems. If this post's reason #1 were the only reason for working on neartermist AI stuff, then it would probably be like a lot of other very worthy but likely not top-tier impactful issues. But I see it as emphasis-worthy icing on the cake given #2 and #3.
Cool, agreed. Maybe my main objection is just that I'd have put it last, not first, but this is a nit-pick.
I think that some near-future applications of AI alignment are plausible altruistic top priorities. Moreover, even when people disagree with me about prioritization, I think that people who want to use AI to accomplish contemporary objectives are important users. It's good to help them, understand the difficulties they encounter, and so on, both to learn from their experiences and make friends.
So overall I think I agree with the most important claims in this post.
Despite that, I think it's important for me personally (and for ARC) to be clear about what I care about---both because having clear research goals is important for doing good research, and because there's a significant risk of inadvertently misleading people about my priorities and views.
As an example, I've recently been writing about mechanistic anomaly detection. But if people think that I mean the kind of anomaly detection that is most helpful for avoiding self-driving car crashes, I think that would lead to them evaluating my research inappropriately. Many of my technical decisions seem very weird on this perspective, and if I tried to satisfy a reviewer with that perspective I think it would push me in a very unhealthy direction. On the other side, this equivocation might lead some people to be more sympathetic to or excited about my work for reasons I regard as kind of dishonest.
This example is related to your post, but not really to my OP (since in my OP I'm mostly talking about contemporary applications of alignment, which are more closely technically connected to my day-to-day work). I have similar views about the other examples you give in the "valuable lessons" section:
This isn't to say that I don't think such work is meaningful or interesting, or that I don't want to be friends with people who do it. Just that it's not what I'm doing, and it's worth being clear about that so that my work is being evaluated in an appropriate way.
The other extreme would be deliberately muddying the waters about distinctions between different technical problems, and while I don't think anyone is necessarily advocating for that I do think it would be very unwise.
Hi Paul, thanks. It was nice reading this reply. I like your points here.
Some of what I say here might reveal a lack of keeping up well with ARC. But as someone who works primarily on interpretability, the thought of mechanistic anomaly detection techniques that are not useful for today's vision or language models is surprising to me. Is there anything you can point me to to help me understand why an interpretability/anomaly detection tool that's useful for ASI or something might not be useful for cars?
We are mostly thinking about interpretability and anomaly detection designed to resolve two problems (see here):
We do hope that these are just special cases and that our methods will resolve a broader set of problems, but these two special cases loom really large. On the other hand, it's much less clear whether realistic failures for self-driving cars will involve the kind of mechanism distinctions we are trying to detect.
Also: there are just way more pressing problems for self-driving cars. And on top of all that, we are taking a very theoretical approach precisely because we are worried it may be difficult to study these problems until a future time very close to when they become catastrophic.
Overall I think that if someone looked at our research agenda and viewed it as an attempt to respond to reliability failures in existing models, the correct reaction should be more like "Why are they doing all of this instead of just empirically investigating which failures are most important for self-driving cars and then thinking about how to address those?" There's still a case for doing more fundamental theoretical research even if you are interested in more prosaic reliability failures, but (i) it's qualitatively much worse and I don't really believe it, (ii) this isn't what such research should look like. So I think it's pretty bad if someone is evaluating us from that perspective.
(In contrast I think this is a more plausible framing for e.g. work on adversarial evaluation and training. It might still lead an evaluator astray, but at least it's a very plausible research direction to focus on for prosaic reliability as well as being something we might want to apply to future systems.)
We do hope that these are just special cases and that our methods will resolve a broader set of problems.
I hope so too. And I would expect this to be the case for good solutions.
Whether they are based on mechanistic interpretability, probing, other interpretability tools, adversaries, relaxed adversaries, or red-teaming, I would expect methods that are good at detecting goal misgeneralization or deceptive alignment to also be useful for self-driving cars and other issues. At the end of the day, any misaligned model will have a bug: some set of environments or inputs that will make it do bad things. So I struggle to think of an example of a tool that is good for finding insidious misaligning bugs but not others.
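To make the "bug" framing concrete, here is a minimal sketch of searching an input neighborhood for a failure, in the spirit of adversarial testing. Everything here is a toy assumption of mine (a hypothetical linear "policy" and a sign-following perturbation search), not a method anyone in this thread has proposed:

```python
# Toy "policy": approves an action when a linear score is positive.
WEIGHTS = [0.5, -1.2, 0.3]

def policy_approves(x):
    return sum(w * xi for w, xi in zip(WEIGHTS, x)) > 0

def find_bad_input(base, budget=1.0, steps=20):
    """Search a perturbation ball around `base` for an input the policy
    approves even though the unperturbed input was (correctly) rejected.
    Each coordinate is pushed in the direction that raises the score,
    i.e. along the sign of the corresponding weight."""
    for i in range(steps + 1):
        eps = budget * i / steps
        candidate = [xi + eps * (1 if w > 0 else -1)
                     for w, xi in zip(WEIGHTS, base)]
        if policy_approves(candidate):
            return candidate
    return None

base = [-1.0, 1.0, -1.0]        # the policy correctly rejects this input
adv = find_bad_input(base, budget=3.0)  # a nearby input it wrongly approves
```

The point of the toy is only that "misalignment implies some triggering input exists," so the same search machinery that surfaces insidious bugs should also surface mundane ones.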
So I'm inclined to underline the key point of my original post. I want to emphasize the value of (1) engaging more with the rest of the community who don't identify as "AI Safety" researchers and (2) being clear that we care about alignment for all of the right reasons. Though this should be done with the appropriate amount of clarity, which was your original point.
So I struggle to think of an example of a tool that is good for finding insidious misaligning bugs but not others.
One example: a tool designed to detect whether a model is being truthful (correctly representing what it knows about the world to the user), perhaps based on specific properties of truthfulness (for example, see https://www.alignmentforum.org/posts/L4anhrxjv8j2yRKKp/how-discovering-latent-knowledge-in-language-models-without), doesn't seem like it would be useful for improving the reliability of self-driving cars. Self-driving cars likely aren't misaligned in the sense that they could drive perfectly safely but choose not to; rather, they are just unable to drive perfectly safely because some of their internal (learned) systems aren't sufficiently robust.
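For readers unfamiliar with the linked method, here is a minimal sketch of the kind of objective it optimizes: a probe scores a statement and its negation, and is trained so the two scores are consistent (sum to one) and confident (not both 0.5). The loss below follows my reading of the linked post; the toy inputs are placeholders, and a real implementation would fit a probe on actual model activations:

```python
def ccs_loss(p_pos, p_neg):
    """CCS-style objective: p_pos[i] and p_neg[i] are the probe's
    'probability of truth' for statement i and its negation.
    Consistency pushes p_pos + p_neg toward 1; confidence penalizes
    the degenerate solution where both sit at 0.5."""
    consistency = sum((pp - (1 - pn)) ** 2 for pp, pn in zip(p_pos, p_neg))
    confidence = sum(min(pp, pn) ** 2 for pp, pn in zip(p_pos, p_neg))
    return (consistency + confidence) / len(p_pos)

# A confident, consistent probe scores near zero...
good = ccs_loss([0.9, 0.8], [0.1, 0.2])
# ...while the degenerate "always output 0.5" probe is penalized.
degenerate = ccs_loss([0.5, 0.5], [0.5, 0.5])
```

Note that nothing in the objective mentions truthfulness explicitly, which is exactly what the disagreement below turns on: whether improvements transfer to other probing targets or are specific to systems with a conception of truth.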
Thanks for the comment. I'm inclined to disagree though. The application was for detecting deceptiveness. But the method was a form of contrastive probing. And one could use a similar approach for significantly different applications.
This feels kind of like a semantic disagreement to me. To ground it, it's probably worth considering whether further research on the CCS-style work I posted would also be useful for self-driving cars (or other applications). I think that would depend on whether the work improves the robustness of the contrastive probing regardless of what is being probed for (which would be generically useful), or whether it improves the probing specifically for truthfulness in systems that have a conception of truthfulness, possibly by improving the constraints or adding additional constraints (less useful for other systems). I think both would be good, but I'm uncertain which would be more useful to pursue if one is motivated by reducing x-risk from misalignment.
I see your point here about generally strengthening methods versus specifically refining them for an application. I think this is a useful distinction. But I'm all about putting a good amount of emphasis on connections between this work and different applications. I feel like at this point, we have probably cruxed.
Not Paul, but some possibilities why ARC's work wouldn't be relevant for self-driving cars: